Summary:
Some of the errors like data race and heap-after-use are error out based on crash test reporting them as error by relying on stderr. So reverting back to original form unless we come up with a more reliable solution to error out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12335
Reviewed By: cbi42
Differential Revision: D53534781
Pulled By: akankshamahajan15
fbshipit-source-id: b19aa560d1560ac2281f7bc04e13961ed751f178
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
Summary:
# Summary
Following up jowlyzhang 's comment in https://github.com/facebook/rocksdb/issues/12283 .
- Remove `ARG_TTL` from help which is not relevant to `multi_get` command
- Treat NotFound status as non-error case for both `Get` and `MultiGet` and updated the unit test, `ldb_test.py`
- Print key along with value in `multi_get` command
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12332
Test Plan:
**Unit Test**
```
$>python3 tools/ldb_test.py
...
Ran 25 tests in 17.447s
OK
```
**Manual Run**
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000000D8 0x0000000000000009000000000000002678787878BEEF
0x0000000000000009000000000000012B00000000000000D8 ==> 0x47000000434241404F4E4D4C4B4A494857565554535251505F5E5D5C5B5A595867666564636261606F6E6D6C6B6A696877767574737271707F7E7D7C7B7A797807060504030201000F0E0D0C0B0A090817161514131211101F1E1D1C1B1A1918
Key not found: 0x0000000000000009000000000000002678787878BEEF
```
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex get 0x00000000000000090000000000
Key not found
```
Reviewed By: jowlyzhang
Differential Revision: D53450164
Pulled By: jaykorean
fbshipit-source-id: 9ccec78ad3695e65b1ed0c147c7cbac502a1bd48
Summary:
I've always found this name difficult to read, because it sounds like it's for collecting int(eger)
table properties.
I'm fixing this now to set up for a change that I have stubbed out in the public API (table_properties.h):
a new adapter function `TablePropertiesCollector::AsInternal()` that allows RocksDB-provided
TablePropertiesCollectors (such as CompactOnDeletionCollector) to implement the easier-to-upgrade
internal interface while still (superficially) implementing the public interface. In addition to added flexibility,
this should be a performance improvement as the adapter class UserKeyTablePropertiesCollector can be
avoided for such cases where a RocksDB-provided collector is used (AsInternal() returns non-nullptr).
table_properties.h is the only file with changes that aren't simple find-replace renaming.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12320
Test Plan: existing tests, CI
Reviewed By: ajkr
Differential Revision: D53336945
Pulled By: pdillinger
fbshipit-source-id: 02535bcb30bbfb00e29e8478af62e5dad50a63b8
Summary:
sst_dump --command=check can now compare number of keys in a file with num_entries in table property and reports corruption is there is a mismatch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12322
Test Plan:
- new unit test for API `SstFileDumper::ReadSequential`
- ran sst_dump on a good and a bad file:
```
sst_dump --file=./32316112.sst
options.env is 0x7f68bfcb5000
Process ./32316112.sst
Sst file format: block-based
from [] to []
sst_dump --file=./32316115.sst
options.env is 0x7f6d0d2b5000
Process ./32316115.sst
Sst file format: block-based
from [] to []
./32316115.sst: Corruption: Table property has num_entries = 6050408 but scanning the table returns 6050406 records.
```
Reviewed By: jowlyzhang
Differential Revision: D53320481
Pulled By: cbi42
fbshipit-source-id: d84c996346a9575a5a2ea5f5fb09a9d3ee672cd6
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311
Reviewed By: cbi42
Differential Revision: D53233379
Pulled By: ajkr
fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
Summary:
For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes:
1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block.
2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`.
3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. d69628e6ce/db/dbformat.h (L923-L935)
This max timestamp is removed when in memory `FileMetaData.largest` is persisted into Manifest, we pad it back when it's read from Manifest while handling related `VersionEdit` in `VersionEditHandler`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254
Test Plan: Added unit test and enabled this feature combination's stress test.
Reviewed By: cbi42
Differential Revision: D52965527
Pulled By: jowlyzhang
fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985
Summary:
While working on Meta's internal test triaging process, I found that `db_crashtest.py` was printing out `stdout` and `stderr` altogether. Adding an option to print `stderr` separately so that it's easy to extract only `stderr` from the test run.
`print_stderr_separately` is introduced as an optional parameter with default value `False` to keep the existing behavior as is (except a few minor changes).
Minor changes to the existing behavior
- We no longer print `stderr has error message:` and `***` prefix to each line. We simply print `stderr:` before printing `stderr` if stderr is printed in stdout and print `stderr` as is.
- We no longer print `times error occurred in output is ...` which doesn't appear to have any values
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12301
Test Plan:
**Default Behavior (blackbox)**
Run printed everything as is
```
$> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 2> /tmp/error.log
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_blackboxwh7yxpec]
(Re-)verified 0 unique IDs
2024/01/29-09:16:30 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:16:35 Starting database operations
2024/01/29-09:16:35 Starting verification
Stress Test : 543.600 micros/op 8802 ops/sec
: Wrote 0.00 MB (0.27 MB/sec) (50% of 10 ops)
: Wrote 5 times
: Deleted 1 times
: Single deleted 0 times
: 4 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 0 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
Verification failed :(
```
Nothing in stderr
```
$> cat /tmp/error.log
```
**Default Behavior (whitebox)**
Run printed everything as is
```
$> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 2> /tmp/error.log
Running whitebox-crash-test with
total-duration=10000
...
(Re-)verified 571 unique IDs
2024/01/29-09:33:53 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:35:16 Starting database operations
2024/01/29-09:35:16 Starting verification
Stress Test : 97248.125 micros/op 10 ops/sec
: Wrote 0.00 MB (0.00 MB/sec) (12% of 8 ops)
: Wrote 1 times
: Deleted 0 times
: Single deleted 0 times
: 4 read and 1 found the key
: Prefix scanned 1 times
: Iterator size sum is 120868
: Iterated 4 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 4865393
Verification failed :(
TEST FAILED. See kill option and exit code above!!!
```
Nothing in stderr
```
$> cat /tmp/error.log
```
**New option added (blackbox)**
```
$> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --print_stderr_separately 2> /tmp/error.log
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_blackbox7ybna32z]
(Re-)verified 0 unique IDs
Compaction filter factory: DbStressCompactionFilterFactory
2024/01/29-09:05:39 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:05:46 Starting database operations
2024/01/29-09:05:46 Starting verification
Stress Test : 235.917 micros/op 16000 ops/sec
: Wrote 0.00 MB (0.16 MB/sec) (16% of 12 ops)
: Wrote 2 times
: Deleted 1 times
: Single deleted 0 times
: 9 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 0 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
```
stderr printed separately
```
$> cat /tmp/error.log
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 19461571
Verification failed :(
```
**New option added (whitebox)**
```
$> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 --print_stderr_separately 2> /tmp/error.log
Running whitebox-crash-test with
total-duration=10000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_whiteboxtwj0ihn6]
(Re-)verified 157 unique IDs
2024/01/29-09:39:59 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:40:16 Starting database operations
2024/01/29-09:40:16 Starting verification
Stress Test : 742.474 micros/op 11801 ops/sec
: Wrote 0.00 MB (0.27 MB/sec) (36% of 19 ops)
: Wrote 7 times
: Deleted 1 times
: Single deleted 0 times
: 8 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 4 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
TEST FAILED. See kill option and exit code above!!!
```
stderr printed separately
```
$> cat /tmp/error.log
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
Error : jewoongh injected test error This is not a real failure.
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 4865393
Verification failed :(
```
Reviewed By: akankshamahajan15
Differential Revision: D53187491
Pulled By: jaykorean
fbshipit-source-id: 76f9100d08b96d014e41b7b88b206d69f0ae932b
Summary:
Right now crash_test relies on std::errors too to check for only errors/failures along with verification. However, that's not a reliable solution and many internal services logs benign errors/warnings in which case our test script fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12265
Test Plan: Keep std::errors but printout instead of failing and will monitor crash tests internally to see if there is any scenario which solely relies on std::error, in which case stress tests can be improve.
Reviewed By: ajkr, cbi42
Differential Revision: D52967000
Pulled By: akankshamahajan15
fbshipit-source-id: 5328c8b69480c7946fe6a9c72f9ffeede70ac2ad
Summary:
The current implementation of the ldb_cmd tool involves commenting out the user-passed column_family_descriptors, resulting in the tool consistently constructing its column_family_descriptors from the pre-existing OPTIONS file.
The proposed fix prioritizes user-passed column family descriptors, ensuring they take precedence over those specified in the OPTIONS file. This modification enhances the tool's adaptability and responsiveness to user configurations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12261
Reviewed By: cbi42
Differential Revision: D52965877
Pulled By: ajkr
fbshipit-source-id: 334a83a8e1004c271b19e7ca09381a0e7cf87b03
Summary:
While investigating test failures due to the inconsistency between `Get()` and `MultiGet()`, I realized that LDB currently doesn't support `MultiGet()`. This PR introduces the `MultiGet()` support in LDB. Tested the command manually. Unit test will follow in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12283
Test Plan:
When key not found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB
Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
```
Compare the same key with get
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex get 0x0000000000000009000000000000012B00000000000002AB
Failed: Get failed: NotFound:
```
Multiple keys not found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB 0x0000000000000009000000000000012B00000000000002AC Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
Status for key 0x0000000000000009000000000000012B00000000000002AC: NotFound:
```
One of the keys found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB 0x00000000000000090000000000000026787878787878
Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
0x22000000262724252A2B28292E2F2C2D32333031363734353A3B38393E3F3C3D02030001060704050A0B08090E0F0C0D12131011161714151A1B18191E1F1C1D
```
All of the keys found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000000D8 0x00000000000000090000000000000026787878787878 15:57:03
0x47000000434241404F4E4D4C4B4A494857565554535251505F5E5D5C5B5A595867666564636261606F6E6D6C6B6A696877767574737271707F7E7D7C7B7A797807060504030201000F0E0D0C0B0A090817161514131211101F1E1D1C1B1A1918
0x22000000262724252A2B28292E2F2C2D32333031363734353A3B38393E3F3C3D02030001060704050A0B08090E0F0C0D12131011161714151A1B18191E1F1C1D
```
Reviewed By: hx235
Differential Revision: D53048519
Pulled By: jaykorean
fbshipit-source-id: a6217905464c5f460a222e2b883bdff47b9dd9c7
Summary:
with release notes for 8.11.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D52926051
Pulled By: pdillinger
fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4
Summary:
Currently, we treat the long-running whitebox_crash_test as passing. However, we were not cleaning up after ourselves when we killed the running test for running too long, which often caused out-of-space errors in subsequent tests (e.g., blackbox_crash_test after whitebox_crash_test).
Unless we want to start treating these timeouts as failures and need the DB output for investigation now, we should properly clean up the tmp dir.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12248
Test Plan:
```
$> make crash_test -j
```
Reviewed By: ajkr
Differential Revision: D52885342
Pulled By: jaykorean
fbshipit-source-id: 7c1f2ca7cf03d0705bb14155ee44d5d7a411c132
Summary:
Add ```CompressionOptions``` to ```CompressedSecondaryCacheOptions``` to allow users to set options such as compression level. It allows performance to be fine tuned.
Tests -
Run db_bench and verify compression options in the LOG file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234
Reviewed By: ajkr
Differential Revision: D52758133
Pulled By: anand1976
fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6
Summary:
We test LockWAL() and UnlockWAL() by checking that latest sequence number is not changed: 1a1f9f1660/db_stress_tool/db_stress_test_base.cc (L920-L937). With writeprepared transaction, sequence number can be advanced in SwitchMemtable::WriteRecoverableState() when writing recoverable state: 1a1f9f1660/db/db_impl/db_impl_write.cc (L1560)
This PR disables LockWAL() tests for writeprepared transaction for now. We probably need to change how we test LockWAL() for writeprepared before re-enabling this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12221
Reviewed By: ajkr
Differential Revision: D52677076
Pulled By: cbi42
fbshipit-source-id: 27ee694878edf63e8f4ad52f769d4db401f511bc
Summary:
Currently, when `block_cache_trace_analyzer` analyzes the cache miss ratio, it only analyzes the total miss ratio.
But it seems also important to analyze the cache miss ratio of each caller. To achieve this, we can calculate and print the miss ratio of each caller in the analyzer.
## Before modification
```
Running for 1 seconds: Processed 85732 records/second. Trace duration 58 seconds. Observed miss ratio 7.97
```
## After modification
```
Running for 1 seconds: Processed 85732 records/second. Trace duration 58 seconds. Observed miss ratio 7.97
Caller Get: Observed miss ratio 6.31
Caller Iterator: Observed miss ratio 11.86
***************************************************************
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10823
Reviewed By: ajkr
Differential Revision: D52632764
Pulled By: hx235
fbshipit-source-id: 40994d6039b73dc38fe78ea1b4adce187bb98909
Summary:
This feature combination is not fully working yet. Disable them so the stress tests have less noise.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12218
Reviewed By: cbi42
Differential Revision: D52643957
Pulled By: jowlyzhang
fbshipit-source-id: 8815a18a3b5814cad4f7ec41f3fb94869302081e
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.
In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
* Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.
Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206
Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200
Reviewed By: akankshamahajan15
Differential Revision: D52551701
Pulled By: pdillinger
fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174
Test Plan: Unit test, and some manual verification
Reviewed By: ajkr
Differential Revision: D52451184
Pulled By: anand1976
fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83
Summary:
Currently, some numbers in the `tracer_analyzer_tool` may be a little confusing and unfriendly for people who want to add new query types.
It may be better to replace them with the existing enumeration type to improve readability.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10827
Reviewed By: ajkr
Differential Revision: D40576023
Pulled By: hx235
fbshipit-source-id: 0eb16820a15f365d53e848a3a8efd92928420429
Summary:
I landed https://github.com/facebook/rocksdb/issues/12159 which had the below compiler error when using `-DROCKSDB_NAMESPACE`, which broke the CircleCI "build-linux-static_lib-alt_namespace-status_checked" job:
```
tools/ldb_cmd_test.cc:1213:21: error: 'rocksdb' does not name a type
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^~~~~~~
tools/ldb_cmd_test.cc:1213:35: error: expected unqualified-id before '&' token
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
tools/ldb_cmd_test.cc:1213:35: error: expected ')' before '&' token
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ~ ^
| )
tools/ldb_cmd_test.cc:1213:35: error: expected ';' at end of member declaration
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
| ;
tools/ldb_cmd_test.cc:1213:37: error: 'a' does not name a type
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
...
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12173
Test Plan:
```
$ make clean && make OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" ldb_cmd_test -j56
```
Reviewed By: pdillinger
Differential Revision: D52373797
Pulled By: ajkr
fbshipit-source-id: 8597aaae65a5333831fef66d85072827c5fb1187
Summary:
According to this [Q&A](https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ#:~:text=Q%3A%20If%20I%20use%20non%2Ddefault%20comparators%20or%20merge%20operators%2C%20can%20I%20still%20use%20ldb%20tool%3F), user should be able to use LDB with passing a customized comparator into the option.
In the process of opening DB in order to perform ldb commands, there is a exception saying comparator not match even if a option with customized comparator is provided. After initializing the column family to open DB, the `LDBCommand::OverrideBaseCFOptions` method does not update the comparator inside column family descriptor using the passed in options. This can cause a mismatch while doing version edit, and in function `ToggleUDT CompareComparator` it will failed and return a exception saying comparator not match.
Propose fix by updating the column family descriptor's option using the user passed in option. Also a test case is provided to illustrate the steps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12159
Reviewed By: hx235
Differential Revision: D52267367
Pulled By: ajkr
fbshipit-source-id: c240f93f440e02cb485893de058a46c6dbf9654b
Summary:
**Context/Summary:**
Continued from https://github.com/facebook/rocksdb/pull/12127, we can randomly reduce the # max key to coerce more operations on the same key. My experimental run shows it surfaced more issue than just https://github.com/facebook/rocksdb/pull/12127.
I also randomly reduce the related parameters, write buffer size and target file base, to adapt to randomly lower number of # max key. This creates 4 situations of testing, 3 of which are new:
1. **high** # max key with **high** write buffer size and target file base (existing)
2. **high** # max key with **low** write buffer size and target file base (new, will go through some rehearsal testing to ensure we don't run out of space with many files)
3. **low** # max key with **high** write buffer size and target file base (new, keys will stay in memory longer)
4. **low** # max key with **low** write buffer size and target file base (new, experimental runs show it surfaced even more issues)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12148
Test Plan:
- [Ongoing] Rehearsal stress test
- Monitor production stress test
Reviewed By: jaykorean
Differential Revision: D52174980
Pulled By: hx235
fbshipit-source-id: bd5e11280826819ca9314c69bbbf05d481c6d105
Summary:
This PR adds initial stress testing for the user-defined timestamps in memtable only feature. Each flavor of the `*_ts` crash test get a 1 in 3 chance to run with timestamps not persisted, this setting is initialized once and kept consistent across the following re-runs.
This initial stress test included these things besides disabling incompatible feature combinations to make the test run more stably:
1) It currently only run test methods that validates db state with expected state. Not the ones that validate db state by comparing result from one API to another API. Such as `TestMultiGet` (compared with `Get`), similarly `TestMultiGetEntity`, `TestIterate` (compare src iterator to a control iterator). Due to timestamps being removed, results from one API to another API is not directly comparable as it is now. More test logic to handle that need to be added, will do that in a follow up.
2) Even when comparing db state to expected state, sometimes the db can receive `InvalidArgument` too due to timestamps getting flushed and removed. Added some logic to handle that.
3) When timestamps are not persisted, we don't try to read with older timestamp. Since that's making it easier to get `InvalidArgument`. And this capability is not yet needed by our customer so it's disabled for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12124
Test Plan: running multiple flavor of this test on continuous run for sometime before checkin
Reviewed By: ltamasi
Differential Revision: D51916267
Pulled By: jowlyzhang
fbshipit-source-id: 3f3eb5f9618d05d296062820e0ef5cb8edc7c2b2
Summary:
**Context/Summary:**
My experimental stress runs with more frequent "xxx_one_in" surfaced a couple interesting bugs/issues with RocksDB or crash test framework in the past. We now consider changing the default value so they are run more frequently in production testing environment.
Increase frequency by 2 orders of magnitude for most parameters, except for error-prone features e.g, manual compaction and file ingestion (increased by 3 orders) and expensive features e.g, checksum verification (increased by 1 order)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12127
Test Plan: Monitor CI to see if it did surface more interesting bugs/issues. If not, we may consider intensify even more.
Reviewed By: pdillinger
Differential Revision: D51954235
Pulled By: hx235
fbshipit-source-id: 92046cb7c52a37212f19ab7965b40f77b90b08b1
Summary:
This is a simple refactor for the crash test script to put shared logic for parsing stderr into a function. There is no functional change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12109
Test Plan: manually tested the script
Reviewed By: ajkr
Differential Revision: D51692172
Pulled By: jowlyzhang
fbshipit-source-id: d346d64e981d9c489c380ff6ce33296a224b5877
Summary:
Add the option to have a 3-tier block cache (uncompressed RAM, compressed RAM, and local flash) in db_bench, as well as specifying secondary cache admission policy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12104
Reviewed By: ajkr
Differential Revision: D51629092
Pulled By: anand1976
fbshipit-source-id: 6a208f853bc85d3d8b437d91cb1b0142d9a99e53
Summary:
Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`.
These are the types of files being actively tracked in quarantine in this PR:
1) new table files and blob files from a background job
2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed.
Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.
We track these files' file numbers because they share the same file number space.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030
Test Plan: Modified existing unit tests
Reviewed By: ajkr
Differential Revision: D51036774
Pulled By: jowlyzhang
fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716
Summary:
Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted.
This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them. When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK.
This PR also makes unit test that do not need force behavior to do a regular enable file deletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001
Reviewed By: ajkr
Differential Revision: D51214683
Pulled By: jowlyzhang
fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.
It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).
The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after is it marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.
This is easily fixed by holding on to a reference until inserted onto the chain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046
Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.
Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)
Reviewed By: jowlyzhang
Differential Revision: D51016979
Pulled By: pdillinger
fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b
Summary:
db_stress flag `verify_iterator_with_expected_state_one_in` is only enabled for in crash test if --simple flag is set. This PR enables it for all supported crash tests by enabling it by default. This adds coverage for --txn and --enable_ts crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12040
Test Plan:
ran crash tests that disabled this flag before for a few hours
```
python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --txn --txn_write_policy=[0,1,2]
python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --enable_ts
```
Reviewed By: ajkr, hx235
Differential Revision: D50980001
Pulled By: cbi42
fbshipit-source-id: 3daf6b4c32bdddc5df057240068162aa1a907587
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR.
There were two ways to achieve what we want.
1. Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`.
2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`.
We chose the latter as it involves smaller changes.
This change includes the following
- Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions`
- `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()`
- During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018
Test Plan:
- `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`.
- `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes`
Reviewed By: pdillinger
Differential Revision: D50723881
Pulled By: jaykorean
fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6
Summary:
- Right now in blackbox test we don't exit if there are std::error as we do in whitebox crash tests. As result those errors are swallowed.
It only errors out if state is unexpected.
One example that was noticed in blackbox crash test -
```
stderr has error message:
***Error restoring historical expected values: Corruption: DB is older than any restorable expected state***
Running db_stress with pid=30454: /packages/rocksdb_db_stress_internal_repo/rocks_db_stress ....
```
- This diff also provided support to export files - db_crashtest.py file to be used by different repo.
Reviewed By: ajkr
Differential Revision: D50564889
fbshipit-source-id: 7bafbbc6179dc79467ca2b680fe83afc7850616a
Summary:
**Context/Summary:**
We ignore trace writing status e.g, 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222)
If a write into the trace file fails, subsequent trace write will continue onto the same file.
This will trigger the assertion `assert(sync_without_flush_called_)` intended to catch write to a file that has previously seen error, added in https://github.com/facebook/rocksdb/pull/10489, https://github.com/facebook/rocksdb/pull/10555
Alternative (rejected) is to handle trace writing status at a higher level at e.g, 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222). However, it makes sense to ignore such status considering tracing is not a critical but assistant component to db operation. And this alternative requires more code change. So it's better to handle the failure at a lower level as this PR
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11996
Test Plan: Add new UT failed before this PR and pass after
Reviewed By: akankshamahajan15
Differential Revision: D50532467
Pulled By: hx235
fbshipit-source-id: f2032abafd94917adbf89a20841d15b448782a33
Summary:
... until I can reproduce and resolve assertion failures (mostly in PurgeImplLocked) seen in crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12000
Test Plan: make blackbox_crash_test
Reviewed By: hx235
Differential Revision: D50565984
Pulled By: pdillinger
fbshipit-source-id: 5eea1638ff2683c41b4f65ee1ffc2398071911e7
Summary:
When `secondary_cache_uri` is non-empty and the `cache_type` is not a tiered cache, then sanitize `compressed_secondary_cache_size` to 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11967
Test Plan: Run crash test
Reviewed By: akankshamahajan15
Differential Revision: D50346157
Pulled By: anand1976
fbshipit-source-id: 57bcbad2ec81fa736f1539a0a41ed6854ded2077
Summary:
Thanks ltamasi and ajkr for initial investigations on the test failure. Per the investigations, the following scenario is likely causing the test to fail.
1. Recovery is needed (could be any reason during crash test)
2. Trying to recover from the latest manifest fails (likely due to read error injection)
3. DB opens with recovery from the next manifest which is different from step 2.
4. Expected state is based on the manifest we tried and failed in step 2.
5. Two manifests used in step 2 and 3 are confirmed to have difference in LSM trees (Thanks ltamasi again for the finding).
```
2023/10/05-11:24:18.942189 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007184
...
2023/10/05-11:24:18.978007 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007180
```
```
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007184_renamed_ > 2
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007180_renamed_ > 1
[ltamasi@devbig1024.prn1 /tmp/x]$ diff 1 2
--- 1 2023-10-09 10:29:16.966215207 -0700
+++ 2 2023-10-09 10:29:11.984241645 -0700
@@ -13,7 +13,7 @@
7174:3950254[1875617 .. 2203952]['000000000003415B000000000000012B000000000000007D' seq:1906214, type:1 .. '000000000003CA59000000000000012B000000000000005C' seq:2039838, type:1]
7175:88060[2074748 .. 2203892]['000000000003CA6300000000000000CF78787878787878' seq:2167539, type:2 .. '000000000003D08F000000000000012B0000000000000130' seq:2112478, type:0]
--- level 6 --- version# 1 ---
- 7057:3132633[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
+ 7219:2135565[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
7061:827724[0 .. 2046131]['0000000000005F95000000000000000778787878787878' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000113' seq:0, type:1]
6763:1352[0 .. 0]['000000000000784F000000000000012B0000000000000129' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000129' seq:0, type:1]
7173:4812291[0 .. 2203957]['000000000000784F000000000000012B0000000000000138' seq:0, type:1 .. '0000000000020FAE787878787878' seq:0, type:1]
@@ -77,4 +77,4 @@
--- level 61 --- version# 1 ---
--- level 62 --- version# 1 ---
--- level 63 --- version# 1 ---
-next_file_number 7182 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
+next_file_number 7221 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
```
We have two options to fix this. Either skip verification against expected state or disable read injection when BE recovery is enabled. I chose to skip verification against expected state per discussion. (See comments in this PR)
Please note that some linter changes were included in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11938
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
```
Reviewed By: ltamasi
Differential Revision: D50136341
Pulled By: jaykorean
fbshipit-source-id: ac7434d592aebc148bfc3a4fcaa34936f136b95c
Summary:
This PR depends on https://github.com/facebook/rocksdb/issues/11879 . Enable write fault injection for the basic whitebox, blackbox, and cf_consistency modes. For other test modes like multiops_txn, best_efforts_recovery etc., leave it disabled for now until we can do more testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11924
Reviewed By: ajkr
Differential Revision: D50178252
Pulled By: anand1976
fbshipit-source-id: 5794f81c14cded1eb28762b2de818dfff1c1a34c
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.
Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920
Test Plan: tests added / updated
Reviewed By: jaykorean
Differential Revision: D49919147
Pulled By: pdillinger
fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913
The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.
Reviewed By: jaykorean
Differential Revision: D49795097
fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
Summary:
Users may run into an issue when running ldb on db that's in a different version and they have different set of options: `Failed: Invalid argument: Could not find option: <MISSING_OPTION>`
They can work around this by setting `--ignore_unknown_options`, but the error message is not clear for users to find why the option is missing. It's also hard for the users to find the `ignore_unknown_options` option especially if they are not familiar with the codebase or `ldb` tool.
This PR changes the error message to help users to find out what's wrong and possible workaround for the issue
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11907
Test Plan:
Testing by reproducing the issue locally
```
❯./ldb --db=/data/users/jewoongh/db_crash_whitebox_T164195541/ get a
Failed: Invalid argument: Could not find option: : unknown_option_test
This tool was built with version 8.8.0. If your db is in a different version, please try again with option --ignore_unknown_options.
```
Reviewed By: jowlyzhang
Differential Revision: D49762291
Pulled By: jaykorean
fbshipit-source-id: 895570150fde886d5ec524908c4b2664c9230ac9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11906
The patch adds stress test coverage for the wide-column aware `FullMergeV3` API by implementing a new `DBStressWideMergeOperator`. This operator is similar to `PutOperator` / `PutOperatorV2` in the sense that its result is based on the last merge operand; however, the merge result can be either a plain value or a wide-column entity, depending on the value base encoded into the operand and the value of the `use_put_entity_one_in` stress test parameter. Following the same rule for merge results that we do for writes ensures that the queries issued by the validation logic receive the expected results. The new operator is used instead of `PutOperatorV2` whenever `use_put_entity_one_in` is positive. Note that the patch also makes it possible to set `use_put_entity_one_in` and `use_merge` (but not `use_full_merge_v1`) at the same time, giving `use_put_entity_one_in` precedence, so the stress test will use `PutEntity` for writes passing the `use_put_entity_one_in` check described above and `Merge` for any other writes.
Reviewed By: jaykorean
Differential Revision: D49760024
fbshipit-source-id: 3893602c3e7935381b484f4f5026f1983e3a04a9
Summary:
Crash tests are failing with recent change of auto_readahead_size. Disable it in stress tests and enable it with fix to clear the crash tests failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11883
Reviewed By: pdillinger
Differential Revision: D49597854
Pulled By: akankshamahajan15
fbshipit-source-id: 0af8ca7414ee9b92f244ee0fb811579c3c052b41
Summary:
This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.
The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded.
We can easily implement additional admission policies if desired.
Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812
Reviewed By: pdillinger
Differential Revision: D49461842
Pulled By: anand1976
fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
Summary:
This PR contains two fixes:
1. disable write fault injection since it caused several other kinds of internal stress test failures. I'll try to fix those separately before enabling it again.
2. Fix segfault like
```
https://github.com/facebook/rocksdb/issues/5 0x000000000083dc43 in rocksdb::port::Mutex::Lock (this=0x30) at internal_repo_rocksdb/repo/port/port_posix.cc:80
80 internal_repo_rocksdb/repo/port/port_posix.cc: No such file or directory.
https://github.com/facebook/rocksdb/issues/6 0x0000000000465142 in rocksdb::MutexLock::MutexLock (mu=0x30, this=<optimized out>) at internal_repo_rocksdb/repo/util/mutexlock.h:37
37 internal_repo_rocksdb/repo/util/mutexlock.h: No such file or directory.
https://github.com/facebook/rocksdb/issues/7 rocksdb::FaultInjectionTestFS::DisableWriteErrorInjection (this=0x0) at internal_repo_rocksdb/repo/utilities/fault_injection_fs.h:505
505 internal_repo_rocksdb/repo/utilities/fault_injection_fs.h: No such file or directory.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11859
Test Plan: db_stress with no fault injection: `./db_stress --write_fault_one_in=0 --read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --sync_fault_injection=0`
Reviewed By: jaykorean
Differential Revision: D49408247
Pulled By: cbi42
fbshipit-source-id: 0ca01f20e6e81bf52af77818b50d562ef7462165
Summary:
* db_crashtest.py now may set `write_fault_one_in` to 500 for blackbox and whitebox simple test.
* Error injection only applies to writing to SST files. Flush error will cause DB to pause background operations and auto-resume. Compaction error will just re-schedule later.
* File ingestion and back up tests are updated to check if the result status is due to an injected error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11829
Test Plan:
a full round of whitebox simple and blackbox simple crash test
* `python3 ./tools/db_crashtest.py whitebox/blackbox --simple --write_fault_one_in=500`
Reviewed By: ajkr
Differential Revision: D49256962
Pulled By: cbi42
fbshipit-source-id: 68e0c9648d8e03bad39c7672b25d5500fc286d97
Summary:
An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars.
As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making
RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic.
I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838
Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in).
Reviewed By: ajkr
Differential Revision: D49296284
Pulled By: pdillinger
fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a
Summary:
**Context/Summary:**
It allows db bench reflect the default behavior of this option. For example, we recently changed its default value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11831
Test Plan: No code change
Reviewed By: cbi42
Differential Revision: D49253690
Pulled By: hx235
fbshipit-source-id: 445d4e54f62b4b538626e301a3014d2f00849d30
Summary:
- Fixed misspellings of "inject"
- Made user read errors retryable when `FLAGS_inject_error_severity == 1`
- Added compaction read errors when `FLAGS_read_fault_one_in > 0`. These are always retryable so that the DB will keep accepting writes
- Reenabled setting `compaction_readahead_size` in crash test. The reason for disabling it was to "keep the test clean", which is not a good enough reason to skip testing it
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11789
Test Plan:
With https://github.com/facebook/rocksdb/issues/11782 reverted, reproduced the bug:
- Build: `make -j56 db_stress`
- Command: `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --interval=10 --max_key=1000000`
- Output:
```
stderr has error message:
***put or merge error: Corruption: Compaction number of input keys does not match number of keys processed.***
```
Reviewed By: cbi42
Differential Revision: D48939994
Pulled By: ajkr
fbshipit-source-id: a1efb799efecdfd5d9cfd185e4a6321db8fccfbb
Summary:
This change add an experimental next-generation HyperClockCache (HCC) with automatic sizing of the underlying hash table. Both the existing version (stable) and the new version (experimental for now) of HCC are available depending on whether an estimated average entry charge is provided in HyperClockCacheOptions.
Internally, we call the two implementations AutoHyperClockCache (new) and FixedHyperClockCache (existing). The performance characteristics and much of the underlying logic are similar enough that AutoHCC is likely to make FixedHCC obsolete, and so it's best considered an evolution of the same technology or solution rather than an alternative. More specifically, both implementations share essentially the same logic for managing the state of individual entries in the cache, including metadata for reference counting and counting clocks for eviction. This metadata, which I like to call the "low-level HCC protocol," includes a read-write lock on entries, but relaxed consistency requirements on the cache (e.g. allowing rare duplication) means high-level cache operations never wait for these low-level per-entry locks. FixedHCC is fully wait-free.
AutoHCC is different in how entries are indexed into an efficient hash table. AutoHCC is "essentially wait-free" as there is no pattern of typical high-level operations on a large cache that can lead to one thread waiting on another to complete some work, though it can happen in some unusual/unlucky cases, or atypical uses such as erasing specific cache keys. Table growth and entry reclamation is more complex in AutoHCC compared to FixedHCC, so uses some localized locking to manage that. AutoHCC uses linear hashing to grow the table as needed, with low latency and to a precise size. AutoHCC depends on anonymous mmap support from the OS (currently verified working on Linux, MacOS, and Windows) to allow the array underlying a hash table to grow in place without wasting resident memory on space reserved but unused. AutoHCC uses a form of chaining while FixedHCC uses open addressing and double hashing.
More specifics:
* In developing this PR, a rare availability bug (minor) was noticed in the existing HCC implementation of Release()+erase_if_last_ref, which is now inherited into AutoHCC. Fixing this without a performance regression will not be simple, so is left for follow-up work.
* Some existing unit tests required adjustment of operational parameters or conditions to work with the new behaviors of AutoHCC. A number of bugs were found and fixed in the validation process, including getting unit tests in good working order.
* Added an option to cache_bench, `-degenerate_hash_bits` for correctness stress testing described below. For this, the tool uses the reverse-engineered hash function for HCC to generate keys in which the specified number of hash bits, in critical positions, have a fixed value. Essentially each degenerate hash bit will half the number of chain heads utilized and double the average chain length.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11738
Test Plan:
unit tests updated, and already added to db crash test. Also
## Correctness
The code includes generous assertions to check for unexpected states, especially at destruction time, so should be able to detect critical concurrency bugs. Less serious "availability bugs" in which cache data is hidden or cleanly lost are more difficult to detect, but also less scary for data correctness (as long as performance is good and the design is sound).
In average operation, the structure is extremely low stress and low contention (see next section) so stressing the corner case logic requires artificially stressing the operating conditions. First, we keep the structure small to increase the number of threads hitting the same chain or entry, and just one cache shard. Second, we artificially degrade the hashing so that chains are much longer than typical, using the new `-degenerate_hash_bits` option to cache_bench. Third, we re-create the structure from scratch frequently in order to exercise the Grow logic repeatedly and to get the benefit of the consistency checks in the structure's destructor in debug builds. For cache_bench this also means disabling the single-threaded "populate cache" step (normally used for steady state performance testing). And of course use many more threads than cores to have many preemptions.
An effective test for working out bugs was this (using debug build of course):
```
while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=8000000 -threads=100 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done
```
Or even smaller cases. This setup has around 27 utilized chains, with around 35 entries each, and yield-waits more than 1 million times per second (very high contention; see next section). I have let this run for hours searching for any lingering issues.
I've also run cache_bench under ASAN, UBSAN, and TSAN.
## Essentially wait free
There is a counter for number of yield() calls when one thread is waiting on another. When we pre-populate the structure in a single thread,
```
./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -populate_cache=1 -ops_per_thread=200000 2>&1 | grep Yield
```
We see something on the order of 1 yield call per second across 16 threads, even when we load the system other other jobs (parallel compilation). With -populate_cache=0, there are more yield opportunities with parallel table growth. On an otherwise unloaded system, we still see very small (single digit) yield counts, with a chance of getting into the thousands, and getting into 10s of thousands per second during table growth phase if the system is loaded with other jobs. However, I am not worried about this if performance is still good (see next section).
## Overall performance
Although cache_bench initially suggested performance very close to FixedHCC, there was a very noticeable performance hit under a db_bench setup like used in validating https://github.com/facebook/rocksdb/issues/10626. Much of the difference has been reduced by optimizing Lookup with a "naive" pass that will almost always find entries quickly, and only falling back to the careful Lookup algorithm when not found in the first pass.
Setups (chosen to be sensitive to block cache performance), and compiled with USE_CLANG=1 JEMALLOC=1 PORTABLE=0 DEBUG_LEVEL=0:
```
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
### No regression on FixedHCC
Running before & after builds at the same time on a 48 core machine.
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=readrandom[-X10],block_cache_entry_stats,cache_report_problems -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=24 -cache_type=fixed_hyper_clock_cache -seed=1234
```
Before:
readrandom [AVG 10 runs] : 847234 (± 8150) ops/sec; 59.2 (± 0.6) MB/sec
703MB max RSS
After:
readrandom [AVG 10 runs] : 851021 (± 7929) ops/sec; 59.5 (± 0.6) MB/sec
706MB max RSS
Probably no material difference.
### Single-threaded performance
Using `[-X2]` and `-threads=1` and `-duration=30`, running all three at the same time:
lru_cache: 55100 ops/sec, then 55862 ops/sec (627MB max RSS)
fixed_hyper_clock_cache: 60496 ops/sec, then 61231 ops/sec (626MB max RSS)
auto_hyper_clock_cache: 47560 ops/sec, then 56081 ops/sec (626MB max RSS)
So AutoHCC has more ramp-up cost in the first pass as the cache grows to the appropriate size. (In single-threaded operation, the parallelizability and per-op low latency of table growth is overall slower.) However, once up to size, its performance is comparable to LRUCache. FixedHCC's lean operations still win overall when a good estimate is available.
If we look at HCC table stats, we can see that this configuration is not favorable to AutoHCC (and I have verified that other memory sizes do not yield substantially different results, until shards are under-sized for the full filters):
FixedHCC:
Slot occupancy stats: Overall 47% (124991/262144), Min/Max/Window = 28%/64%/500, MaxRun{Pos/Neg} = 17/22
AutoHCC:
Slot occupancy stats: Overall 59% (125781/209682), Min/Max/Window = 43%/82%/500, MaxRun{Pos/Neg} = 76/16
Head occupancy stats: Overall 43% (92259/209682), Min/Max/Window = 24%/74%/500, MaxRun{Pos/Neg} = 19/26
Entries at home count: 53350
FixedHCC configuration is relatively good for speed, and not ideal for space utilization. As is typical, AutoHCC has tighter control on metadata usage (209682 x 64 bytes rather than 262144 x 64 bytes), and the higher load factor is slightly worse for speed. LRUCache also has more metadata usage, at 199680 x 96 bytes of tracked metadata (plus roughly another 10% of that untracked in the head pointers), and that metadata is subject to fragmentation.
### Parallel performance, high hit rate
Now using `[-X10]` and `-threads=10`, all three at the same time
lru_cache: [AVG 10 runs] : 263629 (± 1425) ops/sec; 18.4 (± 0.1) MB/sec
655MB max RSS, 97.1% cache hit rate
fixed_hyper_clock_cache: [AVG 10 runs] : 479590 (± 8114) ops/sec; 33.5 (± 0.6) MB/sec
651MB max RSS, 97.1% cache hit rate
auto_hyper_clock_cache: [AVG 10 runs] : 418687 (± 5915) ops/sec; 29.3 (± 0.4) MB/sec
657MB max RSS, 97.1% cache hit rate
Even with just 10-way parallelism for each cache (though 30+/48 cores busy overall), LRUCache is already showing performance degradation, while AutoHCC is in the neighborhood of FixedHCC. And that brings us to the question of how AutoHCC holds up under extreme parallelism, so now independent runs with `-threads=100` (overloading 48 cores).
lru_cache: 438613 ops/sec, 827MB max RSS
fixed_hyper_clock_cache: 1651310 ops/sec, 812MB max RSS
auto_hyper_clock_cache: 1505875 ops/sec, 821MB max RSS (Yield count: 1089 over 30s)
Clearly, AutoHCC holds up extremely well under extreme parallelism, even closing some of the modest performance gap with FixedHCC.
### Parallel performance, low hit rate
To get down to roughly 50% cache hit rate, we use `-cache_index_and_filter_blocks=0 -cache_size=1650000000` with `-threads=10`. Here the extra cost of running counting clock eviction, especially on the chains of AutoHCC, are evident, especially with the lower contention of cache_index_and_filter_blocks=0:
lru_cache: 725231 ops/sec, 1770MB max RSS, 51.3% hit rate
fixed_hyper_clock_cache: 638620 ops/sec, 1765MB max RSS, 50.2% hit rate
auto_hyper_clock_cache: 541018 ops/sec, 1777MB max RSS, 50.8% hit rate
Reviewed By: jowlyzhang
Differential Revision: D48784755
Pulled By: pdillinger
fbshipit-source-id: e79813dc087474ac427637dd282a14fa3011a6e4
Summary:
wide_columns can now be pretty-printed in the following commands
- `./ldb dump_wal`
- `./ldb dump`
- `./ldb idump`
- `./ldb dump_live_files`
- `./ldb scan`
- `./sst_dump --command=scan`
There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11754
Test Plan:
**New Tests added**
- `WideColumnsHelperTest::DumpWideColumns`
- `WideColumnsHelperTest::DumpSliceAsWideColumns`
**Changes added to existing tests**
- `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in https://github.com/facebook/rocksdb/issues/11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool.
- `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST
**dump_wal**
```
./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header
```
```
Sequence,Count,ByteSize,Physical Offset,Key(s) : value
1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A
```
**idump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump
```
```
'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:2, type:22 => attr_one:two attr_three:four
'third' seq:3, type:1 => baz
Internal keys in range: 3
```
**SST Dump from dump_live_files**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files
```
```
...
==============================
SST Files
==============================
/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1
------------------------------
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
...
```
**dump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump
```
```
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3
```
**scan**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan
```
```
first : :hello attr_name1:foo attr_name2:bar
second : attr_one:two attr_three:four
third : baz
```
**sst_dump**
```
./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan
```
```
options.env is 0x7ff54b296000
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
from [] to []
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
```
Reviewed By: ltamasi
Differential Revision: D48837999
Pulled By: jaykorean
fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f
Summary: Python3 makes the use of `(object)` in class inheritance unnecessary. Let's modernize our code by eliminating this.
Reviewed By: itamaro
Differential Revision: D48673915
fbshipit-source-id: a1a6ae8572271eb2898b748c8216ea68e362f06a
Summary:
Same as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11729
Test Plan: make crash_test -j32
Reviewed By: anand1976
Differential Revision: D48534820
Pulled By: akankshamahajan15
fbshipit-source-id: 3a2a28af98dfad164b82ddaaf9fddb94c53a652e
Summary:
In blackbox tests, db_stress command always run with timeout. Timeout can happen during validation, leaving some of the keys not checked. Since key validation is done in order, it is quite likely that keys those are towards to the end of the set are never validated. This PR adds a final execution, without timeout, to ensure validation is executed for all keys, at least once.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11592
Reviewed By: cbi42
Differential Revision: D48003998
Pulled By: hx235
fbshipit-source-id: 72543475a932f12cf0f57534b7e3b6e07e87080f
Summary:
For some ldb commands that doesn't need to open the DB, it's still useful to use the DB's existing OPTIONS file if it's available.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11721
Reviewed By: pdillinger
Differential Revision: D48485540
Pulled By: jowlyzhang
fbshipit-source-id: 2d2db837523044066f1a2c4b59a5c03f6cd35e6b
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions.
* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701
Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.
Reviewed By: jaykorean, ajkr
Differential Revision: D48331564
Pulled By: cbi42
fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
Summary:
* The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation.
* Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache).
* Simplify some existing tests not to depend directly on cache type.
* Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation.
* Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test.
* Add another templated class BaseHyperClockCache to reduce future copy-paste
* Added ReportProblems support to cache_bench
* Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC:
```
2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17
```
In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692
Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in.
Reviewed By: jowlyzhang
Differential Revision: D48247413
Pulled By: pdillinger
fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da
Summary:
**Context/Summary**
After https://github.com/facebook/rocksdb/pull/11058, we no longer lock the key range to iterate in TestIterateAgainstExpected, except for working with timestamp feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11695
Test Plan: no code change
Reviewed By: ajkr
Differential Revision: D48276668
Pulled By: hx235
fbshipit-source-id: dc92a3708b2281dc737c0877fb755548bf03a9fc
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
- For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
- New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.
- Misc
- More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
- Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444
Test Plan:
- CI fake db crash/stress test
- Microbenchmarking
**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR
**Test**
Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```
Result
```
Coming soon
```
AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```
Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,
PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```
Reviewed By: ajkr
Differential Revision: D45918925
Pulled By: hx235
fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
Summary:
I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache.
This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676
Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache
Reviewed By: ajkr
Differential Revision: D48103165
Pulled By: pdillinger
fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358
Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.
Reviewed By: ajkr
Differential Revision: D46582680
Pulled By: cbi42
fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
Summary:
Updated the main branch for the 8.5.fb branch cut. Also made unreleased_history/release.sh backdate to the last commit instead of the current date in case the release manager is a laggard like myself.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11642
Reviewed By: cbi42
Differential Revision: D47783574
Pulled By: ajkr
fbshipit-source-id: 4e2a80f5ccd542dc7dd0d22dfd7e59cb136325a1
Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.
To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files
Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.
## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.
## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)
## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).
Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).
```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```
Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness
Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058
Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.
Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.
The format compatibility test does not need updating because we have not updated the default
format_version.
Reviewed By: ajkr, mrambacher
Differential Revision: D33100915
Pulled By: pdillinger
fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
Summary:
Add path existence check in the script to avoid script running even when db_bench executable does not exist or relative path is not right.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11621
Reviewed By: jowlyzhang
Differential Revision: D47552590
Pulled By: ajkr
fbshipit-source-id: f09ea069f69e067212b249a22ad755b76bc6063a
Summary:
`sst_dump --command=verify` did not set read_options.verify_checksum to true so it was not verifying checksum.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11576
Test Plan:
ran the same command on an SST file with bad checksum:
```
sst_dump --command=verify --file=...sst_file_with_bad_block_checksum
Before this PR:
options.env is 0x6ba048
Process ...sst_file_with_bad_block_checksum
Sst file format: block-based
The file is ok
After this PR:
options.env is 0x7f43f6690000
Process ...sst_file_with_bad_block_checksum
Sst file format: block-based
... is corrupted: Corruption: block checksum mismatch: stored = 2170109798, computed = 2170097510, type = 4 ...
```
Reviewed By: ajkr
Differential Revision: D47136284
Pulled By: cbi42
fbshipit-source-id: 07d68db715c00347145e5b83d649aef2c3f2acd9
Summary:
Add `skip_tmpdir_check` argument in crash script. If `tmp_dir` is on remote storage and exist, `isdir` will be false (checking on local storage) leading to exit. By passing `skip_tmpdir_check` with `crashtest.py`, the dir check can be skipped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11539
Test Plan: Ran locally
Reviewed By: anand1976
Differential Revision: D46740456
Pulled By: akankshamahajan15
fbshipit-source-id: 8726882ef53d2c84b604c7515e84eda6d1bf797c
Summary:
Context:
OptimisticTransactionDB has not been covered by db_stress (including crash test) like TransactionDB.
1. Adding the following gflag options to to test OptimisticTransactionDB
- `use_optimistic_txn`: When true, open OptimisticTransactionDB to test
- `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default.
- `share_occ_lock_buckets`: Use shared occ locks
- `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock.
2. Opening OptimisticTransactionDB and NewTxn/Commit added per `use_optimistic_txn` flag in `db_stress_test_base.cc`
3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk
Please note that the existing flag `use_txn` is being used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. pessimistic transaction db). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true` throw error with message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513
Test Plan:
**Crash Test**
Serial Validation
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0"
make crash_test -j
```
Parallel Validation (no share bucket)
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0"
make crash_test -j
```
Parallel Validation (share bucket)
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500"
make crash_test -j
```
**Stress Test**
```
./db_stress -use_optimistic_txn -threads=32
```
Reviewed By: pdillinger
Differential Revision: D46547387
Pulled By: jaykorean
fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960
Summary:
after https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic by RocksDB and requires no manual compaction from user. Making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. Wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size.
The PR mostly contains fixes for unit tests as they assumed `level_compaction_dynamic_level_bytes=false`. Most notable change is commit f742be330c and b1928e42b3 which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps to reduce the change needed for unit tests. I think this default option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desired in unit tests as it makes it easier to create a desired LSM shape.
Comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and change made in https://github.com/facebook/rocksdb/issues/10057.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11525
Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change.
Reviewed By: ajkr
Differential Revision: D46654256
Pulled By: cbi42
fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
Summary:
This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows.
When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches.
The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows -
|--------- Primary Cache Configured Capacity -----------|
|---Secondary Cache Budget----|----Primary Cache Budget-----|
A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired.
|--Reservation for Sec Cache--|-Pri Cache Usable Capacity---|
When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation.
For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion -
|-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-|
Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount.
Other alternatives -
1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead.
2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them.
The PR contains the following changes -
1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```.
3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```.
4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```.
5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted.
Benchmarks -
Baseline
```
time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true
```
```
readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s
readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found)
real 6m31.052s
user 152m5.660s
sys 26m18.738s
```
With TieredVolatileCache
```
time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true
```
```
readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s
readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found)
real 6m42.743s
user 157m58.972s
sys 33m16.671
```
```
readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s
readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found)
real 6m31.469s
user 155m16.570s
sys 27m47.834s
```
ToDo -
1. Add to db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449
Reviewed By: pdillinger
Differential Revision: D46197388
Pulled By: anand1976
fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
Summary:
- add TEST_TMPDIR_EXPECTED env in crash test if expected dir is on different filesystem. If TEST_TMPDIR_EXPECTED is not specified, it'll fallback to default value of TEST_TMPDIR (Same as before)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11448
Test Plan: Ran locally
Reviewed By: anand1976
Differential Revision: D45870268
Pulled By: akankshamahajan15
fbshipit-source-id: 52a7b961d3647dde023dcf7f20341558e8a5b528
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.
No public API changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408
Test Plan: existing tests
Reviewed By: ltamasi
Differential Revision: D45456696
Pulled By: pdillinger
fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
Summary:
See motivation and description in new ShardedCacheOptions::hash_seed option.
Updated db_bench so that its seed param is used for the cache hash seed.
Made its code more safe to ensure seed is set before use.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391
Test Plan:
unit tests added / updated
**Performance** - no discernible difference seen running cache_bench repeatedly before & after. With lru_cache and hyper_clock_cache.
Reviewed By: hx235
Differential Revision: D45557797
Pulled By: pdillinger
fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934
Summary:
when I use g++-13 to exec the `make all` command, the output throws the warnings.
```
db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
376 | env_, GenerateFileName(file_number), std::move(contents)));
| ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
375 | EXPECT_OK(mock_table_factory_->CreateMockTable(
| ^~~~~~~~~
db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
376 | env_, GenerateFileName(file_number), std::move(contents)));
| ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
375 | EXPECT_OK(mock_table_factory_->CreateMockTable(
| ^~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
```
and I also add some `(void)unused_variable` statements because of the cmake argument `-Wunused-but-set-variable -Wunused-but-set-variable`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418
Reviewed By: akankshamahajan15
Differential Revision: D45528223
Pulled By: ajkr
fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
Summary:
add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates
Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.
Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):
```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
The readrandom ops/sec looks like the following:
Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB
Main 240805 223604 198176 161653 139040
PR prot_bytes=0 238691 226693 200127 161082 141153
PR prot_bytes=1 214983 193199 178532 137013 108211
prot_bytes=1 vs -10% -15% -10.8% -15% -23%
prot_bytes=0
```
The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
Reviewed By: ajkr
Differential Revision: D43970708
Pulled By: cbi42
fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.
**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
- Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob)
- May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
// PlainTable
Does not apply
```
- **Db bench 2: performance**
**Read**
SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
rocksdb.sst.read.micros COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec`
Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec`
Reviewed By: ajkr
Differential Revision: D44007011
Pulled By: hx235
fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
Summary:
... which increases default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, under stress conditions, block cache mutex contention could become a performance bottleneck. This change of default should alleviate that.
Note that reducing the size of cache shards (recommended minimum 512MB) could cause thrashing, e.g. on filter blocks, so capacity needs to increase to safely increase number of shards.
The 8MB default dates back to 2011 or earlier (f779e7a5), when the most simultaneous threads you could get from a single CPU socket was 20 (e.g. Intel Xeon E7-8870). Now more than 100 is available.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11350
Test Plan: unit tests updated
Reviewed By: cbi42
Differential Revision: D44674873
Pulled By: pdillinger
fbshipit-source-id: 91ed3070789b42679283c7e6dc97c41a6a97bdf4
Summary:
Similarly to `GetEntity` prior to https://github.com/facebook/rocksdb/issues/11303, the `MultiGetEntity` API is currently
only used in the DB verification logic of the stress tests. The patch introduces
a new mode where all point lookups are performed using `MultiGetEntity`,
and implements the corresponding logic in the non-batched, batched, and
CF consistency tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11336
Test Plan: Ran simple blackbox tests for the various stress test flavors.
Reviewed By: akankshamahajan15
Differential Revision: D44513285
Pulled By: ltamasi
fbshipit-source-id: c3db098501bf875b6a356b09fc676a0268d92c35
Summary:
…evel_bytes
During DB open, if a column family uses level compaction with level_compaction_dynamic_level_bytes=true, trivially move its files down in the LSM such that the bottommost files are in Lmax, the second from bottommost level files are in Lmax-1 and so on. This is aimed to make it easier to migrate level_compaction_dynamic_level_bytes from false to true. Before this change, a full manual compaction is suggested for such migration. After this change, user can just restart DB to turn on this option. db_crashtest.py is updated to randomly choose value for level_compaction_dynamic_level_bytes.
Note that there may still be too many unnecessary levels if a user is migrating from universal compaction or level compaction with a smaller level multiplier. A full manual compaction may still be needed in that case before some PR that automatically drain unnecessary levels like https://github.com/facebook/rocksdb/issues/3921 lands. Eventually we may want to change the default value of option level_compaction_dynamic_level_bytes to true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11321
Test Plan:
1. Added unit tests.
2. Crash test: ran a variation of db_crashtest.py (like 32516507e77521ae887e45091b69139e32e8efb7) that turns level_compaction_dynamic_level_bytes on and off and switches between LC and UC for the same DB.
TODO: Update `OptionChangeMigration`, either after this PR or https://github.com/facebook/rocksdb/issues/3921.
Reviewed By: ajkr
Differential Revision: D44341930
Pulled By: cbi42
fbshipit-source-id: 013de19a915c6a0502be569f07c4cc8f1c3c6be2
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.
## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.
## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
## Crash/stress test
Updated to use the new combination.
## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```
**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127
**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)
This is an acceptable difference IMHO.
Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec
So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
**Before**:
multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec
Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
Let's also look at Get() in db_bench:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec
**After**:
readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec
That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
**Before**:
readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec
**After**:
readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec
Reviewed By: ltamasi
Differential Revision: D44177044
Pulled By: pdillinger
fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
Summary:
The `GetEntity` API is currently used in the stress tests for verification purposes;
this patch extends the coverage by adding a mode where all point lookups in
the non-batched, batched, and CF consistency stress tests are done using this API.
The PR also includes a bit of refactoring to eliminate some boilerplate code around
the wide-column consistency checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11303
Test Plan: Ran stress tests of the batched, non-batched, and CF consistency varieties.
Reviewed By: akankshamahajan15
Differential Revision: D44148503
Pulled By: ltamasi
fbshipit-source-id: fecdbfd3e65a459bbf16ab7aa7b9173e19240077
Summary:
Fix hang in async_io benchmarks in regression script. I changed the order of benchmarks and that somehow fixed the issue of hang.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11285
Test Plan: Ran it manually
Reviewed By: pdillinger
Differential Revision: D43937431
Pulled By: akankshamahajan15
fbshipit-source-id: 7c43075d3be6b8f41d08e845664012768b769661
Summary:
Internally, the benchmark is going on hang state whereas when run on same host manually, it passes. Decrease the duration to 5s to figure out how much time it is taking to complete the benchmark.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11283
Test Plan: Ran manually internally
Reviewed By: hx235
Differential Revision: D43882260
Pulled By: akankshamahajan15
fbshipit-source-id: 9ea44164773d4df4fc05cd817b7e011426c4d428
Summary:
Provide support in benchmark regression to use different options to be used in async_io benchamark only - "$`MAX_READAHEAD_SIZE`", $`INITIAL_READAHEAD_SIZE`", "$`NUM_READS_FOR_READAHEAD_SIZE`".
If user wants to run set these parameters for all benchmarks then these parameters need to be set in OPTION file instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11262
Test Plan: Ran manually
Reviewed By: anand1976
Differential Revision: D43725567
Pulled By: akankshamahajan15
fbshipit-source-id: 28c3462dd785ffd646d44560fa9c92bc6a8066e5
Summary:
Fix regression script for async_io benchmark using incorrect ops and threads and wrong benchmark name during reporting results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11224
Test Plan: Ran manually
Reviewed By: anand1976
Differential Revision: D43287658
Pulled By: akankshamahajan15
fbshipit-source-id: 433e2caa0e51268e72a875549ab8f7f92a7a4216
Summary:
Same as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11215
Test Plan: Ran manually
Reviewed By: pdillinger
Differential Revision: D43194634
Pulled By: akankshamahajan15
fbshipit-source-id: 336a08a9076b222d7000e4eb2a87fc36b863b05b
Summary:
Need to scp the .so files. Switched to tar+ssh to support symlinks, faster handling of multiple files, and compression.
Also fixing some holes in 'make clean' as I've noticed files like 'librocksdb.so.7.7.0', 'librocksdb_test_debug.so', 'librocksdb_tools_debug.so' hanging around after `make clean`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11194
Test Plan:
Manually triggered regression test runs with change, manual `make clean`
https://fburl.com/sandcastle/gnxy5lvchttps://fburl.com/sandcastle/4pxodwh7
Reviewed By: cbi42
Differential Revision: D43069065
Pulled By: pdillinger
fbshipit-source-id: 48552b5980956784a1fdb40638d9e8ad6db51900
Summary:
Enable the set of crash test for when user defined timestamp is enabled in combination with BlobDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11163
Test Plan: `make check` and `db_stress`/`db_crashtest.py` with various combinations.
Reviewed By: ltamasi
Differential Revision: D42906457
Pulled By: jowlyzhang
fbshipit-source-id: 6bec6449a4213b536c787420ff30a7d17b676deb
Summary:
First, we made a small reduction in DURATION_RW as runs were exceeding 1 hour and colliding with subsequent runs.
See Mark Callaghan’s blog post at http://smalldatum.blogspot.com/2023/01/variance-in-rocksdb-benchmarks-on-cloud.html
Configuration parameters which are not consistent with the following email from Mark (see the blog post for more context) have been updated. Where Mark has defined the parameter and we haven't, we define it explicitly. We will need to further monitor for an expected reduction in variance of test times:
To match what I did:
---
nsecs=1800
dbdir=/data/m/rx
resultdir=bm.lc.nt1.cm1.d0
env WRITE_BUFFER_SIZE_MB=16 TARGET_FILE_SIZE_BASE_MB=16 MAX_BYTES_FOR_LEVEL_BASE_MB=64 MAX_BACKGROUND_JOBS=4 NUM_KEYS=20000000 CACHE_SIZE_MB=10240 DURATION_RW=$nsecs DURATION_RO=$nsecs MB_WRITE_PER_SEC=2 NUM_THREADS=1 COMPRESSION_TYPE=none CACHE_INDEX_AND_FILTER_BLOCKS=1 VALUE_SIZE=400 NUMA=1 MIN_LEVEL_TO_COMPRESS=3 COMPACTION_STYLE=leveled bash benchmark_compare.sh $dbdir $resultdir 7.8.fb
env WRITE_BUFFER_SIZE_MB=16 TARGET_FILE_SIZE_BASE_MB=16 MAX_BYTES_FOR_LEVEL_BASE_MB=64 MAX_BACKGROUND_JOBS=4 NUM_KEYS=200000000 CACHE_SIZE_MB=10240 DURATION_RW=$nsecs DURATION_RO=$nsecs MB_WRITE_PER_SEC=2 NUM_THREADS=1 COMPRESSION_TYPE=lz4 CACHE_INDEX_AND_FILTER_BLOCKS=1 VALUE_SIZE=400 NUMA=1 MIN_LEVEL_TO_COMPRESS=3 COMPACTION_STYLE=leveled bash benchmark_compare.sh $dbdir $resultdir 7.8.fb
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11074
Reviewed By: ajkr
Differential Revision: D42969668
Pulled By: cbi42
fbshipit-source-id: 1ea4e6a3901be4016108f93817eb58f74baac21a
Summary:
The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:
* Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
* Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
* Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
* Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
* Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
* Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
Test Plan: unit tests added / updated, added to stress/crash test
Reviewed By: ajkr
Differential Revision: D42848627
Pulled By: pdillinger
fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.
Most of changes were done through following comments:
unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
Test Plan: See CI
Reviewed By: pdillinger
Differential Revision: D42796341
fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
Summary:
Add the most basic support such that trace_analyzer commands no longer fail with
```
Cannot process the write batch in the trace
Cannot process the TraceRecord
PutEntityCF not implemented
Cannot process the trace
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11127
Reviewed By: cbi42
Differential Revision: D42732319
Pulled By: ajkr
fbshipit-source-id: 162d8a31318672a46539b1b042ec25f69b25c4ed
Summary:
Compressed block cache is replaced by compressed secondary cache. Remove the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117
Test Plan: See CI passes
Reviewed By: pdillinger
Differential Revision: D42700164
fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
Summary:
Same as title
Error:
```
block_cache_trace_analyzer: ./db/dbformat.h:421: uint64_t rocksdb::GetInternalKeySeqno(const rocksdb::Slice&): Assertion `n >= kNumInternalBytes' failed.
Aborted (core dumped)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11042
Test Plan:
- Added new unit test which fails without the fix.
- Also ran manually on traces to confirm.
Reviewed By: anand1976
Differential Revision: D42481587
Pulled By: akankshamahajan15
fbshipit-source-id: 7c33eb03a4a4d8ffbabcfbe0efa1e4d11bde3ba2
Summary:
When running `-benchmarks=compact`, `-subcompactions` does not take effect.
`-subcompactions` option comment says it is for L0-L1 compactions, it is natural to extend it to CompactionRangeOptions.max_subcompactions.
This PR set CompactionRangeOptions.max_subcompactions = FLAGS_subcompactions
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11077
Reviewed By: akankshamahajan15
Differential Revision: D42506251
Pulled By: ajkr
fbshipit-source-id: f77c9a99d32ff7af59f3c452c9e16aaeb0360304
Summary:
**Context/Summary:**
This reverts commit fc74abb436 and related HISTORY record.
The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004
Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D41572604
Pulled By: hx235
fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954
Test Plan: tests updated
Reviewed By: akankshamahajan15
Differential Revision: D41310123
Pulled By: pdillinger
fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
Summary:
This option allows a custom cleanup command line for a non-Posix file system to be used by db_crashtest.py to cleanup between runs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10883
Test Plan: Run the whitebox crash test
Reviewed By: pdillinger
Differential Revision: D40726424
Pulled By: anand1976
fbshipit-source-id: b827f6b583ff78f9ca75ced2d96f7e58f5200432
Summary:
Some lines of .h and .cc files are not properly fomatted. Clear them up with clang format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10868
Test Plan: Watch existing CI to pass
Reviewed By: ajkr
Differential Revision: D40683485
fbshipit-source-id: 491fbb78b2cdcb948164f306829909ad816d5d0b
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress
./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```
**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
- [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
- For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
- This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
- For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
- This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
- Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.
Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only
Reviewed By: ajkr
Differential Revision: D40090485
Pulled By: hx235
fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
Summary:
Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.
`DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`.
New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer.
This same philosophy can be applied to KV and IO tracing as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811
Test Plan:
existing unit tests
Old API DB::StartBlockTrace checked with db_bench tool
create database
```
./db_bench --benchmarks="fillseq" \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000
```
To trace block cache accesses when running readrandom benchmark:
```
./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
--threads=16 \
-block_cache_trace_file="/tmp/binary_trace_test_example" \
-block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
-block_cache_trace_sampling_frequency=1
```
Reviewed By: anand1976
Differential Revision: D40435289
Pulled By: akankshamahajan15
fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.
For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828
Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI
Reviewed By: hx235
Differential Revision: D40447634
Pulled By: pdillinger
fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
Summary:
This is a small follow-up to https://github.com/facebook/rocksdb/pull/10821. The goal of that PR was to hold `test_batches_snapshots` fixed across all `db_stress` invocations; however, that patch didn't address the case when `test_batches_snapshots` is unset due to a conflicting `enable_compaction_filter` or `prefix_size` setting. This PR updates the logic so the other parameter is sanitized instead in the case of such conflicts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10830
Reviewed By: riversand963
Differential Revision: D40444548
Pulled By: ltamasi
fbshipit-source-id: 0331265704904b729262adec37139292fcbb7805
Summary:
We have recently made some stress test improvements that rely on decoding the "value base" from the values stored in the database. This logic does not currently support the case when some KVs are written by a non-batched ops run and some by a batched ops run. The patch temporarily disables mixing these two.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10821
Reviewed By: riversand963
Differential Revision: D40367326
Pulled By: ltamasi
fbshipit-source-id: 66f2e0cbc097ab6b1f9e4b39b833bd466f1aaab5
Summary:
This has several small improvements.
benchmark.sh
* add BYTES_PER_SYNC as an env variable
* use --prepopulate_block_cache when O_DIRECT is used
* use --undefok to list options that don't work for all 7.x releases
* print "failure" in report.tsv when a benchmark fails
* parse the slightly different throughput line used by db_bench for multireadrandom
* remove the trailing comma for BlobDB size before printing it in report.tsv
* use the last line of the output from /bin/time as there can be more than one line when db_bench has a non-zero exit
* fix more bash lint warnings
* add ",stats" to the --benchmark=... lines to get stats at the end of each benchmark
benchmark_compare.sh
* run revrange immediately after fillseq to let compaction debt get removed
* add --multiread_batched when --benchmarks=multireadrandom is used
* use --benchmarks=overwriteandwait when supported to get a more accurate measure of write-amp
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10803
Test Plan: Run it for leveled, universal and BlobDB
Reviewed By: jay-zhuang
Differential Revision: D40278315
Pulled By: mdcallag
fbshipit-source-id: 793134ddc7d48d05a07436cd8942c375a23983a7
Summary:
This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state.
Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796
Test Plan: tests added/updated, that fail before the change(s) and now pass
Reviewed By: jay-zhuang
Differential Revision: D40232645
Pulled By: jay-zhuang
fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be
Summary:
We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)
This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768
Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
Reviewed By: anand1976
Differential Revision: D40034747
Pulled By: anand1976
fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
Summary:
`SstFileWriter` currently does not support the `PutEntity` API, so in `TestIngestExternalFile` all key-values are written using regular `Put`s. This violates the assumption that whether or not a key corresponds to a plain old key-value or a wide-column entity can be determined by solely looking at the "value base" used when generating the value. The patch fixes this issue by disabling ingestion when `PutEntity` is enabled in the stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10769
Test Plan: Ran a simple blackbox stress test.
Reviewed By: akankshamahajan15
Differential Revision: D40042132
Pulled By: ltamasi
fbshipit-source-id: 93e75ff55545b7b69fa4ddef1d96093c961158a0
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case.
| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|
Reviewed By: riversand963
Differential Revision: D39441192
Pulled By: cbi42
fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
- As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce