Summary:
Since it has been causing a few crash test failures, I suspect it'll be easy to repro locally. Also fixed how its corruption message is printed so the test does not crash with "output cannot be utf-8 decoded".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12431
Reviewed By: hx235
Differential Revision: D54881023
Pulled By: cbi42
fbshipit-source-id: 47208a637cd69b30d2545154849405e37db62ed3
Summary:
**Context/Summary:**
We are doing a sweep of all public options, including but not limited to `Options`, `Read/WriteOptions`, `IngestExternalFileOptions`, cache options, etc., to find the uncovered ones and add them to the db crash test. The options included in this PR require minimal changes to db crash test other than adding the options themselves.
A bonus change: to surface new issues through improved coverage of stderr, we decided to fail/terminate the crash test for manual compactions (CompactFiles(), CompactRange()) on meaningful errors. See https://github.com/facebook/rocksdb/pull/12414/files#diff-5c4ced6afb6a90e27fec18ab03b2cd89e8f99db87791b4ecc6fa2694284d50c0R2528-R2532, https://github.com/facebook/rocksdb/pull/12414/files#diff-5c4ced6afb6a90e27fec18ab03b2cd89e8f99db87791b4ecc6fa2694284d50c0R2330-R2336 for more.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12414
Test Plan:
- Run `python3 ./tools/db_crashtest.py --simple blackbox` for 10 minutes to ensure no trivial failure
- Run `python3 tools/db_crashtest.py --simple blackbox --compact_files_one_in=1 --compact_range_one_in=1 --read_fault_one_in=1 --write_fault_one_in=1 --interval=50` for a while to ensure the bonus change does not result in trivial crash/termination of stress test
Reviewed By: ajkr, jowlyzhang, cbi42
Differential Revision: D54691774
Pulled By: hx235
fbshipit-source-id: 50443dfb6aaabd8e24c79a2e42b68c6de877be88
Summary:
Add `SstFileReader::VerifyNumEntries()` for this purpose. I added the same functionality to `sst_dump` in https://github.com/facebook/rocksdb/issues/12322. Since sst_file_reader.h is exposed to users while sst_dump.h is not, it seems more appropriate to add SST files related APIs here.
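A minimal usage sketch of the new API (the exact signature, e.g. whether it takes a `ReadOptions`, is an assumption here rather than confirmed by this change):
```
#include <iostream>
#include "rocksdb/options.h"
#include "rocksdb/sst_file_reader.h"

int main(int argc, char** argv) {
  rocksdb::Options options;
  rocksdb::SstFileReader reader(options);
  rocksdb::Status s = reader.Open(argv[1]);
  if (s.ok()) {
    // Scans the file and compares the number of entries found against the
    // num_entries recorded in the table properties; a mismatch is reported
    // as Corruption.
    s = reader.VerifyNumEntries(rocksdb::ReadOptions());
  }
  std::cout << s.ToString() << std::endl;
  return s.ok() ? 0 : 1;
}
```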
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12418
Test Plan: `./sst_file_reader_test --gtest_filter="*VerifyNumEntries*"`
Reviewed By: jowlyzhang
Differential Revision: D54764271
Pulled By: cbi42
fbshipit-source-id: 22ebfe04bbb0b152762cee13d4210b147b36d3e9
Summary:
This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the two features can now be enabled together.
The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits.
Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery.
The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios.
The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary.
- `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates.
- `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406
Reviewed By: jaykorean, cbi42
Differential Revision: D54494904
Pulled By: ajkr
fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52
Summary:
When the rate limiter does not have any waiting requests, the first request to arrive may consume all of the available bandwidth, despite potentially having lower priority than requests that arrive later in the same refill interval. Then, those higher priority requests must wait for a refill. So even in scenarios in which we have an overall bandwidth surplus, the highest priority requests can be sporadically delayed up to a whole refill period.
Alone, this isn't necessarily problematic as the refill period is configurable via `refill_period_us` and can be tuned down as needed until the max sporadic delay is tolerable. However, tuning down `refill_period_us` had a side effect of reducing burst size. Some users require a certain burst size to issue optimal I/O sizes to the underlying storage system.
To satisfy those users, this PR decouples the refill period from the burst size. That way, the max sporadic delay can be limited without impacting I/O sizes issued to the underlying storage system.
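A sketch of what this could look like from the API side, assuming the new knob is exposed as a trailing `single_burst_bytes` parameter on `NewGenericRateLimiter` (the name mirrors the db_bench flag used in the test plan below; treat the exact signature as an assumption):
```
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"

rocksdb::Options MakeRateLimitedOptions() {
  rocksdb::Options options;
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(
      /*rate_bytes_per_sec=*/64 << 20,
      /*refill_period_us=*/10 * 1000,  // short period caps the sporadic delay
      /*fairness=*/10,
      rocksdb::RateLimiter::Mode::kWritesOnly,
      /*auto_tuned=*/false,
      /*single_burst_bytes=*/16 << 20));  // burst size stays large for big I/Os
  return options;
}
```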
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12379
Test Plan:
The goal is to show we can now limit the max sporadic delay without impacting compaction's I/O size.
The benchmark runs compaction with a large I/O size, while user reads simultaneously run at a low rate that does not consume all of the available bandwidth. The max sporadic delay is measured using the P100 of rocksdb.file.read.get.micros. I just used strace to verify the compaction reads follow `rate_limiter_single_burst_bytes`
Setup: `./db_bench -benchmarks=fillrandom,flush -write_buffer_size=67108864 -disable_auto_compactions=true -value_size=256 -num=1048576`
Benchmark: `./db_bench -benchmarks=readrandom -use_existing_db=true -num=1048576 -duration=10 -benchmark_read_rate_limit=4096 -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -rate_limit_user_ops=true -statistics=true -cache_size=0 -stats_level=5 -compaction_readahead_size=16777216 -use_direct_reads=true`
Results:
refill_micros | rocksdb.file.read.get.micros (P100)
-- | --
10000 | 10802
100000 | 100240
1000000 | 922061
For verifying compaction read sizes: `strace -fye pread64 ./db_bench -benchmarks=compact -use_existing_db=true -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -compaction_readahead_size=16777216 -use_direct_reads=true`
Reviewed By: hx235
Differential Revision: D54165675
Pulled By: ajkr
fbshipit-source-id: c5968486316cbfb7ff8e5b7d75d3589883dd1105
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.
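A sketch of the relevant options, assuming the rollback flag lives alongside `max_successive_merges` on the column family options (the flag name comes from this description; where exactly it is declared is an assumption):
```
#include "rocksdb/merge_operator.h"
#include "rocksdb/options.h"

rocksdb::Options MakeMergeOptions(std::shared_ptr<rocksdb::MergeOperator> op) {
  rocksdb::Options options;
  options.merge_operator = std::move(op);
  options.max_successive_merges = 100;
  // New relaxed behavior: the limit may be exceeded rather than reading the
  // existing value from SST files in the write path. Set to true to roll back
  // to the old strict counting.
  options.strict_max_successive_merges = false;
  return options;
}
```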
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365
Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.
```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```
Reviewed By: hx235
Differential Revision: D53982520
Pulled By: ajkr
fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
Summary:
with release notes for 9.0.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D53879416
Pulled By: jowlyzhang
fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352
Test Plan:
tests updated, including format compatibility test
table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes
to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only
could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id.
I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.
Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match
the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.
Reviewed By: anand1976
Differential Revision: D53786884
Pulled By: pdillinger
fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
Summary:
There is no strong reason for users to need this mode, while on the other hand its behavior is destructive.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337
Reviewed By: hx235
Differential Revision: D53630393
Pulled By: jowlyzhang
fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
Summary:
Some errors, like data races and heap-use-after-free, are only caught because the crash test reports them as errors by relying on stderr. So this reverts back to the original form until we come up with a more reliable solution for erroring out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12335
Reviewed By: cbi42
Differential Revision: D53534781
Pulled By: akankshamahajan15
fbshipit-source-id: b19aa560d1560ac2281f7bc04e13961ed751f178
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technically UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
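To make the multiple-inheritance pitfall above concrete, here is a small self-contained sketch (hypothetical types, not code from this change) showing that `static_cast` adjusts the pointer to the `Base2` subobject while `reinterpret_cast` does not:
```
#include <cstdio>

struct Base1 { int x = 1; };
struct Base2 { int y = 2; };
struct Derived : public Base1, public Base2 {};

int main() {
  Derived d;
  Derived* dp = &d;
  // static_cast knows the class layout and adjusts to the Base2 subobject.
  Base2* via_static = static_cast<Base2*>(dp);
  // reinterpret_cast just reuses the same address (the start of Derived/Base1),
  // so dereferencing it as Base2 would actually read Base1::x.
  Base2* via_reinterpret = reinterpret_cast<Base2*>(dp);
  std::printf("addresses differ: %s\n",
              static_cast<void*>(via_static) != static_cast<void*>(via_reinterpret)
                  ? "yes" : "no");
  std::printf("via_static->y = %d\n", via_static->y);  // prints 2
  return 0;
}
```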
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
Summary:
# Summary
Following up on jowlyzhang's comment in https://github.com/facebook/rocksdb/issues/12283 .
- Remove `ARG_TTL` from the help, since it is not relevant to the `multi_get` command
- Treat NotFound status as a non-error case for both `Get` and `MultiGet`, and update the unit test `ldb_test.py`
- Print the key along with the value in the `multi_get` command
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12332
Test Plan:
**Unit Test**
```
$>python3 tools/ldb_test.py
...
Ran 25 tests in 17.447s
OK
```
**Manual Run**
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000000D8 0x0000000000000009000000000000002678787878BEEF
0x0000000000000009000000000000012B00000000000000D8 ==> 0x47000000434241404F4E4D4C4B4A494857565554535251505F5E5D5C5B5A595867666564636261606F6E6D6C6B6A696877767574737271707F7E7D7C7B7A797807060504030201000F0E0D0C0B0A090817161514131211101F1E1D1C1B1A1918
Key not found: 0x0000000000000009000000000000002678787878BEEF
```
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex get 0x00000000000000090000000000
Key not found
```
Reviewed By: jowlyzhang
Differential Revision: D53450164
Pulled By: jaykorean
fbshipit-source-id: 9ccec78ad3695e65b1ed0c147c7cbac502a1bd48
Summary:
I've always found this name difficult to read, because it sounds like it's for collecting int(eger)
table properties.
I'm fixing this now to set up for a change that I have stubbed out in the public API (table_properties.h):
a new adapter function `TablePropertiesCollector::AsInternal()` that allows RocksDB-provided
TablePropertiesCollectors (such as CompactOnDeletionCollector) to implement the easier-to-upgrade
internal interface while still (superficially) implementing the public interface. In addition to added flexibility,
this should be a performance improvement as the adapter class UserKeyTablePropertiesCollector can be
avoided for such cases where a RocksDB-provided collector is used (AsInternal() returns non-nullptr).
table_properties.h is the only file with changes that aren't simple find-replace renaming.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12320
Test Plan: existing tests, CI
Reviewed By: ajkr
Differential Revision: D53336945
Pulled By: pdillinger
fbshipit-source-id: 02535bcb30bbfb00e29e8478af62e5dad50a63b8
Summary:
sst_dump --command=check can now compare the number of keys in a file with num_entries in the table property and report corruption if there is a mismatch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12322
Test Plan:
- new unit test for API `SstFileDumper::ReadSequential`
- ran sst_dump on a good and a bad file:
```
sst_dump --file=./32316112.sst
options.env is 0x7f68bfcb5000
Process ./32316112.sst
Sst file format: block-based
from [] to []
sst_dump --file=./32316115.sst
options.env is 0x7f6d0d2b5000
Process ./32316115.sst
Sst file format: block-based
from [] to []
./32316115.sst: Corruption: Table property has num_entries = 6050408 but scanning the table returns 6050406 records.
```
Reviewed By: jowlyzhang
Differential Revision: D53320481
Pulled By: cbi42
fbshipit-source-id: d84c996346a9575a5a2ea5f5fb09a9d3ee672cd6
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311
Reviewed By: cbi42
Differential Revision: D53233379
Pulled By: ajkr
fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
Summary:
For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes:
1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from the end key of a range tombstone. In some places it's only logically removed (replaced with a min timestamp) because there is still logic using the running comparator that expects a user key containing a timestamp. In the block-based builder, it is eventually physically removed before being persisted in a block.
2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`.
3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. d69628e6ce/db/dbformat.h (L923-L935)
This max timestamp is removed when the in-memory `FileMetaData.largest` is persisted into the Manifest; we pad it back when it's read from the Manifest while handling the related `VersionEdit` in `VersionEditHandler`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254
Test Plan: Added unit test and enabled this feature combination's stress test.
Reviewed By: cbi42
Differential Revision: D52965527
Pulled By: jowlyzhang
fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985
Summary:
While working on Meta's internal test triaging process, I found that `db_crashtest.py` was printing out `stdout` and `stderr` altogether. Adding an option to print `stderr` separately so that it's easy to extract only `stderr` from the test run.
`print_stderr_separately` is introduced as an optional parameter with default value `False` to keep the existing behavior as is (except a few minor changes).
Minor changes to the existing behavior
- We no longer print `stderr has error message:` or prefix each line with `***`. If stderr is printed within stdout, we simply print `stderr:` first and then print the stderr contents as-is.
- We no longer print `times error occurred in output is ...`, which doesn't appear to have any value
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12301
Test Plan:
**Default Behavior (blackbox)**
Run printed everything as is
```
$> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 2> /tmp/error.log
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_blackboxwh7yxpec]
(Re-)verified 0 unique IDs
2024/01/29-09:16:30 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:16:35 Starting database operations
2024/01/29-09:16:35 Starting verification
Stress Test : 543.600 micros/op 8802 ops/sec
: Wrote 0.00 MB (0.27 MB/sec) (50% of 10 ops)
: Wrote 5 times
: Deleted 1 times
: Single deleted 0 times
: 4 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 0 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
Verification failed :(
```
Nothing in stderr
```
$> cat /tmp/error.log
```
**Default Behavior (whitebox)**
Run printed everything as is
```
$> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 2> /tmp/error.log
Running whitebox-crash-test with
total-duration=10000
...
(Re-)verified 571 unique IDs
2024/01/29-09:33:53 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:35:16 Starting database operations
2024/01/29-09:35:16 Starting verification
Stress Test : 97248.125 micros/op 10 ops/sec
: Wrote 0.00 MB (0.00 MB/sec) (12% of 8 ops)
: Wrote 1 times
: Deleted 0 times
: Single deleted 0 times
: 4 read and 1 found the key
: Prefix scanned 1 times
: Iterator size sum is 120868
: Iterated 4 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 4865393
Verification failed :(
TEST FAILED. See kill option and exit code above!!!
```
Nothing in stderr
```
$> cat /tmp/error.log
```
**New option added (blackbox)**
```
$> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --print_stderr_separately 2> /tmp/error.log
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_blackbox7ybna32z]
(Re-)verified 0 unique IDs
Compaction filter factory: DbStressCompactionFilterFactory
2024/01/29-09:05:39 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:05:46 Starting database operations
2024/01/29-09:05:46 Starting verification
Stress Test : 235.917 micros/op 16000 ops/sec
: Wrote 0.00 MB (0.16 MB/sec) (16% of 12 ops)
: Wrote 2 times
: Deleted 1 times
: Single deleted 0 times
: 9 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 0 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
```
stderr printed separately
```
$> cat /tmp/error.log
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 19461571
Verification failed :(
```
**New option added (whitebox)**
```
$> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 --print_stderr_separately 2> /tmp/error.log
Running whitebox-crash-test with
total-duration=10000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_whiteboxtwj0ihn6]
(Re-)verified 157 unique IDs
2024/01/29-09:39:59 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:40:16 Starting database operations
2024/01/29-09:40:16 Starting verification
Stress Test : 742.474 micros/op 11801 ops/sec
: Wrote 0.00 MB (0.27 MB/sec) (36% of 19 ops)
: Wrote 7 times
: Deleted 1 times
: Single deleted 0 times
: 8 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 4 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
TEST FAILED. See kill option and exit code above!!!
```
stderr printed separately
```
$> cat /tmp/error.log
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
Error : jewoongh injected test error This is not a real failure.
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 4865393
Verification failed :(
```
Reviewed By: akankshamahajan15
Differential Revision: D53187491
Pulled By: jaykorean
fbshipit-source-id: 76f9100d08b96d014e41b7b88b206d69f0ae932b
Summary:
Right now crash_test also relies on std::errors, along with verification, to check for errors/failures. However, that's not a reliable solution, and many internal services log benign errors/warnings, in which case our test script fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12265
Test Plan: Keep std::errors but print them out instead of failing, and monitor crash tests internally to see if there is any scenario that relies solely on std::error, in which case the stress tests can be improved.
Reviewed By: ajkr, cbi42
Differential Revision: D52967000
Pulled By: akankshamahajan15
fbshipit-source-id: 5328c8b69480c7946fe6a9c72f9ffeede70ac2ad
Summary:
The current implementation of the ldb_cmd tool involves commenting out the user-passed column_family_descriptors, resulting in the tool consistently constructing its column_family_descriptors from the pre-existing OPTIONS file.
The proposed fix prioritizes user-passed column family descriptors, ensuring they take precedence over those specified in the OPTIONS file. This modification enhances the tool's adaptability and responsiveness to user configurations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12261
Reviewed By: cbi42
Differential Revision: D52965877
Pulled By: ajkr
fbshipit-source-id: 334a83a8e1004c271b19e7ca09381a0e7cf87b03
Summary:
While investigating test failures due to the inconsistency between `Get()` and `MultiGet()`, I realized that LDB currently doesn't support `MultiGet()`. This PR introduces the `MultiGet()` support in LDB. Tested the command manually. Unit test will follow in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12283
Test Plan:
When key not found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB
Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
```
Compare the same key with get
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex get 0x0000000000000009000000000000012B00000000000002AB
Failed: Get failed: NotFound:
```
Multiple keys not found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB 0x0000000000000009000000000000012B00000000000002AC Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
Status for key 0x0000000000000009000000000000012B00000000000002AC: NotFound:
```
One of the keys found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB 0x00000000000000090000000000000026787878787878
Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
0x22000000262724252A2B28292E2F2C2D32333031363734353A3B38393E3F3C3D02030001060704050A0B08090E0F0C0D12131011161714151A1B18191E1F1C1D
```
All of the keys found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000000D8 0x00000000000000090000000000000026787878787878 15:57:03
0x47000000434241404F4E4D4C4B4A494857565554535251505F5E5D5C5B5A595867666564636261606F6E6D6C6B6A696877767574737271707F7E7D7C7B7A797807060504030201000F0E0D0C0B0A090817161514131211101F1E1D1C1B1A1918
0x22000000262724252A2B28292E2F2C2D32333031363734353A3B38393E3F3C3D02030001060704050A0B08090E0F0C0D12131011161714151A1B18191E1F1C1D
```
Reviewed By: hx235
Differential Revision: D53048519
Pulled By: jaykorean
fbshipit-source-id: a6217905464c5f460a222e2b883bdff47b9dd9c7
Summary:
with release notes for 8.11.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D52926051
Pulled By: pdillinger
fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4
Summary:
Currently, we treat the long-running whitebox_crash_test as passing. However, we were not cleaning up after ourselves when we killed the running test for running too long, which often caused out-of-space errors in subsequent tests (e.g., blackbox_crash_test after whitebox_crash_test).
Unless we want to start treating these timeouts as failures and need the DB output for investigation now, we should properly clean up the tmp dir.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12248
Test Plan:
```
$> make crash_test -j
```
Reviewed By: ajkr
Differential Revision: D52885342
Pulled By: jaykorean
fbshipit-source-id: 7c1f2ca7cf03d0705bb14155ee44d5d7a411c132
Summary:
Add `CompressionOptions` to `CompressedSecondaryCacheOptions` to allow users to set options such as the compression level. It allows performance to be fine-tuned.
Tests -
Run db_bench and verify compression options in the LOG file
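A rough sketch of how this might be configured; the member name for the new compression options (`compression_opts` here) is an assumption based on this description, not a confirmed API:
```
#include <memory>
#include "rocksdb/cache.h"

std::shared_ptr<rocksdb::Cache> MakeTieredBlockCache() {
  rocksdb::CompressedSecondaryCacheOptions sec_opts;
  sec_opts.capacity = 64 << 20;                // compressed RAM tier
  sec_opts.compression_type = rocksdb::kZSTD;
  sec_opts.compression_opts.level = 3;         // assumed member added by this change
  std::shared_ptr<rocksdb::SecondaryCache> secondary =
      rocksdb::NewCompressedSecondaryCache(sec_opts);

  rocksdb::LRUCacheOptions lru_opts;
  lru_opts.capacity = 128 << 20;               // uncompressed primary block cache
  lru_opts.secondary_cache = secondary;
  return rocksdb::NewLRUCache(lru_opts);
}
```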
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234
Reviewed By: ajkr
Differential Revision: D52758133
Pulled By: anand1976
fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6
Summary:
We test LockWAL() and UnlockWAL() by checking that the latest sequence number is not changed: 1a1f9f1660/db_stress_tool/db_stress_test_base.cc (L920-L937). With writeprepared transactions, the sequence number can be advanced in SwitchMemtable::WriteRecoverableState() when writing recoverable state: 1a1f9f1660/db/db_impl/db_impl_write.cc (L1560)
This PR disables LockWAL() tests for writeprepared transaction for now. We probably need to change how we test LockWAL() for writeprepared before re-enabling this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12221
Reviewed By: ajkr
Differential Revision: D52677076
Pulled By: cbi42
fbshipit-source-id: 27ee694878edf63e8f4ad52f769d4db401f511bc
Summary:
Currently, when `block_cache_trace_analyzer` analyzes the cache miss ratio, it only analyzes the total miss ratio.
But it also seems important to analyze the cache miss ratio of each caller. To achieve this, we can calculate and print the miss ratio of each caller in the analyzer.
## Before modification
```
Running for 1 seconds: Processed 85732 records/second. Trace duration 58 seconds. Observed miss ratio 7.97
```
## After modification
```
Running for 1 seconds: Processed 85732 records/second. Trace duration 58 seconds. Observed miss ratio 7.97
Caller Get: Observed miss ratio 6.31
Caller Iterator: Observed miss ratio 11.86
***************************************************************
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10823
Reviewed By: ajkr
Differential Revision: D52632764
Pulled By: hx235
fbshipit-source-id: 40994d6039b73dc38fe78ea1b4adce187bb98909
Summary:
This feature combination is not fully working yet. Disable them so the stress tests have less noise.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12218
Reviewed By: cbi42
Differential Revision: D52643957
Pulled By: jowlyzhang
fbshipit-source-id: 8815a18a3b5814cad4f7ec41f3fb94869302081e
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.
In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
* Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.
Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206
Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200
Reviewed By: akankshamahajan15
Differential Revision: D52551701
Pulled By: pdillinger
fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactoring to make the implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN included files that failed to sync and bytes that failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOptions so BuildTable() does not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOptions::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
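For reference, a sketch of how the new stats could be read programmatically; the histogram enum spellings are taken from the names above and assumed to be members of `rocksdb::Histograms`:
```
#include <cstdio>
#include "rocksdb/options.h"
#include "rocksdb/statistics.h"

void DumpWriteHistograms(const rocksdb::Options& options) {
  if (!options.statistics) {
    return;
  }
  rocksdb::HistogramData sst_write, flush_write;
  options.statistics->histogramData(rocksdb::SST_WRITE_MICROS, &sst_write);
  options.statistics->histogramData(rocksdb::FILE_WRITE_FLUSH_MICROS, &flush_write);
  std::printf("sst.write.micros        P99=%.2f count=%llu\n",
              sst_write.percentile99, (unsigned long long)sst_write.count);
  std::printf("file.write.flush.micros P99=%.2f count=%llu\n",
              flush_write.percentile99, (unsigned long long)flush_write.count);
}
```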
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db open
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer writes
- COUNT is 3X higher because each Append() counts as one post-PR, while pre-PR, 3 Append()s counted as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer writes
- COUNT is 3X higher because each Append() counts as one post-PR, while pre-PR, 3 Append()s counted as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20 ns, regressed 1.85%. "ManualCompaction" includes flush, so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174
Test Plan: Unit test, and some manual verification
Reviewed By: ajkr
Differential Revision: D52451184
Pulled By: anand1976
fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83
Summary:
Currently, some numbers in the `tracer_analyzer_tool` may be a little confusing and unfriendly for people who want to add new query types.
It may be better to replace them with the existing enumeration type to improve readability.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10827
Reviewed By: ajkr
Differential Revision: D40576023
Pulled By: hx235
fbshipit-source-id: 0eb16820a15f365d53e848a3a8efd92928420429
Summary:
I landed https://github.com/facebook/rocksdb/issues/12159 which had the below compiler error when using `-DROCKSDB_NAMESPACE`, which broke the CircleCI "build-linux-static_lib-alt_namespace-status_checked" job:
```
tools/ldb_cmd_test.cc:1213:21: error: 'rocksdb' does not name a type
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^~~~~~~
tools/ldb_cmd_test.cc:1213:35: error: expected unqualified-id before '&' token
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
tools/ldb_cmd_test.cc:1213:35: error: expected ')' before '&' token
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ~ ^
| )
tools/ldb_cmd_test.cc:1213:35: error: expected ';' at end of member declaration
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
| ;
tools/ldb_cmd_test.cc:1213:37: error: 'a' does not name a type
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
...
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12173
Test Plan:
```
$ make clean && make OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" ldb_cmd_test -j56
```
Reviewed By: pdillinger
Differential Revision: D52373797
Pulled By: ajkr
fbshipit-source-id: 8597aaae65a5333831fef66d85072827c5fb1187
Summary:
According to this [Q&A](https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ#:~:text=Q%3A%20If%20I%20use%20non%2Ddefault%20comparators%20or%20merge%20operators%2C%20can%20I%20still%20use%20ldb%20tool%3F), users should be able to use LDB by passing a customized comparator in the options.
When opening the DB to perform ldb commands, there is an exception saying the comparator does not match, even if an options object with a customized comparator is provided. After initializing the column family to open the DB, the `LDBCommand::OverrideBaseCFOptions` method does not update the comparator inside the column family descriptor using the passed-in options. This can cause a mismatch while applying version edits, and in the function `ToggleUDT CompareComparator` it will fail and return an exception saying the comparator does not match.
The proposed fix updates the column family descriptor's options using the user-passed options. A test case is also provided to illustrate the steps.
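For context, a sketch of the kind of custom ldb build this fix is meant to support; `LDBTool::Run` taking an `Options` is the documented way to pass a non-default comparator, and the reverse-bytewise comparator here is just a stand-in for a user's own comparator:
```
#include "rocksdb/comparator.h"
#include "rocksdb/ldb_tool.h"
#include "rocksdb/options.h"

int main(int argc, char** argv) {
  rocksdb::Options options;
  // Stand-in for the user-defined comparator the DB was created with.
  options.comparator = rocksdb::ReverseBytewiseComparator();
  rocksdb::LDBTool tool;
  // With this fix, the comparator from the passed-in options is propagated to
  // the column family descriptors used to open the DB.
  tool.Run(argc, argv, options);
  return 0;
}
```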
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12159
Reviewed By: hx235
Differential Revision: D52267367
Pulled By: ajkr
fbshipit-source-id: c240f93f440e02cb485893de058a46c6dbf9654b
Summary:
**Context/Summary:**
Continued from https://github.com/facebook/rocksdb/pull/12127, we can randomly reduce the # max key to coerce more operations on the same key. My experimental run shows it surfaced more issues than https://github.com/facebook/rocksdb/pull/12127 alone did.
I also randomly reduce the related parameters, write buffer size and target file base, to adapt to the randomly lowered # max key. This creates 4 testing situations, 3 of which are new:
1. **high** # max key with **high** write buffer size and target file base (existing)
2. **high** # max key with **low** write buffer size and target file base (new, will go through some rehearsal testing to ensure we don't run out of space with many files)
3. **low** # max key with **high** write buffer size and target file base (new, keys will stay in memory longer)
4. **low** # max key with **low** write buffer size and target file base (new, experimental runs show it surfaced even more issues)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12148
Test Plan:
- [Ongoing] Rehearsal stress test
- Monitor production stress test
Reviewed By: jaykorean
Differential Revision: D52174980
Pulled By: hx235
fbshipit-source-id: bd5e11280826819ca9314c69bbbf05d481c6d105
Summary:
This PR adds initial stress testing for the user-defined timestamps in memtable only feature. Each flavor of the `*_ts` crash test gets a 1 in 3 chance to run with timestamps not persisted; this setting is initialized once and kept consistent across the following re-runs.
This initial stress test included these things besides disabling incompatible feature combinations to make the test run more stably:
1) It currently only runs test methods that validate db state against expected state, not the ones that validate db state by comparing results from one API to another, such as `TestMultiGet` (compared with `Get`), and similarly `TestMultiGetEntity` and `TestIterate` (which compares a src iterator to a control iterator). Because timestamps are removed, results from one API are not directly comparable to another API's as they are now. More test logic to handle that needs to be added; will do that in a follow-up.
2) Even when comparing db state to expected state, sometimes the db can receive `InvalidArgument` too due to timestamps getting flushed and removed. Added some logic to handle that.
3) When timestamps are not persisted, we don't try to read with an older timestamp, since that makes it easier to get `InvalidArgument`. This capability is not yet needed by our customers, so it's disabled for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12124
Test Plan: ran multiple flavors of this test on continuous runs for some time before check-in
Reviewed By: ltamasi
Differential Revision: D51916267
Pulled By: jowlyzhang
fbshipit-source-id: 3f3eb5f9618d05d296062820e0ef5cb8edc7c2b2
Summary:
**Context/Summary:**
My experimental stress runs with more frequent "xxx_one_in" surfaced a couple of interesting bugs/issues with RocksDB or the crash test framework in the past. We now consider changing the default values so they are run more frequently in the production testing environment.
Increase frequency by 2 orders of magnitude for most parameters, except for error-prone features, e.g., manual compaction and file ingestion (increased by 3 orders), and expensive features, e.g., checksum verification (increased by 1 order)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12127
Test Plan: Monitor CI to see if it surfaces more interesting bugs/issues. If not, we may consider intensifying even more.
Reviewed By: pdillinger
Differential Revision: D51954235
Pulled By: hx235
fbshipit-source-id: 92046cb7c52a37212f19ab7965b40f77b90b08b1
Summary:
This is a simple refactor for the crash test script to put shared logic for parsing stderr into a function. There is no functional change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12109
Test Plan: manually tested the script
Reviewed By: ajkr
Differential Revision: D51692172
Pulled By: jowlyzhang
fbshipit-source-id: d346d64e981d9c489c380ff6ce33296a224b5877
Summary:
Add the option to have a 3-tier block cache (uncompressed RAM, compressed RAM, and local flash) in db_bench, as well as specifying secondary cache admission policy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12104
Reviewed By: ajkr
Differential Revision: D51629092
Pulled By: anand1976
fbshipit-source-id: 6a208f853bc85d3d8b437d91cb1b0142d9a99e53
Summary:
Part of the procedure to handle a manifest IO error is to disable file deletion in case some files in a limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounters such an error contain updates for files, so disabling file deletion is sometimes not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure during recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely protecting the impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantining them from deletion in `ErrorHandler`.
These are the types of files being actively tracked in quarantine in this PR:
1) new table files and blob files from a background job
2) the old manifest file when the creation of the CURRENT file for its immediately following new manifest file ends up in an unclear state. The current handling is not sufficient to make sure the old manifest file is kept in case it's needed.
Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.
We track these files' file numbers because they share the same file number space.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030
Test Plan: Modified existing unit tests
Reviewed By: ajkr
Differential Revision: D51036774
Pulled By: jowlyzhang
fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716
Summary:
Disabling file deletion can be critical for operations like making a backup or recovering from a manifest IO error (for now). Ideally, as long as there is one caller requesting that file deletion stay disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insist on forcing, knowing the consequences of what can potentially be disrupted.
This PR removes the API's default argument value, so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them. When fixing this breakage, it's good to check whether the forcing behavior is indeed needed and whether the potential disruption is OK.
This PR also makes unit tests that do not need the forcing behavior do a regular enable file deletion.
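A sketch of the calling pattern this change encourages (note the public API is spelled `DisableFileDeletions()` / `EnableFileDeletions(bool force)`); passing `force=false` is the cooperative choice unless the disruption is understood:
```
#include "rocksdb/db.h"

// Hypothetical helper: keep file deletions disabled only for the duration of
// an operation such as copying live files for a backup.
rocksdb::Status WithFileDeletionsDisabled(rocksdb::DB* db) {
  rocksdb::Status s = db->DisableFileDeletions();
  if (!s.ok()) {
    return s;
  }
  // ... copy live SST/MANIFEST files here ...
  // force=false: only actually re-enables once every DisableFileDeletions()
  // call has been matched, so other callers are not disrupted.
  return db->EnableFileDeletions(/*force=*/false);
}
```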
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001
Reviewed By: ajkr
Differential Revision: D51214683
Pulled By: jowlyzhang
fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.
It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).
The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true`, causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after it is marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.
This is easily fixed by holding on to a reference until inserted onto the chain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046
Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.
Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)
Reviewed By: jowlyzhang
Differential Revision: D51016979
Pulled By: pdillinger
fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b