rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-12-03 14:52:53 +00:00

Author	SHA1	Message	Date
Andrew Ryan Chang	af2a36d2c7	Record newest_key_time as a table property (#13083 ) Summary: This PR does two things: 1. Adds a new table property `newest_key_time` 2. Uses this property to improve TTL and temperature change compaction. ### Context The current `creation_time` table property should really be named `oldest_ancestor_time`. For flush output files, this is the oldest key time in the file. For compaction output files, this is the minimum among all oldest key times in the input files. The problem with using the oldest ancestor time for TTL compaction is that we may end up dropping files earlier than we should. What we really want is the newest (i.e. "youngest") key time. Right now we take a roundabout way to estimate this value -- we take the value of the _oldest_ key time for the _next_ (newer) SST file. This is also why the current code has checks for `index >= 1`. Our new property `newest_key_time` is set to the file creation time during flushes, and the max over all input files for compactions. There were some additional smaller changes that I had to make for testing purposes: - Refactoring the mock table reader to support specifying my own table properties - Refactoring out a test utility method `GetLevelFileMetadatas` that would otherwise be copy/pasted in 3 places Credit to cbi42 for the problem explanation and proposed solution ### Testing - Added a dedicated unit test to my `newest_key_time` logic in isolation (i.e. are we populating the property on flush and compaction) - Updated the existing unit tests (for TTL/temperate change compaction), which were comprehensive enough to break when I first made my code changes. I removed the test setup code which set the file metadata `oldest_ancestor_time`, so we know we are actually only using the new table property instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13083 Reviewed By: cbi42 Differential Revision: D65298604 Pulled By: archang19 fbshipit-source-id: 898ef91b692ab33f5129a2a16b64ecadd4c32432	2024-11-01 10:08:35 -07:00
Peter Dillinger	9ad772e652	Start version 9.9.0 (#13093 ) Summary: Pull in HISTORY for 9.8.0, update version.h for next version, update check_format_compatible.sh Pull Request resolved: https://github.com/facebook/rocksdb/pull/13093 Test Plan: CI Reviewed By: jowlyzhang Differential Revision: D64987257 Pulled By: pdillinger fbshipit-source-id: a7cec329e3d245e63767760aa0298c08c3281695	2024-10-25 13:47:29 -07:00
Yu Zhang	9c94559de7	Optimize compaction for standalone range deletion files (#13078 ) Summary: This PR adds some optimization for compacting standalone range deletion files. A standalone range deletion file is one with just a single range deletion. Currently, such a file is used in bulk loading to achieve something like atomically delete old version of all data with one big range deletion and adding new version of data. These are the changes included in the PR: 1) When a standalone range deletion file is ingested via bulk loading, it's marked for compaction. 2) When picking input files during compaction picking, we attempt to only pick a standalone range deletion file when oldest snapshot is at or above the file's seqno. To do this, `PickCompaction` API is updated to take existing snapshots as an input. This is only done for the universal compaction + UDT disabled combination, we save querying for existing snapshots and not pass it for all other cases. 3) At `Compaction` construction time, the input files will be filtered to examine if any of them can be skipped for compaction iterator. For example, if all the data of the file is deleted by a standalone range tombstone, and the oldest snapshot is at or above such range tombstone, this file will be filtered out. 4) Every time a snapshot is released, we examine if any column family has standalone range deletion files that becomes eligible to be scheduled for compaction. And schedule one for it. Potential future improvements: - Add some dedicated statistics for the filtered files. - Extend this input filtering to L0 files' compactions cases when a newer L0 file could shadow an older L0 file Pull Request resolved: https://github.com/facebook/rocksdb/pull/13078 Test Plan: Added unit tests and stress tested a few rounds Reviewed By: cbi42 Differential Revision: D64879415 Pulled By: jowlyzhang fbshipit-source-id: 02b8683fddbe11f093bcaa0a38406deb39f44d9e	2024-10-25 09:32:14 -07:00
Levi Tamasi	c0be6a4b90	Support allow_unprepared_value for multi-CF iterators (#13079 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13079 The patch adds support for the new read option `allow_unprepared_value` to the multi-column-family iterators `CoalescingIterator` and `AttributeGroupIterator`. When this option is set, these iterators populate their value (`value()` + `columns()` or `attribute_groups()`) in an on-demand fashion when `PrepareValue()` is called. Calling `PrepareValue()` on the child iterators is similarly deferred until `PrepareValue()` is called on the main iterator. Reviewed By: jowlyzhang Differential Revision: D64570587 fbshipit-source-id: 783c8d408ad10074417dabca7b82c5e1fe5cab36	2024-10-20 20:53:08 -07:00
Changyu Bi	787730c859	Add an ingestion option to not fill block cache (#13067 ) Summary: add `IngestExternalFileOptions::fill_cache` to allow users to ingest files without loading index/filter/data and other blocks into block cache during file ingestion. This can be useful when users are ingesting files into a CF that is not available to readers yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13067 Test Plan: * unit test: `ExternalSSTFileTest.NoBlockCache` * ran one round of crash test with fill_cache disabled: `python3 ./tools/db_crashtest.py --simple blackbox --ops_per_thread=1000000 --interval=30 --ingest_external_file_one_in=200 --level0_stop_writes_trigger=200 --level0_slowdown_writes_trigger=100 --sync_fault_injection=0 --disable_wal=0 --manual_wal_flush_one_in=0` Reviewed By: jowlyzhang Differential Revision: D64356424 Pulled By: cbi42 fbshipit-source-id: b380c26f5987238e1ed7d42ceef0390cfaa0b8e2	2024-10-16 14:11:22 -07:00
Levi Tamasi	ecc084d301	Support the on-demand loading of blobs during iteration (#13069 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13069 Currently, when using range scans with BlobDB, the iterator logic eagerly loads values from blob files when landing on a new entry. This can be wasteful in use cases where the values associated with some keys in the range are not used by the application. The patch introduces a new read option `allow_unprepared_value`; when specified, this option results in the above eager loading getting bypassed. Values needed by the application can be then loaded on an on-demand basis by calling the new iterator API `PrepareValue`. Note that currently, only regular single-CF iterators are supported; multi-CF iterators and transactions will be extended in later PRs. Reviewed By: jowlyzhang Differential Revision: D64360723 fbshipit-source-id: ee55502fa15dcb307a984922b9afc9d9da15d6e1	2024-10-16 12:34:57 -07:00
Peter Dillinger	351d2fd2b6	Make simple BlockBasedTableOptions mutable (#10021 ) Summary: In theory, there should be no danger in mutability, as table builders and readers work from copies of BlockBasedTableOptions. However, there is currently an unresolved read-write race that affecting SetOptions on BBTO fields. This should be generally acceptable for non-pointer options of 64 bits or less, but a fix is needed to make it mutability general here. See https://github.com/facebook/rocksdb/issues/10079 This change systematically sets all of those "simple" options (and future such options) as mutable. (Resurrecting this PR perhaps preferable to proposed https://github.com/facebook/rocksdb/issues/13063) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10021 Test Plan: Some unit test updates. XXX comment added to stress test code Reviewed By: cbi42 Differential Revision: D64360967 Pulled By: pdillinger fbshipit-source-id: ff220fa778331852fe331b42b76ac4adfcd2d760	2024-10-14 17:49:26 -07:00
Levi Tamasi	fbbb08770f	Update HISTORY.md, version.h, and the format compatibility check script for the 9.7 release (#13027 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13027 Reviewed By: jowlyzhang Differential Revision: D63158344 Pulled By: ltamasi fbshipit-source-id: e650a0024155d52c7aa2afd0f242b8363071279a	2024-09-20 19:19:06 -07:00
Peter Dillinger	a1a102ffce	Steps toward deprecating implicit prefix seek, related fixes (#13026 ) Summary: With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly. Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`. Related fixes / changes: * Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation). * Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.) Suggested follow-up: * Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness. * Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek. * Add `prefix_same_as_start` testing to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026 Test Plan: tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while. Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all). Reviewed By: ltamasi Differential Revision: D63147378 Pulled By: pdillinger fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e	2024-09-20 15:54:19 -07:00
anand76	6549b11714	Make Cache a customizable class (#13024 ) Summary: This PR allows a Cache object to be created using the object registry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13024 Reviewed By: pdillinger Differential Revision: D63043233 Pulled By: anand1976 fbshipit-source-id: 5bc3f7c29b35ad62638ff8205451303e2cecea9d	2024-09-20 12:13:19 -07:00
Changyu Bi	c62de54c7c	Record largest seqno in table properties and verify in file ingestion (#12951 ) Summary: this helps to avoid scanning input files when ingesting db generated files: `ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951 Test Plan: * `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion * existing unit tests for other ingestion use cases. Reviewed By: jowlyzhang Differential Revision: D61608285 Pulled By: cbi42 fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca	2024-08-21 16:24:18 -07:00
Jay Huh	5b8f5cbcf4	Update main branch for 9.6 release (#12945 ) Summary: Main branch cut at `defd97bc9`. Updated HISTORY.md, version and format compatibility test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12945 Test Plan: CI Reviewed By: jowlyzhang Differential Revision: D61482149 Pulled By: jaykorean fbshipit-source-id: 4edf7c0a8c6e4df8fcc938bc778dfd02981d0c55	2024-08-19 17:36:23 -07:00
Changyu Bi	defd97bc9d	Add an option to verify memtable key order during reads (#12889 ) Summary: add a new CF option `paranoid_memory_checks` that allows additional data integrity validations during read/scan. Currently, skiplist-based memtable will validate the order of keys visited. Further data validation can be added in different layers. The option will be opt-in due to performance overhead. The motivation for this feature is for services where data correctness is critical and want to detect in-memory corruption earlier. For a corrupted memtable key, this feature can help to detect it during during reads instead of during flush with existing protections (OutputValidator that verifies key order or per kv checksum). See internally linked task for more context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12889 Test Plan: * new unit test added for paranoid_memory_checks=true. * existing unit test for paranoid_memory_checks=false. * enable in stress test. Performance Benchmark: we check for performance regression in read path where data is in memtable only. For each benchmark, the script was run at the same time for main and this PR: * Memtable-only randomread ops/sec: ``` (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000 --seed=1723056275 2>&1 \| grep "readrandom"; done;) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; Main: 608146 PR with paranoid_memory_checks=false: 607727 (- %0.07) PR with paranoid_memory_checks=true: 521889 (-%14.2) ``` * Memtable-only sequential scan ops/sec: ``` (for I in $(seq 1 50); do ./db_bench--benchmarks=fillseq,readseq[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; Main: 9180077 PR with paranoid_memory_checks=false: 9536241 (+%3.8) PR with paranoid_memory_checks=true: 7653934 (-%16.6) ``` * Memtable-only reverse scan ops/sec: ``` (for I in $(seq 1 20); do ./db_bench --benchmarks=fillseq,readreverse[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; Main: 1285719 PR with integrity_checks=false: 1431626 (+%11.3) PR with integrity_checks=true: 811031 (-%36.9) ``` The `readrandom` benchmark shows no regression. The scanning benchmarks show improvement that I can't explain. Reviewed By: pdillinger Differential Revision: D60414267 Pulled By: cbi42 fbshipit-source-id: a70b0cbeea131f1a249a5f78f9dc3a62dacfaa91	2024-08-19 13:53:25 -07:00
Yu Zhang	295326b6ee	Best efforts recovery recover seqno prefix (#12938 ) Summary: This PR make best efforts recovery more permissive by allowing it to recover incomplete Version that presents a valid point in time view from the user's perspective. Currently, a Version is only valid and saved if all files consisting that Version can be found. With this change, if only a suffix of L0 files (and their associated blob files) are missing, a valid Version is also available to be saved and recover to. Note that we don't do this if the column family was atomically flushed. Because atomic flush also need a consistent view across the column families, we cannot guarantee that if we are recovering to incomplete version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12938 Test Plan: Existing tests and added unit tests. Reviewed By: anand1976 Differential Revision: D61414381 Pulled By: jowlyzhang fbshipit-source-id: f9b73deb34d35ad696ab42315928b656d586262a	2024-08-16 17:18:54 -07:00
Peter Dillinger	4d3518951a	Option to decouple index and filter partitions (#12939 ) Summary: Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads. However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed. Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB). ## This change Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder. This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931. Suggested follow-up: * Update confusing use of "last key" to refer to "previous key" * Expand unit test coverage with parallel compression and dictionary training * Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939 Test Plan: unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature. ## SST write performance (CPU) Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR ``` -partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1 923691 vs. 924851 (-0.13%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0 921398 vs. 922973 (-0.17%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1 902259 vs. 908756 (-0.71%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 917932 vs. 916901 (+0.60%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 912755 vs. 907298 (+0.60%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1 899754 vs. 892433 (+0.82%) ``` I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations. ## Read performance Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup: ``` (for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 \| tee results-$CS; done) ``` And read ops/s results: ```CSV Cache size MB,After/decoupled/4k,Before/4k,Before/8k 3,15593,15158,12826 6,16295,16693,14134 10,20427,20813,18459 20,27035,26836,27384 30,33250,31810,33846 60,35518,32585,35329 100,36612,31805,35292 300,35780,31492,35481 1000,34145,31551,35411 1100,35219,31380,34302 1200,35060,31037,34322 ``` If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration. Reviewed By: jowlyzhang Differential Revision: D61376772 Pulled By: pdillinger fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f	2024-08-16 15:34:31 -07:00
Changyu Bi	748f74aca3	Update main branch for 9.4 release (#12802 ) Summary: Main branch cut at `e90e9153d5`. Updated HISTORY.md, version and format compatibility test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12802 Reviewed By: ajkr Differential Revision: D58956464 Pulled By: cbi42 fbshipit-source-id: 50d786c145cebf93d1dd554b1b0e26baac3cc88c	2024-06-24 11:53:05 -07:00
Yu Zhang	c73cf7a878	Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760 ) Summary: This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier. A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following: 1) Setting the ratio to be equal to or smaller than 0 disables this collector 2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met. 3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction. For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D58734976 Pulled By: jowlyzhang fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81	2024-06-18 10:51:29 -07:00
Andrew Kryczka	c3ae569792	Update the main branch for the 9.3 release (#12726 ) Summary: Cut the 9.3.fb branch as of 5/17 11:59pm. Also, cherry-picked all bug fixes that have happened since then. Removed their files from unreleased_history/ since those fixes will appear in 9.3.0, so there seems no use repeating them in any later release. Release branch: https://github.com/facebook/rocksdb/tree/9.3.fb Tests: https://github.com/facebook/rocksdb/actions/runs/9342097111 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12726 Reviewed By: ltamasi Differential Revision: D58069263 Pulled By: ajkr fbshipit-source-id: c4f557bc8dbc20ce53021ac7e97a24f930542bf9	2024-06-02 22:10:24 -07:00
anand76	9cc6168c98	Add LDB command and option for follower instances (#12682 ) Summary: Add the `--leader_path` option to specify the directory path of the leader for a follower RocksDB instance. This PR also adds a `count` command to the repl shell. While not specific to followers, it is useful for testing purposes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12682 Reviewed By: jowlyzhang Differential Revision: D57642296 Pulled By: anand1976 fbshipit-source-id: 53767d496ecadc363ff92cd958b8e15a7bf3b151	2024-05-28 23:21:32 -07:00
muthukrishnan24	259f21e695	Add WB, WBWI Create, UpdateTimestamp, Iterator::Refresh in C API (#10529 ) Summary: This PR adds UpdateTimestamp API of WriteBatch and WBWI, create WB, WBWI with all options and Iterator Refresh in C API Pull Request resolved: https://github.com/facebook/rocksdb/pull/10529 Reviewed By: cbi42 Differential Revision: D57826913 Pulled By: ajkr fbshipit-source-id: d2ec840129f61a1d3a5a12e859728be98ebbad2f	2024-05-28 15:36:09 -07:00
Levi Tamasi	c87f5cf91c	Add GetEntityForUpdate to optimistic and WriteCommitted pessimistic transactions (#12668 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12668 The patch adds a new `GetEntityForUpdate` API to optimistic and WriteCommitted pessimistic transactions, which provides transactional wide-column point lookup functionality with concurrency control. For WriteCommitted transactions, user-defined timestamps are also supported similarly to the `GetForUpdate` API. Reviewed By: jaykorean Differential Revision: D57458304 fbshipit-source-id: 7eadbac531ca5446353e494abbd0635d63f62d24	2024-05-20 10:43:05 -07:00
Andrew Kryczka	4eaf628120	Add `Iterator` property "rocksdb.iterator.is-value-pinned" (#12659 ) Summary: `ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests. See also: https://github.com/facebook/rocksdb/issues/12658 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659 Reviewed By: cbi42 Differential Revision: D57391200 Pulled By: ajkr fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2	2024-05-15 19:11:52 -07:00
Levi Tamasi	b92d874c8b	Support MultiGetEntity in optimistic and WriteCommitted pessimistic transactions (#12634 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12634 The patch implements support for the `MultiGetEntity` API in optimistic transactions and pessimistic transactions with the WriteCommitted policy. Similarly to the other wide-column transaction APIs, the implementation leverages the `WriteBatchWithIndex` layer. Reviewed By: jaykorean Differential Revision: D57177638 fbshipit-source-id: 2d9f9f287fc97e7c126830b48d21457c7c35db3f	2024-05-09 16:49:38 -07:00
Levi Tamasi	83d051a8d9	Add release note for GetEntity transaction support (#12625 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12625 Reviewed By: jaykorean Differential Revision: D57059775 fbshipit-source-id: 80b3ddb51d538c6c21b69cd589f4ee8dd13596c9	2024-05-07 11:38:04 -07:00
Levi Tamasi	45c290660a	Add PutEntity support for optimistic and WritePrepared pessimistic transactions (#12606 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12606 The patch extends optimistic transactions and WriteCommitted pessimistic transactions with support for the `PutEntity` API. Similarly to the other APIs, `PutEntity` is available via both the `Transaction` and `TransactionDB` interfaces, where using the latter executes the write in a single-operation transaction as usual. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added in separate PRs. Reviewed By: jaykorean Differential Revision: D56911242 fbshipit-source-id: 57cf8bb6c6b1b40ba4a8a652831c13a617644289	2024-05-06 14:41:00 -07:00
anand76	6349da612b	Update HISTORY.md and version to 9.3.0 (#12601 ) Summary: Update HISTORY.md for 9.2 and version to 9.3. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12601 Reviewed By: jaykorean, jowlyzhang Differential Revision: D56845901 Pulled By: anand1976 fbshipit-source-id: 0d1137a6568e4712be2f8b705f4f7b438217dbed	2024-05-01 16:33:04 -07:00
Jay Huh	4f584652ab	Add an option to wait for purge in WaitForCompact (#12520 ) Summary: Adding an option to wait for purge to complete in `WaitForCompact` API. Internally, RocksDB has a way to wait for purge to complete (e.g. TEST_WaitForPurge() in db_impl_debug.cc), but there's no public API available for gracefully wait for purge to complete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12520 Test Plan: Unit Test Added - `WaitForCompactWithWaitForPurgeOptionTest` ``` ./deletefile_test -- --gtest_filter="WaitForCompactWithWaitForPurgeOptionTest" ``` Existing Tests ``` ./db_compaction_test -- --gtest_filter="WaitForCompactWithOption" ``` Reviewed By: ajkr Differential Revision: D55888283 Pulled By: jaykorean fbshipit-source-id: cfc6d6e8657deaefab8961890b36e390095c9f65	2024-04-17 17:33:27 -07:00
Levi Tamasi	c0aef2a28e	Add MultiGetEntityFromBatchAndDB to WriteBatchWithIndex (#12539 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12539 As a follow-up to https://github.com/facebook/rocksdb/pull/12533, this PR extends `WriteBatchWithIndex` with a `MultiGetEntityFromBatchAndDB` API that enables users to perform batched wide-column point lookups with read-your-own-writes consistency. This API transparently combines data from the indexed write batch and the underlying database as needed and presents the results in the form of a wide-column entity. Reviewed By: jaykorean Differential Revision: D56153145 fbshipit-source-id: 537967051b7521bb41b04070ac1a78a1d8873c08	2024-04-16 08:58:04 -07:00
Jay Huh	d34712e0ac	MultiCfIterator - AttributeGroupIter Impl & CoalescingIter Optimization (#12534 ) Summary: Continuing from the previous MultiCfIterator Implementations - (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480 #12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`. Lifting `UNDER CONSTRUCTION - DO NOT USE` comment by replacing it with `EXPERIMENTAL` Here are some implementation details: - `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration. - `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap. - `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap. - Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once. - This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534 Test Plan: Uncommented the assertions in `verifyAttributeGroupIterator()` ``` ./multi_cf_iterator_test ``` Reviewed By: ltamasi Differential Revision: D56089019 Pulled By: jaykorean fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054	2024-04-16 08:45:38 -07:00
Levi Tamasi	491c4fb0ed	Add GetEntityFromBatchAndDB to WriteBatchWithIndex (#12533 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12533 The PR extends `WriteBatchWithIndex` with a new wide-column point lookup API `GetEntityFromBatchAndDB`. Similarly to `GetFromBatchAndDB`, the new API can transparently combine data from the write batch with data from the underlying database as needed. Like `DB::GetEntity`, it returns any result in the form of a wide-column entity (i.e. plain key-values are wrapped into an entity with a single anonymous column). Reviewed By: jaykorean Differential Revision: D56069132 fbshipit-source-id: 4f19cdeea4ce136497ce79fc9d28c925de59e220	2024-04-15 09:20:47 -07:00
liuhu	b7f1eeb0ca	Cache dumper exit early due to deadline or max_dumped_size (#12491 ) Summary: In production, we need to control the duration time or max size of cache dumper to get better performance. Fixes https://github.com/facebook/rocksdb/issues/12494 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12491 Reviewed By: hx235 Differential Revision: D55905826 Pulled By: ajkr fbshipit-source-id: 9196a5e852c344d6783f7a8234e997c87215bd19	2024-04-11 21:56:45 -07:00
Yu Zhang	74d419be4d	Add support in SstFileReader to get a raw table iterator (#12385 ) Summary: This PR adds support to programmatically iterate a raw table file with an iterator returned by `SstFileReader::NewTableIterator`. For third party tools to use to observe SST files created by RocksDB. The original feature request was from this merge request: https://github.com/facebook/rocksdb/pull/12370 Since keys returned by raw table iterators are internal keys, this PR also adds a struct `ParsedEntryInfo` and util method `ParseEntry` to support user to parse internal key. `GetInternalKeyForSeek`, and `GetInternalKeyForSeekForPrev` to support users to create internal keys for seek operations with this raw table iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12385 Test Plan: Added unit tests Reviewed By: cbi42 Differential Revision: D55662855 Pulled By: jowlyzhang fbshipit-source-id: 0716a173ee95924fbd4e1f9b6cccf06525c40049	2024-04-02 21:23:06 -07:00
akankshamahajan	1856734821	Branch cut 9.1.fb (#12476 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12476 Reviewed By: jowlyzhang Differential Revision: D55319508 Pulled By: akankshamahajan15 fbshipit-source-id: 2b6db671e027511282775c0fea155335d8e73cc2	2024-03-25 15:07:43 -07:00
Yu Zhang	13e1c32a18	Follow ups for TimedPut and write time property (#12455 ) Summary: This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including: 1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry. 2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries. 3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob currently copy the DB's seqno to time mapping while holding db mutex and only copies the part of interest, a.k.a, the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to preferred seqno as much as possible. Test plans: Added unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455 Reviewed By: pdillinger Differential Revision: D55165422 Pulled By: jowlyzhang fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47	2024-03-21 10:00:15 -07:00
anand76	4868c10b44	Retry block reads on checksum mismatch (#12427 ) Summary: On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`. Tests: Add new unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427 Reviewed By: ajkr Differential Revision: D55025941 Pulled By: anand1976 fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e	2024-03-18 16:16:05 -07:00
Andrew Kryczka	3f5bd46a07	Add `ContinueCallback` to `GetMergeOperands()` (#12438 ) Summary: The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because: - Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge - The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438 Reviewed By: hx235 Differential Revision: D54914669 Pulled By: ajkr fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a	2024-03-15 12:25:49 -07:00
Levi Tamasi	7c290f72b8	Implement WriteBatchWithIndex::GetEntityFromBatch (#12424 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12424 The PR adds a wide-column point lookup API `GetEntityFromBatch` to `WriteBatchWithIndex`. Similarly to APIs like `DB::GetEntity`, this new API returns wide-column entities as-is, and wraps plain values in an entity with a single column (the anonymous default column). Also, similarly to `WriteBatchWithIndex::GetFromBatch`, it only reads data from the batch itself. Reviewed By: jaykorean Differential Revision: D54826535 fbshipit-source-id: 92604f3ebd90fe1afbd36f2d2194b7dee0011efa	2024-03-14 10:45:49 -07:00
Peter Dillinger	a53ed91691	Fix/improve temperature handling for file ingestion (#12402 ) Summary: Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388 In terms of public API: * Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre. * Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general). * Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature. * Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost" In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed. Eventual suggested follow-up: * Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare). * More temperature handling for CreateColumnFamilyWithImport and Checkpoints. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402 Test Plan: Deeply revamped ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer. Used macros instead of functions for better tracing to critical source location on test failures. Some enhancements to FileTemperatureTestFS in the process of developing the revamped test. Reviewed By: jowlyzhang Differential Revision: D54442794 Pulled By: pdillinger fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995	2024-03-05 16:56:08 -08:00
Yu Zhang	f1ca47b904	Add support to bulk load external files for UDT in memtable only feature (#12356 ) Summary: This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files to a column family that enables user-defined timestamps in Memtable only feature. Two fields in the table properties are used for aformentioned check: 1) the comparator name, it records what comparator is used to create this external sst file, 2) the flag `user_defined_timestamps_persisted`. We compare these two fields with the column family's settings. The details are in util function `ValidateUserDefinedTimestampsOptions`. To optimize for the majority of the cases where sanity check should pass and the table properties read should not affect how `TableReader` is constructed, instead of read the table properties block separately and use it for sanity check before creating a `TableReader`. We continue using the current flow to first create a `TableReader`, use it for reading table properties and do sanity checks, and reset the`TableReader` for the case where the column family enables UDTs in memtable only feature, and the external file does not contain user-defined timestamps. This PR also groups other table properties related sanity check in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356 Test Plan: added unit test existing unit test Reviewed By: cbi42 Differential Revision: D54025116 Pulled By: jowlyzhang fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554	2024-02-21 15:41:53 -08:00
Yu Zhang	31dfc81e18	Start 9.1.0 release (#12360 ) Summary: with release notes for 9.0.fb, format_compatible test update, and version.h update. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360 Test Plan: CI Reviewed By: cbi42 Differential Revision: D53879416 Pulled By: jowlyzhang fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75	2024-02-16 18:26:48 -08:00
Yu Zhang	f405e55cfa	Add support in SstFileWriter to not persist user defined timestamps (#12348 ) Summary: This PR adds support in `SstFileWriter` to create SST files without persisting timestamps when the column family has enabled UDTs in Memtable only feature. The sst files created from flush and compaction do not contain timestamps, we want to make the sst files created by `SstFileWriter` to follow the same pattern and not persist timestamps. This is to prepare for ingesting external SST files for this type of column family. There are timestamp-aware APIs and non timestamp-aware APIs in `SstFileWriter`. The former are exclusively used for when the column family's comparator is timestamp-aware, a.k.a `Comparator::timestamp_size() > 0`, while the latter are exclusively used for the column family's comparator is non timestamp-aware, a.k.a `Comparator::timestamp_size() == 0`. There are sanity checks to make sure these APIs are correctly used. In this PR, the APIs usage continue with above enforcement, where even though timestamps are not eventually persisted, users are still asked to use only the timestamp-aware APIs. But because data points will logically all have minimum timestamps, we don't allow multiple versions of the same user key (without timestamp) to be added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12348 Test Plan: Added unit tests Manual inspection of generated sst files with `sst_dump` Reviewed By: ltamasi Differential Revision: D53732667 Pulled By: jowlyzhang fbshipit-source-id: e43beba0d3a1736b94ee5c617163a6280efd65b7	2024-02-13 20:30:07 -08:00
Peter Dillinger	58d55b7f4e	Mark wal_compression feature as production-ready (#12336 ) Summary: (as title) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12336 Test Plan: in use at Meta for a large service; in crash test Reviewed By: hx235 Differential Revision: D53537628 Pulled By: pdillinger fbshipit-source-id: 69e7ac9ab7b59b928d1144105667a7fde8a55a5a	2024-02-07 15:06:04 -08:00
Yu Zhang	e3e8fbb497	Add a separate range classes for internal usage (#12071 ) Summary: Introduce some different range classes `UserKeyRange` and `UserKeyRangePtr` to be used by internal implementation. The `Range` class is used in both public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc and internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables`, `VersionSet::GetPropertiesOfTablesInRange`. These APIs have different expectations of what keys this range class contain. Public API users are supposed to populate the range with the user keys without timestamp, in the same way that point lookup and range scan APIs' key input only expect the user key without timestamp. The internal APIs implementation expect a user key whose format is compatible with the user comparator, a.k.a a user key with the timestamp. This PR contains: 1) introducing counterpart range class `UserKeyRange` `UserKeyRangePtr` for internal implementation while leave the existing `Range` and `RangePtr` class only for public APIs. Internal implementations are updated to use this new class instead. 2) add user-defined timestamp support for `DB::GetPropertiesOfTablesInRange` API and `DeleteFilesInRanges` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071 Test Plan: existing tests Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled. The change in external_file_ingestion_job doesn't have a user-defined timestamp enabled test case coverage, will add one in a follow up PR that adds file ingestion support for UDT. Reviewed By: ltamasi Differential Revision: D53292608 Pulled By: jowlyzhang fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8	2024-02-06 18:35:36 -08:00
akankshamahajan	b9cb7b9644	Provide support for FSBuffer for point lookups (#12266 ) Summary: Provide support for FSBuffer for point lookups It also add support for compaction and scan reads that goes through BlockFetcher when readahead/prefetching is not enabled. Some of the compaction/Scan reads goes through FilePrefetchBuffer and some through BlockFetcher. This PR add support to use underlying file system scratch buffer for reads that go through BlockFetcher as for FilePrefetch reads, design is complicated to support this feature. Design - In order to use underlying FileSystem provided scratch for Reads, it uses MultiRead with 1 request instead of Read API which required API change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12266 Test Plan: Stress test using underlying file system scratch buffer internally. Reviewed By: anand1976 Differential Revision: D53019089 Pulled By: akankshamahajan15 fbshipit-source-id: 4fe3d090d77363320e4b67186fd4d51c005c0961	2024-01-29 15:08:20 -08:00
Peter Dillinger	800cfae987	Start 9.0.0 release (#12256 ) Summary: with release notes for 8.11.fb, format_compatible test update, and version.h update. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256 Test Plan: CI Reviewed By: cbi42 Differential Revision: D52926051 Pulled By: pdillinger fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4	2024-01-20 08:38:20 -08:00
Hui Xiao	06e593376c	Group SST write in flush, compaction and db open with new stats (#11910 ) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH\|COMPACTION\|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH\|COMPACTION\|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre\|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff	2023-12-29 15:29:23 -08:00
Levi Tamasi	81765866c4	Update HISTORY/version/format compatibility script for the 8.10 release (#12154 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12154 Reviewed By: jaykorean, akankshamahajan15 Differential Revision: D52216271 Pulled By: ltamasi fbshipit-source-id: 13bab72802eeec8f6e3544be9ebcd7f725a64d2e	2023-12-15 14:44:23 -08:00
akankshamahajan	c77b50a4fd	Add AsyncIO support for tuning readahead_size by block cache lookup (#11936 ) Summary: Add support for tuning of readahead_size by block cache lookup for async_io. Design/ Implementation - BlockBasedTableIterator.cc - `BlockCacheLookupForReadAheadSize` callback API lookups in the block cache and tries to reduce the start and end offset passed. This function looks into the block cache for the blocks between `start_offset` and `end_offset` and add all the handles in the queue. It then iterates from the end in the handles to find first miss block and update the end offset to that block. It also iterates from the start and find first miss block and update the start offset to that block. ``` _read_curr_block_ argument : True if this call was due to miss in the cache and caller wants to read that block synchronously. False if current call is to prefetch additional data in extra buffers (due to ReadAsync call in FilePrefetchBuffer) ``` In case there is no data to be read in that callback (because of upper_bound or all blocks are in cache), it updates start and end offset to be equal and that `FilePrefetchBuffer` interprets that as 0 length to be read. FilePrefetchBuffer.cc - FilePrefetchBuffer calls the callback - `ReadAheadSizeTuning` and pass the start and end offset to that callback to get updated start and end offset to read based on cache hits/misses. 1. In case of Read calls (when offset passed to FilePrefetchBuffer is on cache miss and that data needs to be read), _read_curr_block_ is passed true. 2. In case of ReadAsync calls, when buffer is all consumed and can go for additional prefetching, the start offset passed is the initial end offset of prev buffer (without any updated offset based on cache hit/miss). Foreg. if following are the data blocks with cache hit/miss and start offset and Read API found miss on DB1 and based on readahead_size (50) it passes end offset to be 50. [DB1 - miss- 0 ] [DB2 - hit -10] [DB3 - miss -20] [DB4 - miss-30] [DB5 - hit-40] [DB6 - hit-50] [DB7 - miss-60] [DB8 - miss - 70] [DB9 - hit - 80] [DB6 - hit 90] - For Read call - updated start offset remains 0 but end offset updates to DB4, as DB5 is in cache. - Read calls saves initial end offset 50 as that was meant to be prefetched. - Now for next ReadAsync call - the start offset will be 50 (previous buffer initial end offset) and based on readahead_size, end offset will be 100 - On callback, because of cache hits - callback will update the start offset to 60 and end offset to 80 to read only 2 data blocks (DB7 and DB8). - And for that ReadAsync call - initial end offset will be set to 100 which will again used by next ReadAsync call as start offset. - `initial_end_offset_` in `BufferInfo` is used to save the initial end offset of that buffer. - If let's say DB5 and DB6 overlaps in 2 buffers (because of alignment), `prev_buf_end_offset` is passed to make sure already prefetched data is not prefetched again in second buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11936 Test Plan: - Ran crash_test several times. - New unit tests added. Reviewed By: anand1976 Differential Revision: D50906217 Pulled By: akankshamahajan15 fbshipit-source-id: 0d75d3c98274e98aa34901b201b8fb05232139cf	2023-12-06 13:48:15 -08:00
Levi Tamasi	2045fe4693	Mention PR 11892 in the changelog (#12118 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12118 Reviewed By: jaykorean Differential Revision: D51820703 fbshipit-source-id: d2a86a4781618747c6b7c71971862d510a25e103	2023-12-04 13:20:28 -08:00
Jay Huh	ddb7df10ef	Update HISTORY.md and version.h for 8.9.fb release (#12074 ) Summary: Creating cut for 8.9 release Pull Request resolved: https://github.com/facebook/rocksdb/pull/12074 Test Plan: CI Reviewed By: ajkr Differential Revision: D51435289 Pulled By: jaykorean fbshipit-source-id: 3918a8250032839e5b71f67f26c8ba01cbc17a41	2023-11-21 18:07:19 -08:00

1 2

86 commits