mirror of https://github.com/facebook/rocksdb.git
1596 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Peter Dillinger | 3c17930ede |
Change default block cache from 8MB to 32MB (#11350)
Summary:
... which increases default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, under stress conditions, block cache mutex contention could become a performance bottleneck. This change of default should alleviate that.
Note that reducing the size of cache shards (recommended minimum 512MB) could cause thrashing, e.g. on filter blocks, so capacity needs to increase to safely increase number of shards.
The 8MB default dates back to 2011 or earlier (
|
|
Andrew Kryczka | 9f8cdc8ad6 |
validate SstFileWriter range tombstones cover positive ranges (#11322)
Summary: As titled. This is the same as https://github.com/facebook/rocksdb/issues/6788 but for range tombstones written through `SstFileWriter` rather than through `DB`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11322 Reviewed By: cbi42 Differential Revision: D44317733 Pulled By: ajkr fbshipit-source-id: f6eb8791ae2c09c169b6bfe0d047449d924b377e |
|
Peter Dillinger | 204fcff751 |
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5 |
|
Peter Dillinger | ccaa3225b0 |
Simplify tracking entries already in SecondaryCache (#11299)
Summary: In preparation for factoring secondary cache support out of individual Cache implementations, we can get rid of the "in secondary cache" flag on entries through a workable hack: when an entry is promoted from secondary, it is inserted in primary using a helper that lacks secondary cache support, thus preventing re-insertion into secondary cache through existing logic. This adds to the complexity of building CacheItemHelpers, because you always have to be able to get to an equivalent helper without secondary cache support, but that complexity is reasonably isolated within RocksDB typed_cache.h and test code. gcc-7 seems to have problems with constexpr constructor referencing `this` so removed constexpr support on CacheItemHelper. Also refactored some related test code to share common code / functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11299 Test Plan: existing tests Reviewed By: anand1976 Differential Revision: D44101453 Pulled By: pdillinger fbshipit-source-id: 7a59d0a3938ee40159c90c3e65d7004f6a272345 |
|
Hui Xiao | bab5f9a6f2 |
Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265)
Summary: **Context/Summary:** We are adding new stats to measure behavior of prefetched tail size and look up into this buffer The stat collection is done in FilePrefetchBuffer but only for prefetched tail buffer during table open for now using FilePrefetchBuffer enum. It's cleaner than the alternative of implementing in upper-level call places of FilePrefetchBuffer for table open. It also has the benefit of extensible to other types of FilePrefetchBuffer if needed. See db bench for perf regression concern. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265 Test Plan: **- Piggyback on existing test** **- rocksdb.table.open.prefetch.tail.miss is harder to UT so I manually set prefetch tail read bytes to be small and run db bench.** ``` ./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 -use_direct_reads=true ``` ``` rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600 rocksdb.table.open.prefetch.tail.miss COUNT : 91 rocksdb.table.open.prefetch.tail.hit COUNT : 1034 ``` **- No perf regression observed in db_bench** SETUP command: create same db with ~900 files for pre-change/post-change. ``` ./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360 -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none ``` TEST command 60 runs or til convergence: as suggested by anand1976 and akankshamahajan15, vary `seek_nexts` and `async_io` in testing. ``` ./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true ``` async io = 0, direct io read = true | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs -- | -- | -- | -- pre-post change | 4776 (± 28) ops/sec; 24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec; 74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec; 75.6 (± 2.2) MB/sec post-change | 4790 (± 32) ops/sec; 24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec; 74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec; 74.5 (± 1.6) MB/sec async io = 1, direct io read = true | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs -- | -- | -- | -- pre-post change | 3350 (± 36) ops/sec; 17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec; 68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec; 71.8 (± 1.0) MB/sec post-change | 3358 (± 27) ops/sec; 17.4 (± 0.1) MB/sec | 263 (± 2) ops/sec; 68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec; 72.6 (± 0.6) MB/sec Reviewed By: ajkr Differential Revision: D43781467 Pulled By: hx235 fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a |
|
Levi Tamasi | 49881921cd |
Rename a recently added PerfContext counter (#11294)
Summary: The patch renames the counter added in https://github.com/facebook/rocksdb/issues/11284 for better consistency with the existing naming scheme. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11294 Test Plan: `make check` Reviewed By: jowlyzhang Differential Revision: D44035964 Pulled By: ltamasi fbshipit-source-id: 8b1a2a03ee728148365367e0ecc1fcf462f62191 |
|
Levi Tamasi | 1d52438504 |
Add a PerfContext counter for merge operands applied in point lookups (#11284)
Summary: The existing PerfContext counter `internal_merge_count` only tracks the Merge operands applied during range scans. The patch adds a new counter called `internal_merge_count_point_lookups` to track the same metric for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and also fixes a couple of cases in the iterator where the existing counter wasn't updated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284 Test Plan: `make check` Reviewed By: jowlyzhang Differential Revision: D43926082 Pulled By: ltamasi fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9 |
|
Peter Dillinger | e01073252b |
Tests verifying non-zero checksums of zero bytes (#11260)
Summary: Adds unit tests verifying that a block payload and checksum of all zeros is not falsely considered valid data. The test exhaustively checks that for blocks up to some length (default 20K, more exhaustively 10M) of all zeros do not produce a block checksum of all zeros. Also small refactoring of an existing checksum test to use parameterized test. (Suggest hiding whitespace changes for review.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11260 Test Plan: this is the test, manual run with `ROCKSDB_THOROUGH_CHECKSUM_TEST=1` to verify up to 10M. Reviewed By: hx235 Differential Revision: D43706192 Pulled By: pdillinger fbshipit-source-id: 95e721c320ca928e7fa2400c2570fb359cc30b1f |
|
Changyu Bi | d053926fa2 |
Improve documentation for MergingIterator (#11161)
Summary: Add some comments to try to explain how/why MergingIterator works. Made some small refactoring, mostly in MergingIterator::SkipNextDeleted() and MergingIterator::SeekImpl(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/11161 Test Plan: crash test with small key range: ``` python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=6000 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10 ``` Reviewed By: ajkr Differential Revision: D42860994 Pulled By: cbi42 fbshipit-source-id: 3f0c1c9c6481a7f468bf79d823998907a8116e9e |
|
Levi Tamasi | 3c9eed688e |
Enable moving a string or PinnableSlice into PinnableWideColumns (#11248)
Summary: This makes it possible to eliminate some copies in `GetEntity` / `MultiGetEntity`, in particular when `Merge`s or blobs are involved. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11248 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D43544215 Pulled By: ltamasi fbshipit-source-id: bc4c8955a24bbd8bc4ab098e72133ead757f9707 |
|
Changyu Bi | 229297d1b8 |
Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113)
Summary: A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction outputs is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)). The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113 Test Plan: * added unit test in db_range_del_test * crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10` Reviewed By: ajkr Differential Revision: D42655709 Pulled By: cbi42 fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c |
|
mrambacher | b6640c3117 |
Remove FactoryFunc from LoadXXXObject (#11203)
Summary: The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available. With the removal of LITE mode, the function was no longer required. Note that the MergeOperator had some private classes defined in header files. To gain access to their constructors (and name methods), the class definitions were moved into header files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203 Reviewed By: cbi42 Differential Revision: D43160255 Pulled By: pdillinger fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c |
|
Andrew Kryczka | 25e1365227 |
Merge operator failed subcode (#11231)
Summary: From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions. This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231 Test Plan: updated unit test Reviewed By: akankshamahajan15 Differential Revision: D43396607 Pulled By: ajkr fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9 |
|
Andrew Kryczka | 6aef1a05d6 |
Use CacheDependencies() at start of ApproximateKeyAnchors() (#11230)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11230 Test Plan: - setup command: `$ ./db_bench -benchmarks=fillrandom,compact -compression_type=none -num=1000000 -write_buffer_size=4194304 -target_file_size_base=4194304 -use_direct_io_for_flush_and_compaction=true -partition_index_and_filters=true -bloom_bits=10 -metadata_block_size=1024` - measure small read count bucketed by size: `$ strace -fye pread64 ./db_bench.ctrl -use_existing_db=true -benchmarks=compact -compaction_readahead_size=4194304 -compression_type=none -num=1000000 -write_buffer_size=4194304 -target_file_size_base=4194304 -use_direct_io_for_flush_and_compaction=true -partition_index_and_filters=true -bloom_bits=10 -metadata_block_size=1024 -subcompactions=4 -cache_size=1048576000 2>&1 >/dev/null | awk '/= [0-9]+$/{print "[", int($NF / 1024), "KB,", int(1 + $NF / 1024), "KB)"}' | sort -n -k 2 | uniq -c | head -3` - before: ``` 1119 [ 0 KB, 1 KB) 1 [ 6 KB, 7 KB) 2 [ 7 KB, 8 KB) ``` - after: ``` 242 [ 0 KB, 1 KB) 1 [ 6 KB, 7 KB) 2 [ 7 KB, 8 KB) ``` Reviewed By: pdillinger Differential Revision: D43388507 Pulled By: ajkr fbshipit-source-id: a02413c9f615b00784700646825a9870ee10f3a7 |
|
Levi Tamasi | 9794acb597 |
Add a new MultiGetEntity API (#11222)
Summary: The new `MultiGetEntity` API can be used to get a consistent view of a batch of keys, with the results presented as wide-column entities. Similarly to `GetEntity` and the iterator's `columns` API, if the entry corresponding to the key is a wide-column entity to start with, it is returned as-is, and if it is a plain key-value, it is wrapped into an entity with a single default column. Implementation-wise, the new API shares the logic of the batched `MultiGet` API (via the `MultiGetCommon` methods). Both single-CF and multi-CF `MultiGetEntity` APIs are provided, and blobs are also supported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11222 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D43256950 Pulled By: ltamasi fbshipit-source-id: 47fb2cb7e2d0470e3580f43fdb2fe9e51f0e7005 |
|
Wentian Guo | 42d6652ba2 |
remove dependency on options.h for port_posix.h andport_win.h (#11214)
Summary: The files in `port/`, such as `port_posix.h`, are layering over the system libraries, so shouldn't include the DB-specific files like `options.h`. This PR remove this dependency. # How The reason that `port_posix.h` (or `port_win.h`) include `options.h` is to use `CpuPriority`, as there is a method `SetCpuPriority()` in `port_posix.h` that uses `CpuPriority.` - I think `SetCpuPriority()` make sense to exist in `port_posix.h` as it provides has platform-dependent implementation - `CpuPriority` enum is defined in `env.h`, but used in `rocksdb/include` and `port/`. Hence, let us define `CpuPriority` enum in a common file, say `port_defs.h`, such that both directories `rocksdb/include` and `port/` can include. When we remove this dependency, some other files have compile errors because they can't find definitions, so add header files to resolve # Test make all check -j Pull Request resolved: https://github.com/facebook/rocksdb/pull/11214 Reviewed By: pdillinger Differential Revision: D43196910 Pulled By: guowentian fbshipit-source-id: 70deccb72844cfb08fcc994f76c6ef6df5d55ab9 |
|
Peter Dillinger | 3cacd4b4ec |
Put Cache and CacheWrapper in new public header (#11192)
Summary: The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment. In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544 Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it. Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192 Test Plan: `make check` Reviewed By: anand1976 Differential Revision: D43055487 Pulled By: pdillinger fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110 |
|
Hui Xiao | 6650ca244e |
Remove a couple deprecated convenience.h APIs (#11120)
Summary: **Context/Summary:** As instructed by convenience.h comments, a few deprecated APIs are removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11120 Test Plan: - make check & CI - eyeball check on test semantics. Reviewed By: pdillinger Differential Revision: D42937507 Pulled By: hx235 fbshipit-source-id: a9e4709387da01b1d0e9148c2e210f02e9746ee1 |
|
Yu Zhang | 24ac53d81a |
Use user key on sst file for blob verification for Get and MultiGet (#11105)
Summary: Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller. Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105 Test Plan: make V=1 db_blob_basic_test ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*" Reviewed By: ltamasi Differential Revision: D42716487 Pulled By: jowlyzhang fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a |
|
sdong | 4720ba4391 |
Remove RocksDB LITE (#11147)
Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2 |
|
Karim TAAM | a1e92bd956 |
use verify checksum option in block based table reader Open() (#11099)
Summary: ## Description In this issue https://github.com/facebook/rocksdb/issues/11002 we found that when we use rocksdb with the `verify checksum` read_option to false the verification is done anyway By analyzing the code along the stacktrace I saw that at the level of https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129f we are not keeping all the options and we forget the `verify_checksum` the comment in this class suggests that it should be managed https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fL581 <img width="1724" alt="204511641-86ab4b9b-45e5-4a2b-a13d-81fa26435d38" src="https://user-images.githubusercontent.com/26581503/213152802-c46bc1c7-a3a2-4a6f-9bb1-bf92ee93af7a.png"> this PR just adds the line to manage the `verify checksum` ## Tests - Running unit tests - Test without setting `verify checksum` and verifying that we are calling the checksum code - Test by setting `verify checksum` to true and verifying that we are calling the checksum code - Test by setting `verify checksum` to false and verifying that we are **not** calling the checksum code Pull Request resolved: https://github.com/facebook/rocksdb/pull/11099 Reviewed By: cbi42 Differential Revision: D42679881 Pulled By: ajkr fbshipit-source-id: c7dd10768282fd0699f7e1bf397ceb7adbea4ab6 |
|
Heiko Becker | 88edfbfb5e |
Fix build with gcc 13 by including <cstdint> (#11118)
Summary: Like other versions before, gcc 13 moved some includes around and as a result <cstdint> is no longer transitively included [1]. Explicitly include it for uint{32,64}_t. [1] https://gcc.gnu.org/gcc-13/porting_to.html#header-dep-changes Pull Request resolved: https://github.com/facebook/rocksdb/pull/11118 Reviewed By: cbi42 Differential Revision: D42711356 Pulled By: ajkr fbshipit-source-id: 5ea257b85b7017f40fd8fdbce965336da95c55b2 |
|
sdong | 2800aa069a |
Remove compressed block cache (#11117)
Summary: Compressed block cache is replaced by compressed secondary cache. Remove the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117 Test Plan: See CI passes Reviewed By: pdillinger Differential Revision: D42700164 fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d |
|
Andrew Kryczka | b7fbcefda8 |
Add API to limit blast radius of merge operator failure (#11092)
Summary: Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage. To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092 Reviewed By: siying Differential Revision: D42525673 Pulled By: ajkr fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee |
|
leipeng | a5bcbcd8be |
remove unused InternalIteratorBase::is_mutable_ (#11104)
Summary: `InternalIteratorBase::is_mutable_` is not used any more, remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11104 Reviewed By: ajkr Differential Revision: D42582747 Pulled By: cbi42 fbshipit-source-id: d30bf75151fc8414df0ae112a6ec4943b5b7330b |
|
Changyu Bi | f515d9d203 |
Revert #10802 Consider range tombstone in compaction output file cutting (#11089)
Summary:
This reverts commit
|
|
Peter Dillinger | 9f7801c5f1 |
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432 |
|
anand76 | bec4264813 |
Avoid mixing sync and async prefetch (#11050)
Summary: Reading uncompression dict block always uses sync reads, while data blocks may use async reads and prefetching. This causes problems in FilePrefetchBuffer. So avoid mixing the two by reading the uncompression dict straight from the file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11050 Test Plan: Crash test Reviewed By: akankshamahajan15 Differential Revision: D42194682 Pulled By: anand1976 fbshipit-source-id: aaa8b396fdfe966b157e210f5ef8501c45b7b69e |
|
Changyu Bi | f02c708aa3 |
Consider range tombstone in compaction output file cutting (#10802)
Summary: This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802 Test Plan: - added unit tests in db_range_del_test. - stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000` Reviewed By: ajkr, jay-zhuang Differential Revision: D40308827 Pulled By: cbi42 fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df |
|
Arvid Lunnemark | 00238a386b |
replace sprintf with its safe version snprintf (v2) (#11011)
Summary: same motivations as https://github.com/facebook/rocksdb/pull/5475, applied to the last remaining `sprintf`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11011 Reviewed By: pdillinger Differential Revision: D41673500 Pulled By: ajkr fbshipit-source-id: 88618ea791cafad86a9a491799c45979d46e3544 |
|
Peter Dillinger | 433d7e4594 |
Improve error messages for SST footer and size errors (#11009)
Summary: Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure: * Magic number error is reported first if we attempt to open an SST file and the footer is completely bad. * Footer errors are reported with affected file. * If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009 Test Plan: unit tests added, some manual Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed). Reviewed By: siying Differential Revision: D41656272 Pulled By: pdillinger fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df |
|
Changyu Bi | 6cdb7af9f8 |
Remove copying of range tombstones keys in iterator (#10878)
Summary: In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding InternalKey comparator that compares `Slice` for internal key and `ParsedInternalKey` directly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmarks: I did not get improvement when compiling with DEBUG_LEVEL=0, and saw many noise. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see improvement. ``` # Favorable set up: half of the writes are DeleteRange. TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50 # benchmark command TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25 # main readseq [AVG 5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec # this PR readseq [AVG 5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec ``` Reviewed By: ajkr Differential Revision: D40711170 Pulled By: cbi42 fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f |
|
Changyu Bi | 534fb06dd3 |
Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966)
Summary: Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966 Test Plan: - added unit test - stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100` - ran different stress tests over sandcastle - Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones. Reviewed By: ajkr Differential Revision: D41414172 Pulled By: cbi42 fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981 |
|
Peter Dillinger | f321e8fc98 |
Don't attempt to use SecondaryCache on block_cache_compressed (#10944)
Summary: Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom! This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed. Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache. This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944 Test Plan: test added that fails on base revision (at least with ASAN) Reviewed By: akankshamahajan15 Differential Revision: D41205578 Pulled By: pdillinger fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f |
|
Levi Tamasi | 2ea109521f |
Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932)
Summary: The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)` a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge` overload, which makes the responsibilities clearer by making sure the query result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in `SaveValue` and `GetContext`. Second, it changes the interface of `TimedFullMergeWithEntity` so it exposes its result in a serialized form; this is a more decoupled design which will come in handy when adding support for `Merge` with wide-column entities to `DBIter`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D41129399 Pulled By: ltamasi fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677 |
|
Levi Tamasi | fbd9077d66 |
Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925)
Summary: The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D41096453 Pulled By: ltamasi fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8 |
|
Levi Tamasi | 941d834739 |
Support Merge for wide-column entities during point lookups (#10916)
Summary: The patch adds `Merge` support for wide-column entities to the point lookup APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to update the iterator and compaction logic in separate PRs.) In terms of semantics, the `Merge` operation is applied to the default (anonymous) column; any other columns in the entity are unaffected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D40962311 Pulled By: ltamasi fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373 |
|
Yanqin Jin | 7d26e4c5a3 |
Basic Support for Merge with user-defined timestamp (#10819)
Summary: This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled. Simplest usage: ```cpp // assume string append merge op is used with '.' as delimiter. // ts1 < ts2 db->Put(WriteOptions(), "key", ts1, "v0"); db->Merge(WriteOptions(), "key", ts2, "1"); ReadOptions ro; ro.timestamp = &ts2; db->Get(ro, "key", &value); ASSERT_EQ("v0.1", value); ``` Some code comments are added for clarity. Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40603195 Pulled By: riversand963 fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013 |
|
Yanqin Jin | 9079895aae |
Fix deletion counting in memtable stats (#10886)
Summary: Currently, a memtable's stats `num_deletes_` is incremented only if the entry is a regular delete (kTypeDeletion). We need to fix it by accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40740754 Pulled By: riversand963 fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1 |
|
Jay Zhuang | 36f5e19e33 |
Fix a Windows build error (#10897)
Summary: The for loop is marked as unreachable code because it will never call the increment. Switch it to `if`. ``` \table\merging_iterator.cc(823): error C2220: the following warning is treated as an error \table\merging_iterator.cc(823): warning C4702: unreachable code \table\merging_iterator.cc(1030): error C2220: the following warning is treated as an error \table\merging_iterator.cc(1030): warning C4702: unreachable code ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10897 Reviewed By: cbi42 Differential Revision: D40811790 Pulled By: jay-zhuang fbshipit-source-id: fe8fd3e7cf3d6f710360c402b79763854d5120df |
|
Levi Tamasi | 7867a1112b |
Handle Merges correctly in GetEntity (#10894)
Summary: The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet supported for wide-column entities written using `PutEntity`; this change is about returning correct (i.e. consistent with `Get`) results in cases like when the base value is a plain old key-value written using `Put` or when there is no real base value because we hit either a tombstone or the beginning of history. Implementation-wise, the patch introduces a new wrapper around the existing `MergeHelper::TimedFullMerge` that can store the merge result in either a string (for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D40782708 Pulled By: ltamasi fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90 |
|
Changyu Bi | 56715350d9 |
Reduce heap operations for range tombstone keys in iterator (#10877)
Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` # Set up: --writes_per_range_tombstone=1 means one point write and one delete range TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 # Benchmark TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true # Pre PR readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec # Post PR readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022 |
|
sdong | 3e686c7cbe |
sst_dump --command=raw to add index offset information (#10873)
Summary: Add some extra information in outputs of "sst_dump --command=raw" to help debug some issues. Right now, encoded block handle is printed out. It is more useful to directly print out offset and size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10873 Test Plan: Manually run it against a file and check the output. Reviewed By: anand1976 Differential Revision: D40742289 fbshipit-source-id: 04d7de26e7f27e1595a7cc3ac1c1082e4e835b93 |
|
anand76 | 727bad78b8 |
Format files under table/ by clang-format (#10852)
Summary: Run clang-format on files under the `table` directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10852 Reviewed By: ajkr Differential Revision: D40650732 Pulled By: anand1976 fbshipit-source-id: 2023a958e37fd6274040c5181130284600c9e0ef |
|
Changyu Bi | 7a95938899 |
Improve FragmentTombstones() speed by lazily initializing `seq_set_` (#10848)
Summary: FragmentedRangeTombstoneList has a member variable `seq_set_` that contains the sequence numbers of all range tombstones in a set. The set is constructed in `FragmentTombstones()` and is used only in `FragmentedRangeTombstoneList::ContainsRange()` which only happens during compaction. This PR moves the initialization of `seq_set_` to `FragmentedRangeTombstoneList::ContainsRange()`. This should speed up `FragmentTombstones()` when the range tombstone list is used for read/scan requests. Microbench shows the speed improvement to be ~45%. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10848 Test Plan: - Existing tests and stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5`. - Microbench: update `range_del_aggregator_bench` to benchmark speed of `FragmentTombstones()`: ``` ./range_del_aggregator_bench --num_range_tombstones=1000 --tombstone_start_upper_bound=50000000 --num_runs=10000 --tombstone_width_mean=200 --should_deletes_per_run=100 --use_compaction_range_del_aggregator=true Before this PR: ========================= Fragment Tombstones: 270.286 us AddTombstones: 1.28933 us ShouldDelete (first): 0.525528 us ShouldDelete (rest): 0.0797519 us After this PR: time to fragment tombstones is pushed to AddTombstones() which only happen during compaction. ========================= Fragment Tombstones: 149.879 us AddTombstones: 102.131 us ShouldDelete (first): 0.565871 us ShouldDelete (rest): 0.0729444 us ``` - db_bench: this should improve speed for fragmenting range tombstones for mutable memtable: ``` ./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=250000 --disable_auto_compactions --max_num_range_tombstones=100000 --finish_after_writes --write_buffer_size=1073741824 --threads=25 Before this PR: readwhilewriting : 18.301 micros/op 1310445 ops/sec 4.769 seconds 6250000 operations; 28.1 MB/s (41001 of 250000 found) After this PR: readwhilewriting : 16.943 micros/op 1439376 ops/sec 4.342 seconds 6250000 operations; 23.8 MB/s (28977 of 250000 found) ``` Reviewed By: ajkr Differential Revision: D40646227 Pulled By: cbi42 fbshipit-source-id: ea471667edb258f67d01cfd828588e80a89e4083 |
|
Changyu Bi | deb6a24be2 |
Remove range tombstone test code from sst_file_reader (#10847)
Summary: `#include "db/range_tombstone_fragmenter.h"` seems to break some internal test for 7.8 release. I'm removing it from sst_file_reader.h for now to unblock release. This should be fine as it is only used in a unit test for DeleteRange with timestamp. In addition, it does not seem to be useful to support delete range for sst file writer, since the range tombstone won't cover any key (its sequence number is 0). So maybe we can remove it in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10847 Test Plan: CI. Reviewed By: akankshamahajan15 Differential Revision: D40620865 Pulled By: cbi42 fbshipit-source-id: be44b2f31e062bff87ed1b8d94482c3f7eaa370c |
|
Peter Dillinger | 27c9705ac4 |
Use kXXH3 as default checksum (CPU efficiency) (#10778)
Summary: Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware. A current DB performance comparison using buck+clang build: ``` TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal ``` kXXH3 (+0.2% DB write throughput): `fillseq [AVG 1000 runs] : 822149 (± 1004) ops/sec; 91.0 (± 0.1) MB/sec` kCRC32c: `fillseq [AVG 1000 runs] : 820484 (± 1203) ops/sec; 90.8 (± 0.1) MB/sec` Micro benchmark comparison: ``` ./db_bench --benchmarks=xxh3[-X20],crc32c[-X20] ``` Machine 1, buck+clang build: `xxh3 [AVG 20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec` `crc32c [AVG 20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec` Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0: `xxh3 [AVG 20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec` `crc32c [AVG 20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778 Test Plan: make check, unit tests updated Reviewed By: ajkr Differential Revision: D40112510 Pulled By: pdillinger fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37 |
|
akankshamahajan | 0e7b27bfcf |
Refactor block cache tracing APIs (#10811)
Summary: Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter. `DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`. New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer. This same philosophy can be applied to KV and IO tracing as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811 Test Plan: existing unit tests Old API DB::StartBlockTrace checked with db_bench tool create database ``` ./db_bench --benchmarks="fillseq" \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 ``` To trace block cache accesses when running readrandom benchmark: ``` ./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \ --threads=16 \ -block_cache_trace_file="/tmp/binary_trace_test_example" \ -block_cache_trace_max_trace_file_size_in_bytes=1073741824 \ -block_cache_trace_sampling_frequency=1 ``` Reviewed By: anand1976 Differential Revision: D40435289 Pulled By: akankshamahajan15 fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282 |
|
Peter Dillinger | 7555243bcf |
Refactor ShardedCache for more sharing, static polymorphism (#10801)
Summary: The motivations for this change include * Free up space in ClockHandle so that we can add data for secondary cache handling while still keeping within single cache line (64 byte) size. * This change frees up space by eliminating the need for the `hash` field by making the fixed-size key itself a hash, using a 128-bit bijective (lossless) hash. * Generally more customizability of ShardedCache (such as hashing) without worrying about virtual call overheads * ShardedCache now uses static polymorphism (template) instead of dynamic polymorphism (virtual overrides) for the CacheShard. No obvious performance benefit is seen from the change (as mostly expected; most calls to virtual functions in CacheShard could already be optimized to static calls), but offers more flexibility without incurring the runtime cost of adhering to a common interface (without type parameters or static callbacks). * You'll also notice less `reinterpret_cast`ing and other boilerplate in the Cache implementations, as this can go in ShardedCache. More detail: * Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>` copies (extra refcount) when LRUCache can be in charge of keeping a `shared_ptr`. * Renamed `capacity_mutex_` to `config_mutex_` to better represent the scope of what it guards. * Some preparation for 64-bit hash and indexing in LRUCache, but didn't include the full change because of slight performance regression. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10801 Test Plan: Unit test updates were non-trivial because of major changes to the ClockCacheShard interface in handling of key vs. hash. Performance: Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16` Test with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16 ``` Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec` After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec` So possibly ~0.1% improvement. And with `-cache_type=hyper_clock_cache`: Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec` After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec` So roughly 5% improvement! Reviewed By: anand1976 Differential Revision: D40252236 Pulled By: pdillinger fbshipit-source-id: ff8fc70ef569585edc95bcbaaa0386f61355ae5b |
|
Peter Dillinger | e466173d5c |
Print stack traces on frozen tests in CI (#10828)
Summary: Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off. For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828 Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI Reviewed By: hx235 Differential Revision: D40447634 Pulled By: pdillinger fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1 |
|
Changyu Bi | 9f2363f4c4 |
User-defined timestamp support for `DeleteRange()` (#10661)
Summary: Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps. - Garbage collection of range tombstones and range tombstone covered keys during compaction. - Get()/MultiGet() to return the timestamp of a range tombstone when needed. - Get/Iterator with range tombstones bounded by readoptions.timestamp. - timestamp crash test now issues DeleteRange by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661 Test Plan: - Added unit test: `make check` - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case. | micros/op | fillrandom | seekrandom | | --- | --- | --- | |main| 2.58 |10.96| |PR 10661| 2.68 |10.63| Reviewed By: riversand963 Differential Revision: D39441192 Pulled By: cbi42 fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2 |
|
Jay Zhuang | f3cc66632b |
Align compaction output file boundaries to the next level ones (#10655)
Summary: Try to align the compaction output file boundaries to the next level ones (grandparent level), to reduce the level compaction write-amplification. In level compaction, there are "wasted" data at the beginning and end of the output level files. Align the file boundary can avoid such "wasted" compaction. With this PR, it tries to align the non-bottommost level file boundaries to its next level ones. It may cut file when the file size is large enough (at least 50% of target_file_size) and not too large (2x target_file_size). db_bench shows about 12.56% compaction reduction: ``` TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432 # baseline: Flush(GB): cumulative 25.882, interval 7.216 Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds # with this change: Flush(GB): cumulative 25.882, interval 7.753 Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds ``` The compaction simulator shows a similar result (14% with 100G random data). As a side effect, with this PR, the SST file size can exceed the target_file_size, but is capped at 2x target_file_size. And there will be smaller files. Here are file size statistics when loading 100GB with the target file size 32MB: ``` baseline this_PR count 1.656000e+03 1.705000e+03 mean 3.116062e+07 3.028076e+07 std 7.145242e+06 8.046139e+06 ``` The feature is enabled by default, to revert to the old behavior disable it with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false` Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for use case like user adding 2 or more non-overlapping data range at the same time, it can reduce the overlapping of 2 datasets in the lower levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655 Reviewed By: cbi42 Differential Revision: D39552321 Pulled By: jay-zhuang fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2 |
|
Yanqin Jin | 07249fea8f |
Fix DBImpl::GetLatestSequenceForKey() for Merge (#10724)
Summary: Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724 Test Plan: make check Reviewed By: cbi42 Differential Revision: D39756847 Pulled By: riversand963 fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876 |
|
Peter Dillinger | ef443cead4 |
Refactor to avoid confusing "raw block" (#10408)
Summary: We have a lot of confusing code because of mixed, sometimes completely opposite uses of of the term "raw block" or "raw contents", sometimes within the same source file. For example, in `BlockBasedTableBuilder`, `raw_block_contents` and `raw_size` generally referred to uncompressed block contents and size, while `WriteRawBlock` referred to writing a block that is already compressed if it is going to be. Meanwhile, in `BlockBasedTable`, `raw_block_contents` either referred to a (maybe compressed) block with trailer, or a maybe compressed block maybe without trailer. (Note: left as follow-up work to use C++ typing to better sort out the various kinds of BlockContents.) This change primarily tries to apply some consistent terminology around the kinds of block representations, avoiding the unclear "raw". (Any meaning of "raw" assumes some bias toward the storage layer or toward the logical data layer.) Preferred terminology: * **Serialized block** - bytes that go into storage. For block-based table (usually the case) this includes the block trailer. WART: block `size` may or may not include the trailer; need to be clear about whether it does or not. * **Maybe compressed block** - like a serialized block, but without the trailer (or no promise of including a trailer). Must be accompanied by a CompressionType. * **Uncompressed block** - "payload" bytes that are either stored with no compression, used as input to compression function, or result of decompression function. * **Parsed block** - an in-memory form of a block in block cache, as it is used by the table reader. Different C++ types are used depending on the block type (see block_like_traits.h). Other refactorings: * Misc corrections/improvements of internal API comments * Remove a few misleading / unhelpful / redundant comments. * Use move semantics in some places to simplify contracts * Use better parameter names to indicate which parameters are used for outputs * Remove some extraneous `extern` * Various clean-ups to `CacheDumperImpl` (mostly unnecessary code) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408 Test Plan: existing tests Reviewed By: akankshamahajan15 Differential Revision: D38172617 Pulled By: pdillinger fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c |
|
anand76 | fb9a025892 |
Fix platform 10 build with folly (#10708)
Summary: Change the library order in PLATFORM_LDFLAGS to enable fbcode platform 10 build with folly. This PR also has a few fixes for platform 10 compiler errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10708 Test Plan: ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_COROUTINES=1 make -j64 check ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_FOLLY=1 make -j64 check Reviewed By: ajkr Differential Revision: D39666590 Pulled By: anand1976 fbshipit-source-id: 256a1127ef561399cd6299a6a392ca29bd68ca44 |
|
Changyu Bi | 749b849a34 |
Fix memtable-only iterator regression (#10705)
Summary: when there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap memtable iterator directly. Currently we create a merging iterator on top of the memtable iterator, and have DBIter wrap around it. This causes iterator regression and this PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705 Test Plan: - `make check` - Performance: - Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000` - Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts` ``` seek_nexts main op/sec https://github.com/facebook/rocksdb/issues/10705 RocksDB v7.6 0 5746568 5749033 5786180 30 2411690 3006466 2837699 1000 102556 128902 124667 ``` Reviewed By: ajkr Differential Revision: D39644221 Pulled By: cbi42 fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d |
|
Akanksha Mahajan | bd2ad2f9a0 |
Fix stress test failure for async_io (#10660)
Summary: Sanitize initial_auto_readahead_size if its greater than max_auto_readahead_size in case of async_io Pull Request resolved: https://github.com/facebook/rocksdb/pull/10660 Test Plan: Ran db_stress with intitial_auto_readahead_size greater than max_auto_readahead_size. Reviewed By: anand1976 Differential Revision: D39408095 Pulled By: akankshamahajan15 fbshipit-source-id: 07f933242f636cfbc7ccf042e0c8b959a8ec5f3a |
|
Peter Dillinger | 6de7081cf3 |
Always verify SST unique IDs on SST file open (#10532)
Summary: Although we've been tracking SST unique IDs in the DB manifest unconditionally, checking has been opt-in and with an extra pass at DB::Open time. This changes the behavior of `verify_sst_unique_id_in_manifest` to check unique ID against manifest every time an SST file is opened through table cache (normal DB operations), replacing the explicit pass over files at DB::Open time. This change also enables the option by default and removes the "EXPERIMENTAL" designation. One possible criticism is that the option no longer ensures the integrity of a DB at Open time. This is far from an all-or-nothing issue. Verifying the IDs of all SST files hardly ensures all the data in the DB is readable. (VerifyChecksum is supposed to do that.) Also, with max_open_files=-1 (default, extremely common), all SST files are opened at DB::Open time anyway. Implementation details: * `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass that is now removed. * Unit tests that manipulate/corrupt table properties have to opt out of this check, because that corrupts the "actual" unique id. (And even for testing we don't currently have a mechanism to set "no unique id" in the in-memory file metadata for new files.) * A lot of other unit test churn relates to (a) default checking on, and (b) checking on SST open even without DB::Open (e.g. on flush) * Use `FileMetaData` for more `TableCache` operations (in place of `FileDescriptor`) so that we have access to the unique_id whenever we might need to open an SST file. **There is the possibility of performance impact because we can no longer use the more localized `fd` part of an `FdWithKeyRange` but instead follow the `file_metadata` pointer. However, this change (possible regression) is only done for `GetMemoryUsageByTableReaders`.** * Removed a completely unnecessary constructor overload of `TableReaderOptions` Possible follow-up: * Verification only happens when opening through table cache. Are there more places where this should happen? * Improve error message when there is a file size mismatch vs. manifest (FIXME added in the appropriate place). * I'm not sure there's a justification for `FileDescriptor` to be distinct from `FileMetaData`. * I'm skeptical that `FdWithKeyRange` really still makes sense for optimizing some data locality by duplicating some data in memory, but I could be wrong. * An unnecessary overload of NewTableReader was recently added, in the public API nonetheless (though unusable there). It should be cleaned up to put most things under `TableReaderOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532 Test Plan: updated unit tests Performance test showing no significant difference (just noise I think): `./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000` Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec Reviewed By: jay-zhuang Differential Revision: D38765551 Pulled By: pdillinger fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2 |
|
Changyu Bi | 30bc495c03 |
Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449)
Summary: Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and check with `RangeDelAggregator`. With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator: - in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys. - in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L. This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail. One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap is covered by the range tombstone's internal key range `[start, end)`. Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file are iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advance the file iterator once the sentinel key is processed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449 Test Plan: - Added many unit tests in db_range_del_test - Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2` - Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913. ``` python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1 ``` - Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written". As expected, the performance with this PR depends on the range tombstone width. ``` # Setup: TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000 TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50 # Scan entire DB TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true # Short range scan (10 Next()) TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true # Long range scan(1000 Next()) TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true ``` Avg over of 10 runs (some slower tests had fews runs): For the first column (tombstone), 0 means no range tombstone, 100-10000 means width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1 tombstone case is to test regression when there's very few range tombstones in the DB, as no range tombstone is likely to take a different code path than with range tombstones. - Scan entire DB | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% | | ------------- | ------------- | ------------- | ------------- | | 0 range tombstone |2525600 (± 43564) |2486917 (± 33698) |-1.53% | | 100 |1853835 (± 24736) |2073884 (± 32176) |+11.87% | | 1000 |422415 (± 7466) |1115801 (± 22781) |+164.15% | | 10000 |22384 (± 227) |227919 (± 6647) |+918.22% | | 1 range tombstone |2176540 (± 39050) |2434954 (± 24563) |+11.87% | - Short range scan | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% | | ------------- | ------------- | ------------- | ------------- | | 0 range tombstone |35398 (± 533) |35338 (± 569) |-0.17% | | 100 |28276 (± 664) |31684 (± 331) |+12.05% | | 1000 |7637 (± 77) |25422 (± 277) |+232.88% | | 10000 |1367 |28667 |+1997.07% | | 1 range tombstone |32618 (± 581) |32748 (± 506) |+0.4% | - Long range scan | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% | | ------------- | ------------- | ------------- | ------------- | | 0 range tombstone |2262 (± 33) |2353 (± 20) |+4.02% | | 100 |1696 (± 26) |1926 (± 18) |+13.56% | | 1000 |410 (± 6) |1255 (± 29) |+206.1% | | 10000 |25 |414 |+1556.0% | | 1 range tombstone |1957 (± 30) |2185 (± 44) |+11.65% | - Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61 Reviewed By: ajkr Differential Revision: D38450331 Pulled By: cbi42 fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca |
|
Akanksha Mahajan | 4cd16d65ae |
Add new option num_file_reads_for_auto_readahead in BlockBasedTableOptions (#10556)
Summary: RocksDB does auto-readahead for iterators on noticing more than two reads for a table file if user doesn't provide readahead_size and reads are sequential. A new option num_file_reads_for_auto_readahead is added which can be configured and indicates after how many sequential reads prefetching should be start. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10556 Test Plan: Existing and new unit test Reviewed By: anand1976 Differential Revision: D38947147 Pulled By: akankshamahajan15 fbshipit-source-id: c9eeab495f84a8df7f701c42f04894e46440ad97 |
|
anand76 | 35cdd3e71e |
MultiGet async IO across multiple levels (#10535)
Summary: This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535 Test Plan: 1. Ensure existing MultiGet unit tests pass, updating them as necessary 2. New unit tests - TODO 3. Run stress test - TODO No noticeable regression (<1%) without async IO - Without PR: `multireadrandom : 7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations; 571.6 MB/s (8168992 of 8168992 found)` With PR: `multireadrandom : 7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations; 568.2 MB/s (8271992 of 8271992 found)` For a fully cached DB, but with async IO option on, no regression observed (<1%) - Without PR: `multireadrandom : 5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations; 797.9 MB/s (11540992 of 11540992 found) ` With PR: `multireadrandom : 5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations; 790.7 MB/s (11649992 of 11649992 found) ` Reviewed By: akankshamahajan15 Differential Revision: D38774009 Pulled By: anand1976 fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007 |
|
Levi Tamasi | 81388b36e0 |
Add support for wide-column point lookups (#10540)
Summary: The patch adds a new API `GetEntity` that can be used to perform wide-column point lookups. It also extends the `Get` code path and the `MemTable` / `MemTableList` and `Version` / `GetContext` logic accordingly so that wide-column entities can be served from both memtables and SSTs. If the result of a lookup is a wide-column entity (`kTypeWideColumnEntity`), it is passed to the application in deserialized form; if it is a plain old key-value (`kTypeValue`), it is presented as a wide-column entity with a single default (anonymous) column. (In contrast, regular `Get` returns plain old key-values as-is, and returns the value of the default column for wide-column entities, see https://github.com/facebook/rocksdb/issues/10483 .) The result of `GetEntity` is a self-contained `PinnableWideColumns` object. `PinnableWideColumns` contains a `PinnableSlice`, which either stores the underlying data in its own buffer or holds on to a cache handle. It also contains a `WideColumns` instance, which indexes the contents of the `PinnableSlice`, so applications can access the values of columns efficiently. There are several pieces of functionality which are currently not supported for wide-column entities: there is currently no `MultiGetEntity` or wide-column iterator; also, `Merge` and `GetMergeOperands` are not supported, and there is no `GetEntity` implementation for read-only and secondary instances. We plan to implement these in future PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D38847474 Pulled By: ltamasi fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b |
|
anand76 | 2553d1efa1 |
Revert "Avoid dynamic memory allocation on read path (#10453)" (#10541)
Summary:
This reverts commit
|
|
Gang Liao | 275cd80cdb |
Add a blob-specific cache priority (#10461)
Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461 Reviewed By: siying Differential Revision: D38672823 Pulled By: ltamasi fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743 |
|
Peter Dillinger | 86a1e3e0e7 |
Derive cache keys from SST unique IDs (#10394)
Summary: ... so that cache keys can be derived from DB manifest data before reading the file from storage--so that every part of the file can potentially go in a persistent cache. See updated comments in cache_key.cc for technical details. Importantly, the new cache key encoding uses some fancy but efficient math to pack data into the cache key without depending on the sizes of the various pieces. This simplifies some existing code creating cache keys, like cache warming before the file size is known. This should provide us an essentially permanent mapping between SST unique IDs and base cache keys, with the ability to "upgrade" SST unique IDs (and thus cache keys) with new SST format_versions. These cache keys are of similar, perhaps indistinguishable quality to the previous generation. Before this change (see "corrected" days between collision): ``` ./cache_bench -stress_cache_key -sck_keep_bits=43 18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected) ``` After this change (keep 43 bits, up through 50, to validate "trajectory" is ok on "corrected" days between collision): ``` 19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected) 16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected) 15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected) 15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected) 15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected) 15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected) 15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected) 15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected) ``` However, the change does modify (probably weaken) the "guaranteed unique" promise from this > SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86 to this (see https://github.com/facebook/rocksdb/issues/10388) > With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits I don't think this is a practical concern, though. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394 Test Plan: unit tests updated, see simulation results above Reviewed By: jay-zhuang Differential Revision: D38667529 Pulled By: pdillinger fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba |
|
sdong | 9277569ba3 |
Add some missing headers (#10519)
Summary: Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519 Reviewed By: jay-zhuang Differential Revision: D38603291 fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1 |
|
sdong | 911c0208b9 |
WritableFileWriter tries to skip operations after failure (#10489)
Summary: A flag in WritableFileWriter is introduced to remember error has happened. Subsequent operations will fail with an assertion. Those operations, except Close() are not supposed to be called anyway. This change will help catch bug in tests and stress tests and limit damage of a potential bug of continue writing to a file after a failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489 Test Plan: Fix existing unit tests and watch crash tests for a while. Reviewed By: anand1976 Differential Revision: D38473277 fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366 |
|
Levi Tamasi | 24bcab7d5d |
Make queries return the value of the default column for wide-column entities (#10483)
Summary: The patch adds support for wide-column entities to the existing query APIs (`Get`, `MultiGet`, and iterator). Namely, when during a query a wide-column entity is encountered, we will return the value of the default (anonymous) column as the result. Later, we plan to add wide-column specific query APIs which will enable retrieving entire wide-column entities or a subset of their columns. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10483 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D38441881 Pulled By: ltamasi fbshipit-source-id: 6444e79a31aff2470e866698e3a97985bc2b3543 |
|
Jay Zhuang | 0d885e80d4 |
Avoid dynamic memory allocation on read path (#10453)
Summary: lambda function dynamicly allocates memory from heap if it needs to capture multiple values, which could be expensive. Switch to explictly use local functor from stack. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10453 Test Plan: CI db_bench shows ~2-3% read improvement: ``` # before the change TEST_TMPDIR=/tmp/dbbench4 ./db_bench_main --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000 readrandom : 8.528 micros/op 117265 ops/sec 85.277 seconds 10000000 operations; 13.0 MB/s (10000000 of 10000000 found) # after the change TEST_TMPDIR=/tmp/dbbench5 ./db_bench_new --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000 readrandom : 8.263 micros/op 121015 ops/sec 82.634 seconds 10000000 operations; 13.4 MB/s (10000000 of 10000000 found) ``` details: https://gist.github.com/jay-zhuang/5ac0628db8fc9cbcb499e056d4cb5918 Micro-benchmark shows a similar improvement ~1-2%: before the change: https://gist.github.com/jay-zhuang/9dc0ebf51bbfbf4af82f6193d43cf75b after the change: https://gist.github.com/jay-zhuang/fc061f1813cd8f441109ad0b0fe7c185 Reviewed By: ajkr Differential Revision: D38345056 Pulled By: jay-zhuang fbshipit-source-id: f3597aeeee338a804d37bf2e81386d5a100665e0 |
|
Changyu Bi | 9d77bf8f7b |
Fragment memtable range tombstone in the write path (#10380)
Summary: - Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact. - db_bench is updated to print out the number of range deletions executed if there is any. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380 Test Plan: - CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed. - Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable. ``` single thread: ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100 multi_thread ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100 ``` Commit |
|
anand76 | bf4532eb5c |
Break TableReader MultiGet into filter and lookup stages (#10432)
Summary: This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432 Reviewed By: akankshamahajan15 Differential Revision: D38245910 Pulled By: anand1976 fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de |
|
anand76 | 54aebb2cc5 |
Fix cache metrics update when secondary cache is used (#10440)
Summary: If a secondary cache is configured, its possible that a cache lookup will get a hit in the secondary cache. In that case, the ```LRUCacheShard::Lookup``` doesn't immediately update the ```total_charge``` for the item handle if the ```wait``` parameter is false (i.e caller will call later to check the completeness). However, ```BlockBasedTable::GetEntryFromCache``` assumes the handle is complete and calls ```UpdateCacheHitMetrics```, which checks the usage of the cache item and fails the assert in https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h#L237 (```assert(total_charge >= meta_charge)```). To fix this, we call ```UpdateCacheHitMetrics``` later in ```MultiGet```, after waiting for all cache lookup completions. Test plan - Run crash test with changes from https://github.com/facebook/rocksdb/issues/10160 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10440 Reviewed By: gitbw95 Differential Revision: D38283968 Pulled By: anand1976 fbshipit-source-id: 31c54ef43517726c6e5fdda81899b364241dd7e1 |
|
Bo Wang | 1aab5b32ad |
Update passing rate_limiter_priority for a PartitionedFilterBlockReader function to FS (#10438)
Summary: Add param rate_limiter_parameter in PartitionedFilterBlockReader::GetFilterPartitionBlock . Pull Request resolved: https://github.com/facebook/rocksdb/pull/10438 Test Plan: Unit Tests. Reviewed By: anand1976 Differential Revision: D38266395 Pulled By: gitbw95 fbshipit-source-id: 3ed062a3b43d6df323371cb0d266f7fe869e9ad2 |
|
Peter Dillinger | 65036e4217 |
Revert "Add a blob-specific cache priority (#10309)" (#10434)
Summary:
This reverts commit
|
|
Gang Liao | 8d178090be |
Add a blob-specific cache priority (#10309)
Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309 Reviewed By: ltamasi Differential Revision: D38211655 Pulled By: gangliao fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5 |
|
sdong | 252bea405e |
Improve SubCompaction Partitioning (#10393)
Summary: Unit tests still haven't been fixed. Also need to add more tests. But I ran some simple fillrandom db_bench and the partitioning feels reasonable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393 Test Plan: 1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable; 2. Add a new unit test to ApproximateKeyAnchors() 3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly. Reviewed By: jay-zhuang Differential Revision: D38043783 fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f |
|
Gang Liao | 0b6bc101ba |
Charge blob cache usage against the global memory limit (#10321)
Summary: To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321 Reviewed By: ltamasi Differential Revision: D37913590 Pulled By: gangliao fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72 |
|
Jay Zhuang | a3acf2ef87 |
Add seqno to time mapping (#10338)
Summary: Which will be used for tiered storage to preclude hot data from compacting to the cold tier (the last level). Internally, adding seqno to time mapping. A periodic_task is scheduled to record the current_seqno -> current_time in certain cadence. When memtable flush, the mapping informaiton is stored in sstable property. During compaction, the mapping information are merged and get the approximate time of sequence number, which is used to determine if a key is recently inserted or not and preclude it from the last level if it's recently inserted (within the `preclude_last_level_data_seconds`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338 Test Plan: CI Reviewed By: siying Differential Revision: D37810187 Pulled By: jay-zhuang fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f |
|
sdong | c8b20d469d |
Make InternalKeyComparator not configurable (#10342)
Summary: InternalKeyComparator is an internal class which is a simple wrapper of Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizeable. As a side effect, internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator will have an std::vector attached to it, which consumes memory and possible allocation overhead too. We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342 Test Plan: Run existing CI tests and make sure it doesn't fail Reviewed By: riversand963 Differential Revision: D37771351 fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee |
|
sdong | 769b156e65 |
Remove customized naming from InternalKeyComparator (#10343)
Summary: InternalKeyComparator is a thin wrapper around user comparator. Storing a string for name is relatively expensive to this small wrapper for both CPU and memory usage. Try to remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10343 Test Plan: Run existing tests Reviewed By: ajkr Differential Revision: D37772469 fbshipit-source-id: d2d106a8d022193058fd7f6b220108e3d94aca34 |
|
Peter Dillinger | e6c5e0ab9a |
Have Cache use Status::MemoryLimit (#10262)
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added):
|
|
Akanksha Mahajan | 2acbf386a3 |
Provide support for direct_reads with async_io (#10197)
Summary: Provide support for use_direct_reads with async_io. TestPlan: - Updated unit tests - db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420 - db_stress ``` export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1" make crash_test -j ``` - Ran db_bench on previous RocksDB version before any async_io implementation (as there have many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197 Reviewed By: anand1976 Differential Revision: D37255646 Pulled By: akankshamahajan15 fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920 |
|
Changyu Bi | f9cfc6a808 |
Updated NewDataBlockIterator to not fetch compression dict for non-da… (#10310)
Summary: …ta blocks During MyShadow testing, ajkr helped me find out that with partitioned index and dictionary compression enabled, `PartitionedIndexIterator::InitPartitionedIndexBlock()` spent considerable amount of time (1-2% CPU) on fetching uncompression dictionary. Fetching uncompression dict was not needed since the index blocks were not compressed (and even if they were, they use empty dictionary). This should only affect use cases with partitioned index, dictionary compression and without uncompression dictionary pinned. This PR updates NewDataBlockIterator to not fetch uncompression dictionary when it is not for data blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10310 Test Plan: 1. `make check` 2. Perf benchmark: 1.5% (143950 -> 146176) improvement in op/sec for partitioned index + dict compression benchmark. For default config without partitioned index and without dict compression, there is no regression in readrandom perf from multiple runs of db_bench. ``` # Set up for partitioned index with dictionary compression TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -partition_index=true -compression_max_dict_bytes=16384 -compression_zstd_max_train_bytes=1638400 # Pre PR TEST_TMPDIR=/dev/shm ./db_bench_main -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true readrandom [AVG 50 runs] : 143950 (± 1108) ops/sec; 15.9 (± 0.1) MB/sec readrandom [MEDIAN 50 runs] : 144406 ops/sec; 16.0 MB/sec # Post PR TEST_TMPDIR=/dev/shm ./db_bench_opt -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true readrandom [AVG 50 runs] : 146176 (± 1121) ops/sec; 16.2 (± 0.1) MB/sec readrandom [MEDIAN 50 runs] : 146014 ops/sec; 16.2 MB/sec # Set up for no partitioned index and no dictionary compression TEST_TMPDIR=/dev/shm/baseline ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false # Pre PR TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_main --use_existing_db=true "--benchmarks=readrandom[-X50]" readrandom [AVG 50 runs] : 158546 (± 1000) ops/sec; 17.5 (± 0.1) MB/sec readrandom [MEDIAN 50 runs] : 158280 ops/sec; 17.5 MB/sec # Post PR TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_opt --use_existing_db=true "--benchmarks=readrandom[-X50]" readrandom [AVG 50 runs] : 161061 (± 1520) ops/sec; 17.8 (± 0.2) MB/sec readrandom [MEDIAN 50 runs] : 161596 ops/sec; 17.9 MB/sec ``` Reviewed By: ajkr Differential Revision: D37631358 Pulled By: cbi42 fbshipit-source-id: 6ca2665e270e63871968e061ba4a99d3136785d9 |
|
Levi Tamasi | c73d2a9d18 |
Add API for writing wide-column entities (#10242)
Summary: The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds a new API called `PutEntity` that can be used to write a wide-column entity to the database. The new API is added to both `DB` and `WriteBatch`. Note that currently there is no way to retrieve these entities; more precisely, all read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they encounter a wide-column entity that is required to answer a query. Read-side support (as well as other missing functionality like `Merge`, compaction filter, and timestamp support) will be added in later PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D37369748 Pulled By: ltamasi fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d |
|
Bo Wang | 8e63d90ff8 |
Pass rate_limiter_priority through filter block reader functions to FS (#10251)
Summary: With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251 Test Plan: Current unit tests should pass. Reviewed By: pdillinger Differential Revision: D37427667 Pulled By: gitbw95 fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433 |
|
Bo Wang | c073ed7601 |
Fix typo in comments and code (#10233)
Summary: Fix typo in comments and code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233 Test Plan: Existing unit tests should pass. Reviewed By: jay-zhuang, anand1976 Differential Revision: D37356702 Pulled By: gitbw95 fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51 |
|
Peter Dillinger | 84210c9489 |
Add data block hash index to crash test, fix MultiGet issue (#10220)
Summary: There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data block hash index, which was not caught because data block hash index was never added to stress tests. This change fixes both issues. Fixes https://github.com/facebook/rocksdb/issues/10186 I intend to pick this into the 7.4.0 release candidate Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220 Test Plan: Failure quickly reproduces in crash test with kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing the failure with a unit test I believe would be too tricky and fragile to be worthwhile. Reviewed By: anand1976 Differential Revision: D37315647 Pulled By: pdillinger fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba |
|
Peter Dillinger | fff302d989 |
More testing w/prefix extractor, small refactor (#10122)
Summary: There was an interesting code path not covered by testing that is difficult to replicate in a unit test, which is now covered using a sync point. Specifically, the case of table_prefix_extractor == null and !need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which can happen if table reader is open before extractor is registered with global object registry, but is later registered and re-set with SetOptions. (We don't have sufficient testing control over object registry to set that up repeatedly.) Also, this function has been renamed to `PrefixRangeMayMatch` for clarity vs. other functions that are not the same. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122 Test Plan: unit tests expanded Reviewed By: siying Differential Revision: D36944834 Pulled By: pdillinger fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb |
|
Peter Dillinger | 126c223714 |
Remove deprecated block-based filter (#10184)
Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80 |
|
anand76 | a6691d0f65 |
Update stats to help users estimate MultiGet async IO impact (#10182)
Summary: Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements - 1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file 2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182 Reviewed By: akankshamahajan15 Differential Revision: D37213296 Pulled By: anand1976 fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb |
|
Akanksha Mahajan | 8353ae8b27 |
Add few optimizations in async_io for short scans (#10140)
Summary: This PR adds few optimizations for async_io for shorter scans. 1. If async_io is enabled, seek would create FilePrefetchBuffer object to fetch the data asynchronously. However `FilePrefetchbuffer::num_file_reads_` wasn't taken into consideration if it calls Next after Seek and would go for Prefetching. This PR fixes that and Next will go for prefetching only if `FilePrefetchbuffer::num_file_reads_` is greater than 2 along with if blocks are sequential. This scenario is only for implicit auto readahead. 2. For seek, when it calls TryReadFromCacheAsync to poll it makes async call as well because TryReadFromCacheAsync flow wasn't changed. So I updated to return after poll instead of further prefetching any data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140 Test Plan: 1. Added a unit test 2. Ran crash_test with async_io = 1 to make sure nothing crashes. Reviewed By: anand1976 Differential Revision: D37042242 Pulled By: akankshamahajan15 fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda |
|
Gang Liao | cba398df8a |
Add blob cache option in the column family options (#10155)
Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10155 Reviewed By: ltamasi Differential Revision: D37150819 Pulled By: gangliao fbshipit-source-id: b807c7916ea5d411588128f8e22a49f171388fe2 |
|
Hui Xiao | d665afdbf3 |
Account memory of FileMetaData in global memory limit (#9924)
Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625 |
|
Peter Dillinger | d3a3b02134 |
Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128)
Summary: When opening an SST file created using index_type=kHashSearch, the *current* prefix_extractor would be saved, and used with hash index if the *new current* prefix_extractor at query time is compatible with the SST file. This is a problem if the prefix_extractor at SST open time is not compatible but SetOptions later changes (back) to one that is compatible. This change fixes that by using the known compatible (or missing) prefix extractor we save for use with prefix filtering. Detail: I have moved the InternalKeySliceTransform wrapper to avoid some indirection and remove unnecessary fields. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128 Test Plan: expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails before fix and probably covers some other previously uncovered cases. Reviewed By: siying Differential Revision: D36955738 Pulled By: pdillinger fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10 |
|
Peter Dillinger | 4f78f9699b |
Refactor: Add BlockTypes to make them imply C++ type in block cache (#10098)
Summary: We have three related concepts: * BlockType: an internal enum conceptually indicating a type of SST file block * CacheEntryRole: a user-facing enum for categorizing block cache entries, which is also involved in associated cache entries with an appropriate deleter. Can include categories for non-block cache entries (e.g. memory reservations). * TBlocklike: a C++ type for the actual type behind a void* cache entry. We had some existing code ugliness because BlockType did not imply TBlocklike, because of various kinds of "filter" block. This refactoring fixes that with new BlockTypes. More clean-up can come in later work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098 Test Plan: existing tests Reviewed By: akankshamahajan15 Differential Revision: D36897945 Pulled By: pdillinger fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295 |
|
Yanqin Jin | 3e02c6e05a |
Point-lookup returns timestamps of Delete and SingleDelete (#10056)
Summary: If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`, RocksDB will return the timestamps of the point tombstones. Note: DeleteRange is still unsupported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056 Test Plan: make check Reviewed By: ltamasi Differential Revision: D36677956 Pulled By: riversand963 fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae |
|
Hui Xiao | 4bdcc80192 |
Increase ChargeTableReaderTest/ChargeTableReaderTest.Basic error tolerance rate from 1% to 5% (#10113)
Summary: **Context:** https://github.com/facebook/rocksdb/pull/9748 added support to charge table reader memory to block cache. In the test `ChargeTableReaderTest/ChargeTableReaderTest.Basic`, it estimated the table reader memory, calculated the expected number of table reader opened based on this estimation and asserted this number with actual number. The expected number of table reader opened calculated based on estimated table reader memory will not be 100% accurate and should have tolerance for error. It was previously set to 1% and recently encountered an assertion failure that `(opened_table_reader_num) <= (max_table_reader_num_capped_upper_bound), actual: 375 or 376 vs 374` where `opened_table_reader_num` is the actual opened one and `max_table_reader_num_capped_upper_bound` is the estimated opened one (=371 * 1.01). I believe it's safe to increase error tolerance from 1% to 5% hence there is this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10113 Test Plan: - CI again succeeds. Reviewed By: ajkr Differential Revision: D36911556 Pulled By: hx235 fbshipit-source-id: 259687dd77b450fea0f5658a5b567a1d31d4b1f7 |
|
tagliavini | 6c50082654 |
Remove code that only compiles for Visual Studio versions older than 2015 (#10065)
Summary: There are currently some preprocessor checks that assume support for Visual Studio versions older than 2015 (i.e., 0 < _MSC_VER < 1900), although we don't support them any more. We removed all code that only compiles on those older versions, except third-party/ files. The ROCKSDB_NOEXCEPT symbol is now obsolete, since it now always gets replaced by noexcept. We removed it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10065 Reviewed By: pdillinger Differential Revision: D36721901 Pulled By: guidotag fbshipit-source-id: a2892d365ef53cce44a0a7d90dd6b72ee9b5e5f2 |
|
Jay Zhuang | a96a4a2f7b |
Fix ApproximateOffsetOfCompressed test (#10048)
Summary: https://github.com/facebook/rocksdb/issues/9857 introduced new an option `use_zstd_dict_trainer`, which is stored in SST as text, e.g.: ``` ... zstd_max_train_bytes=0; enabled=0;... ``` it increased the sst size a little bit and cause `ApproximateOffsetOfCompressed` test to fail: ``` Value 7053 is not in range [4000, 7050] table/table_test.cc:4019: Failure Value of: Between(c.ApproximateOffsetOf("xyz"), 4000, 7050) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10048 Test Plan: verified the test pass after the change Reviewed By: cbi42 Differential Revision: D36643688 Pulled By: jay-zhuang fbshipit-source-id: bf12d211f6ae71937259ef21b1226bd06e8da717 |
|
Akanksha Mahajan | a479c2c2b2 |
Fix stress test failure "Corruption: checksum mismatch" or "Iterator Diverged" with async_io enabled (#10032)
Summary: In case of non sequential reads with `async_io`, `FilePRefetchBuffer::TryReadFromCacheAsync` can be called for previous blocks with `offset < bufs_[curr_].offset_` which wasn't handled correctly resulting wrong data being returned from buffer. Since `FilePRefetchBuffer::PrefetchAsync` can be called for any data block, it sets `prev_len_` to 0 indicating `FilePRefetchBuffer::TryReadFromCacheAsync` to go for the prefetching even though offset < bufs_[curr_].offset_ This is because async prefetching is always done in second buffer (to avoid mutex) even though curr_ is empty leading to offset < bufs_[curr_].offset_ in some cases. If prev_len_ is non zero then `TryReadFromCacheAsync` returns false if `offset < bufs_[curr_].offset_ && prev_len != 0` indicating reads are not sequential and previous call wasn't PrefetchAsync. - This PR also simplifies `FilePRefetchBuffer::TryReadFromCacheAsync` as it was getting complicated covering different scenarios based on `async_io` enabled/disabled. If `for_compaction` is set true, it now calls `FilePRefetchBufferTryReadFromCache` following synchronous flow as before. Its decided in BlockFetcher.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/10032 Test Plan: 1. export CRASH_TEST_EXT_ARGS=" --async_io=1" make crash_test -j completed successfully locally 2. make crash_test -j completed successfully locally 3. Reran CircleCi mini crashtest job 4 - 5 times. 4. Updated prefetch_test for more coverage. Reviewed By: anand1976 Differential Revision: D36579858 Pulled By: akankshamahajan15 fbshipit-source-id: 0c428d62b45e12e082a83acf533a5e37a584bedf |
|
Akanksha Mahajan | 2db6a4a1d6 |
Seek parallelization (#9994)
Summary: The RocksDB iterator is a hierarchy of iterators. MergingIterator maintains a heap of LevelIterators, one for each L0 file and for each non-zero level. The Seek() operation naturally lends itself to parallelization, as it involves positioning every LevelIterator on the correct data block in the correct SST file. It lookups a level for a target key, to find the first key that's >= the target key. This typically involves reading one data block that is likely to contain the target key, and scan forward to find the first valid key. The forward scan may read more data blocks. In order to find the right data block, the iterator may read some metadata blocks (required for opening a file and searching the index). This flow can be parallelized. Design: Seek will be called two times under async_io option. First seek will send asynchronous request to prefetch the data blocks at each level and second seek will follow the normal flow and in FilePrefetchBuffer::TryReadFromCacheAsync it will wait for the Poll() to get the results and add the iterator to min_heap. - Status::TryAgain is passed down from FilePrefetchBuffer::PrefetchAsync to block_iter_.Status indicating asynchronous request has been submitted. - If for some reason asynchronous request returns error in submitting the request, it will fallback to sequential reading of blocks in one pass. - If the data already exists in prefetch_buffer, it will return the data without prefetching further and it will be treated as single pass of seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9994 Test Plan: - **Run Regressions.** ``` ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216 ``` i) Previous release 7.0 run for normal prefetching with async_io disabled: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.0 Date: Thu Mar 17 13:11:34 2022 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main] seekrandom : 483618.390 micros/op 2 ops/sec; 338.9 MB/s (249 of 249 found) ``` ii) normal prefetching after changes with async_io disable: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Set seed to 1652922591315307 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.3 Date: Wed May 18 18:09:51 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main] seekrandom : 483080.466 micros/op 2 ops/sec 120.287 seconds 249 operations; 340.8 MB/s (249 of 249 found) ``` iii) db_bench with async_io enabled completed succesfully ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1 Set seed to 1652924062021732 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.3 Date: Wed May 18 18:34:22 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main] seekrandom : 553913.576 micros/op 1 ops/sec 120.199 seconds 217 operations; 293.6 MB/s (217 of 217 found) ``` - db_stress with async_io disabled completed succesfully ``` export CRASH_TEST_EXT_ARGS=" --async_io=0" make crash_test -j ``` I**n Progress**: db_stress with async_io is failing and working on debugging/fixing it. Reviewed By: anand1976 Differential Revision: D36459323 Pulled By: akankshamahajan15 fbshipit-source-id: abb1cd944abe712bae3986ae5b16704b3338917c |
|
anand76 | e015206dd6 |
Fix crash due to MultiGet async IO and direct IO (#10024)
Summary: MultiGet with async IO is not officially supported with Posix yet. Avoid a crash by using synchronous MultiRead when direct IO is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10024 Test Plan: Run db_crashtest.py manually Reviewed By: akankshamahajan15 Differential Revision: D36551053 Pulled By: anand1976 fbshipit-source-id: 72190418fa92dd0397e87825df618b12c9bdecda |
|
Changyu Bi | cc23b46da1 |
Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857)
Summary:
An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API, ZDICT_finalizeDictionary(), can improve such a dictionary's effectiveness at low cost. This PR changes how dictionary is created by calling the ZSTD ZDICT_finalizeDictionary() API instead of creating raw content dictionary (when max_dict_buffer_bytes > 0), and pass in all buffered uncompressed data blocks as samples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
Test Plan:
#### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
Set up: change the parameter [here](
|
|
anand76 | 57997ddaaf |
Multi file concurrency in MultiGet using coroutines and async IO (#9968)
Summary: This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code. A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest. TODO: 1. Figure out how to build it in CircleCI (requires some dependencies to be installed) 2. Do some stress testing with coroutines enabled No regression in synchronous MultiGet between this branch and main - ``` ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics ``` Branch - ```multireadrandom : 4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)``` Main - ```multireadrandom : 3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)``` More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file. 1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) - No coroutines - ```multireadrandom : 831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations; 0.6 MB/s (72136 of 72136 found)``` Using coroutines - ```multireadrandom : 318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations; 1.6 MB/s (188248 of 188248 found)``` 2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file - No coroutines - ```multireadrandom : 4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations; 125.7 MB/s (14539384 of 14539384 found)``` Using coroutines - ```multireadrandom : 4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations; 109.4 MB/s (12656176 of 12656176 found)``` 3. Single thread CPU bound workload with ~2 key overlap/file - No coroutines - ```multireadrandom : 3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations; 139.6 MB/s (16140024 of 16140024 found)``` Using coroutines - ```multireadrandom : 4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations; 125.1 MB/s (14472296 of 14472296 found)``` 4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file - No coroutines - ```multireadrandom : 4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ``` Using coroutines - ```multireadrandom : 4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968 Reviewed By: akankshamahajan15 Differential Revision: D36348563 Pulled By: anand1976 fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696 |
|
Peter Dillinger | 280b9f371a |
Fix auto_prefix_mode performance with partitioned filters (#10012)
Summary: Essentially refactored the RangeMayExist implementation in FullFilterBlockReader to FilterBlockReaderCommon so that it applies to partitioned filters as well. (The function is not called for the block-based filter case.) RangeMayExist is essentially a series of checks around a possible PrefixMayExist, and I'm confident those checks should be the same for partitioned as for full filters. (I think it's likely that bugs remain in those checks, but this change is overall a simplifying one.) Added auto_prefix_mode support to db_bench Other small fixes as well Fixes https://github.com/facebook/rocksdb/issues/10003 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10012 Test Plan: Expanded unit test that uses statistics to check for filter optimization, fails without the production code changes here Performance: populate two DBs with ``` TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters ``` Observe no measurable change in non-partitioned performance ``` TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20 ``` Before: seekrandom [AVG 15 runs] : 11798 (± 331) ops/sec After: seekrandom [AVG 15 runs] : 11724 (± 315) ops/sec Observe big improvement with partitioned (also supported by bloom use statistics) ``` TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20 ``` Before: seekrandom [AVG 12 runs] : 2942 (± 57) ops/sec After: seekrandom [AVG 12 runs] : 7489 (± 184) ops/sec Reviewed By: siying Differential Revision: D36469796 Pulled By: pdillinger fbshipit-source-id: bcf1e2a68d347b32adb2b27384f945434e7a266d |
|
Jay Zhuang | c6d326d3d7 |
Track SST unique id in MANIFEST and verify (#9990)
Summary: Start tracking SST unique id in MANIFEST, which is used to verify with SST properties to make sure the SST file is not overwritten or misplaced. A DB option `try_verify_sst_unique_id` is introduced to enable/disable the verification, if enabled, it opens all SST files during DB-open to read the unique_id from table properties (default is false), so it's recommended to use it with `max_open_files = -1` to pre-open the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990 Test Plan: unittests, format-compatible test, mini-crash Reviewed By: anand1976 Differential Revision: D36381863 Pulled By: jay-zhuang fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f |
|
Hui Xiao | dde774db64 |
Mark old reserve* option deprecated (#10016)
Summary: **Context/Summary:** https://github.com/facebook/rocksdb/pull/9926 removed inefficient `reserve*` option API but forgot to mark them deprecated in `block_based_table_type_info` for compatible table format. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10016 Test Plan: build-format-compatible Reviewed By: pdillinger Differential Revision: D36484247 Pulled By: hx235 fbshipit-source-id: c41b90cc99fb7ab7098934052f0af7290b221f98 |
|
gitbw95 | 4da34b97ee |
Set Read rate limiter priority dynamically and pass it to FS (#9996)
Summary: ### Context: Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users. ### Solution User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile. **This PR is for the Read path.** The **Read:** dynamic priority for different state are listed as follows: | State | Normal | Delayed | Stalled | | ----- | ------ | ------- | ------- | | Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER | | Compaction | IO_LOW | IO_USER | IO_USER | | User | User provided | User provided | User provided | We will respect the read_options that the user provided and will not set it. The only sst read for Flush is the verification read in BuildTable(). It claims to be "regard as user read". **Details** 1. Set read_options.rate_limiter_priority dynamically: - User: Do not update the read_options. Use the read_options that the user provided. - Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction(). - Flush: Update read_options in BuildTable(). 2. Pass the rate limiter priority to FSRandomAccessFile functions: - After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions, including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock(). - In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch(). - Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator(). ### Test Plans Add unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996 Reviewed By: anand1976 Differential Revision: D36452483 Pulled By: gitbw95 fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf |
|
Peter Dillinger | 0070680cfd |
Adjust public APIs to prefer 128-bit SST unique ID (#10009)
Summary: 128 bits should suffice almost always and for tracking in manifest. Note that this changes the output of sst_dump --show_properties to only show 128 bits. Also introduces InternalUniqueIdToHumanString for presenting internal IDs for debugging purposes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10009 Test Plan: unit tests updated Reviewed By: jay-zhuang Differential Revision: D36458189 Pulled By: pdillinger fbshipit-source-id: 93ebc4a3b6f9c73ee154383a1f8b291a5d6bbef5 |
|
Hui Xiao | 3573558ec5 |
Rewrite memory-charging feature's option API (#9926)
Summary: **Context:** Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`. Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags. **Summary:** - Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that - Added missing db bench/stress test for some memory charging features - Renamed related test suite to indicate they are under the same theme of memory charging - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926 Test Plan: - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)` - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)` - CI - db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465** 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078** - db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36054712 Pulled By: hx235 fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af |
|
mrambacher | b11ff347b4 |
Use STATIC_AVOID_DESTRUCTION for static objects with non-trivial destructors (#9958)
Summary: Changed the static objects that had non-trivial destructors to use the STATIC_AVOID_DESTRUCTION construct. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9958 Reviewed By: pdillinger Differential Revision: D36442982 Pulled By: mrambacher fbshipit-source-id: 029d47b1374d30d198bfede369a4c0ae7a4eb519 |
|
mrambacher | 204a42ca97 |
Added GetFactoryCount/Names/Types to ObjectRegistry (#9358)
Summary: These methods allow for more thorough testing of the ObjectRegistry and Customizable infrastructure in a simpler manner. With this change, the Customizable tests can now check what factories are registered and attempt to create each of them in a systematic fashion. With this change, I think all of the factories registered with the ObjectRegistry/CreateFromString are now tested via the customizable_test classes. Note that there were a few other minor changes. There was a "posix://*" register with the ObjectRegistry which was missed during the PatternEntry conversion -- these changes found that. The nickname and default names for the FileSystem classes was also inverted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9358 Reviewed By: pdillinger Differential Revision: D33433542 Pulled By: mrambacher fbshipit-source-id: 9a32da74e6620745b4eeffb2712be70eeeadfa7e |
|
sdong | c4cd8e1acc |
Fix a bug handling multiget index I/O error. (#9993)
Summary: In one path of BlockBasedTable::MultiGet(), Next() is directly called after calling Seek() against the index iterator. This might cause crash if an I/O error happens in Seek(). The bug is discovered in crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9993 Test Plan: See existing CI tests pass. Reviewed By: anand1976 Differential Revision: D36381758 fbshipit-source-id: a11e0aa48dcee168c2554c33b532646ffdb68877 |
|
sdong | 736a7b5433 |
Remove own ToString() (#9955)
Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471 |
|
sdong | 49628c9a83 |
Use std::numeric_limits<> (#9954)
Summary: Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954 Test Plan: See CI Runs. Reviewed By: riversand963 Differential Revision: D36173954 fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0 |
|
Xinyu Zeng | 8b74cea7fe |
Reduce comparator objects init cost in BlockIter (#9611)
Summary: This PR solves the problem discussed in https://github.com/facebook/rocksdb/issues/7149. By storing the pointer of InternalKeyComparator as icmp_ in BlockIter, the object size remains the same. And for each call to CompareCurrentKey, there is no need to create Comparator objects. One can use icmp_ directly or use the "user_comparator" from the icmp_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9611 Test Plan: with https://github.com/facebook/rocksdb/issues/9903, ``` $ TEST_TMPDIR=/dev/shm python3.6 ../benchmark/tools/compare.py benchmarks ./db_basic_bench ../rocksdb-pr9611/db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1 --benchmark_repetitions=50 ... Comparing ./db_basic_bench to ../rocksdb-pr9611/db_basic_bench Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ... DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_pvalue 0.0001 0.0001 U Test, Repetitions: 50 vs 50 DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_mean -0.0483 -0.0483 3924 3734 3924 3734 DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_median -0.0713 -0.0713 3971 3687 3970 3687 DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_stddev -0.0342 -0.0344 225 217 225 217 DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_cv +0.0148 +0.0146 0 0 0 0 OVERALL_GEOMEAN -0.0483 -0.0483 0 0 0 0 ``` Reviewed By: akankshamahajan15 Differential Revision: D35882037 Pulled By: ajkr fbshipit-source-id: 9e5337bbad8f1239dff7aa9f6549020d599bfcdf |
|
Peter Dillinger | 9d0cae7104 |
Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899)
Summary: When MultiGet() determines that multiple query keys can be served by examining the same data block in block cache (one Lookup()), each PinnableSlice referring to data in that data block needs to hold on to the block in cache so that they can be released at arbitrary times by the API user. Historically this is accomplished with extra calls to Ref() on the Handle from Lookup(), with each PinnableSlice cleanup calling Release() on the Handle, but this creates extra contention on the block cache for the extra Ref()s and Release()es, especially because they hit the same cache shard repeatedly. In the case of merge operands (possibly more cases?), the problem was compounded by doing an extra Ref()+eventual Release() for each merge operand for a key reusing a block (which could be the same key!), rather than one Ref() per key. (Note: the non-shared case with `biter` was already one per key.) This change optimizes MultiGet not to rely on these extra, contentious Ref()+Release() calls by instead, in the shared block case, wrapping the cache Release() cleanup in a refcounted object referenced by the PinnableSlices, such that after the last wrapped reference is released, the cache entry is Release()ed. Relaxed atomic refcounts should be much faster than mutex-guarded Ref() and Release(), and much less prone to a performance cliff when MultiGet() does a lot of block sharing. Note that I did not use std::shared_ptr, because that would require an extra indirection object (shared_ptr itself new/delete) in order to associate a ref increment/decrement with a Cleanable cleanup entry. (If I assumed it was the size of two pointers, I could do some hackery to make it work without the extra indirection, but that's too fragile.) Some details: * Fixed (removed) extra block cache tracing entries in cases of cache entry reuse in MultiGet, but it's likely that in some other cases traces are missing (XXX comment inserted) * Moved existing implementations for cleanable.h from iterator.cc to new cleanable.cc * Improved API comments on Cleanable * Added a public SharedCleanablePtr class to cleanable.h in case others could benefit from the same pattern (potentially many Cleanables and/or smart pointers referencing a shared Cleanable) * Add a typedef for MultiGetContext::Mask * Some variable renaming for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899 Test Plan: Added unit tests for SharedCleanablePtr. Greatly enhanced ability of existing tests to detect cache use-after-free. * Release PinnableSlices from MultiGet as they are read rather than in bulk (in db_test_util wrapper). * In ASAN build, default to using a trivially small LRUCache for block_cache so that entries are immediately erased when unreferenced. (Updated two tests that depend on caching.) New ASAN testsuite running time seems OK to me. If I introduce a bug into my implementation where we skip the shared cleanups on block reuse, ASAN detects the bug in `db_basic_test *MultiGet*`. If I remove either of the above testing enhancements, the bug is not detected. Consider for follow-up work: manipulate or randomize ordering of PinnableSlice use and release from MultiGet db_test_util wrapper. But in typical cases, natural ordering gives pretty good functional coverage. Performance test: In the extreme (but possible) case of MultiGetting the same or adjacent keys in a batch, throughput can improve by an order of magnitude. `./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200` Before ops/sec, num=5: 1,384,394 Before ops/sec, num=500: 6,423,720 After ops/sec, num=500: 10,658,794 After ops/sec, num=5: 16,027,257 Also note that previously, with high parallelism, having query keys concentrated in a single block was worse than spreading them out a bit. Now concentrated in a single block is faster than spread out, which is hopefully consistent with natural expectation. Random query performance: with num=1000000, over 999 x 10s runs running before & after simultaneously (each -threads=12): Before: multireadrandom [AVG 999 runs] : 1088699 (± 7344) ops/sec; 120.4 (± 0.8 ) MB/sec After: multireadrandom [AVG 999 runs] : 1090402 (± 7230) ops/sec; 120.6 (± 0.8 ) MB/sec Possibly better, possibly in the noise. Reviewed By: anand1976 Differential Revision: D35907003 Pulled By: pdillinger fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e |
|
Akanksha Mahajan | 3653029dda |
Add stats related to async prefetching (#9845)
Summary: Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS. PREFETCHED_BYTES_DISCARDED records number of prefetched bytes discarded by FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by underling file_system Poll API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845 Test Plan: Update existing tests Reviewed By: anand1976 Differential Revision: D35909694 Pulled By: akankshamahajan15 fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36 |
|
Hui Xiao | a5063c8931 |
Fix issue of opening too many files in BlockBasedTableReaderCapMemoryTest.CapMemoryUsageUnderCacheCapacity (#9869)
Summary: **Context:** Unit test for https://github.com/facebook/rocksdb/pull/9748 keeps opening new files to see whether the new feature is able to correctly constrain the opening based on block cache capacity. However, the unit test has two places written inefficiently that can lead to opening too many new files relative to underlying operating system/file system constraint, even before hitting the block cache capacity: (1) [opened_table_reader_num < 2 * max_table_reader_num](https://github.com/facebook/rocksdb/pull/9748/files?show-viewed-files=true&file-filters%5B%5D=#diff-ec9f5353e317df71093094734ba29193b94a998f0f9c9af924e4c99692195eeaR438), which can leads to 1200 + open files because of (2) below (2) NewLRUCache(6 * CacheReservationManagerImpl<CacheEntryRole::kBlockBasedTableReader>::GetDummyEntrySize()) in [here](https://github.com/facebook/rocksdb/pull/9748/files?show-viewed-files=true&file-filters%5B%5D=#diff-ec9f5353e317df71093094734ba29193b94a998f0f9c9af924e4c99692195eeaR364) Therefore we see CI failures like this on machine with a strict open file limit ~1000 (see the "table_1021" naming in following error msg) https://app.circleci.com/pipelines/github/facebook/rocksdb/12886/workflows/75524682-3fa4-41ee-9a61-81827b51d99b/jobs/345270 ``` fs_->NewWritableFile(path, foptions, &file, nullptr) IO error: While open a file for appending: /dev/shm/rocksdb.Jedwt/run-block_based_table_reader_test-CapMemoryUsageUnderCacheCapacity-BlockBasedTableReaderCapMemoryTest.CapMemoryUsageUnderCacheCapacity-0/block_based_table_reader_test_1668910_829492452552920927/**table_1021**: Too many open files ``` **Summary:** - Revised the test more efficiently on the above 2 places, including using 1.1 instead 2 in the threshold and lowering down the block cache capacity a bit - Renamed some variables for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/9869 Test Plan: - Manual inspection of max opened table reader in all test case, which is around ~389 - Circle CI to see if error is gone Reviewed By: ajkr Differential Revision: D35752655 Pulled By: hx235 fbshipit-source-id: 8a0953d39d561babfa4257b8ed8550bb21b04839 |
|
Akanksha Mahajan | 0c7f455f85 |
Make initial auto readahead_size configurable (#9836)
Summary: Make initial auto readahead_size configurable Pull Request resolved: https://github.com/facebook/rocksdb/pull/9836 Test Plan: Added new unit test Ran regression: Without change: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.0 Date: Thu Mar 17 13:11:34 2022 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main] seekrandom : 483618.390 micros/op 2 ops/sec; 338.9 MB/s (249 of 249 found) ``` With this change: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Set seed to 1649895440554504 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.2 Date: Wed Apr 13 17:17:20 2022 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main] ... finished 100 ops seekrandom : 476892.488 micros/op 2 ops/sec; 344.6 MB/s (252 of 252 found) ``` Reviewed By: anand1976 Differential Revision: D35632815 Pulled By: akankshamahajan15 fbshipit-source-id: c8057a88f9294c9d03b1d434b03affe02f74d796 |
|
Peter Dillinger | efd035164b |
Meta-internal folly integration with F14FastMap (#9546)
Summary: Especially after updating to C++17, I don't see a compelling case for *requiring* any folly components in RocksDB. I was able to purge the existing hard dependencies, and it can be quite difficult to strip out non-trivial components from folly for use in RocksDB. (The prospect of doing that on F14 has changed my mind on the best approach here.) But this change creates an optional integration where we can plug in components from folly at compile time, starting here with F14FastMap to replace std::unordered_map when possible (probably no public APIs for example). I have replaced the biggest CPU users of std::unordered_map with compile-time pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set. USE_FOLLY is always set in the Meta-internal buck build, and a simulation of that is in the Makefile for public CI testing. A full folly build is not needed, but checking out the full folly repo is much simpler for getting the dependency, and anything else we might want to optionally integrate in the future. Some picky details: * I don't think the distributed mutex stuff is actually used, so it was easy to remove. * I implemented an alternative to `folly::constexpr_log2` (which is much easier in C++17 than C++11) so that I could pull out the hard dependencies on `ConstexprMath.h` * I had to add noexcept move constructors/operators to some types to make F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a macro to make that easier in some common cases. * Updated Meta-internal buck build to use folly F14Map (always) No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a production integration for open source users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546 Test Plan: CircleCI tests updated so that a couple of them use folly. Most internal unit & stress/crash tests updated to use Meta-internal latest folly. (Note: they should probably use buck but they currently use Makefile.) Example performance improvement: when filter partitions are pinned in cache, they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build a test that exercises that heavily. Build DB with ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters ``` and test with (simultaneous runs with & without folly, ~20 times each to see convergence) ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache ``` Average ops/s no folly: 26229.2 Average ops/s with folly: 26853.3 (+2.4%) Reviewed By: ajkr Differential Revision: D34181736 Pulled By: pdillinger fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94 |
|
Akanksha Mahajan | 63e68a4e77 |
Enable async prefetching for ReadOptions.readahead_size (#9827)
Summary: Currently async prefetching is enabled for implicit internal auto readahead in FilePrefetchBuffer if `ReadOptions.async_io` is set. This PR enables async prefetching for `ReadOptions.readahead_size` when `ReadOptions.async_io` is set true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9827 Test Plan: Update unit test Reviewed By: anand1976 Differential Revision: D35552129 Pulled By: akankshamahajan15 fbshipit-source-id: d9f9a96672852a591375a21eef15355cf3289f5c |
|
Hui Xiao | f337542948 |
Fix a bug of TEST_SetRandomTableProperties due to non-zero padding between fields in TableProperties struct (#9812)
Summary: Context: https://github.com/facebook/rocksdb/pull/9748#discussion_r843134214 reveals an issue with TEST_SetRandomTableProperties when non-zero padding is used between the last string field and first non-string field in TableProperties. Fixed by https://github.com/facebook/rocksdb/pull/9748#discussion_r843244375 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9812 Test Plan: No production code changes and rely on existing CI Reviewed By: ajkr Differential Revision: D35423680 Pulled By: hx235 fbshipit-source-id: fd855eef3d32771bb79c65bd7012ab8bb3c400ab |
|
Peter Dillinger | 8ce7cea93f |
Tests for filter compatibility (#9773)
Summary: This change adds two unit tests that would each catch the regression fixed in https://github.com/facebook/rocksdb/issues/9736 * TableMetaIndexKeys - detects any churn in metaindex block keys generated by SST files using standard db_test_util configurations. * BloomFilterCompatibility - this detects if any common built-in FilterPolicy configurations fail to read filters generated by another. (The regression bug caused NewRibbonFilterPolicy not to read filters from NewBloomFilterPolicy and vice-versa.) This replaces some previous tests that didn't really appear to be testing much of anything except basic data correctness, which doesn't tell you a filter is being used. Light refactoring in meta_blocks.cc/h to support inspecting metaindex keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9773 Test Plan: this is the test. Verified that 7.0.2 fails both tests and 7.0.3 passes. With backporting for intentional API changes in 7.0, 6.29 also passes. Reviewed By: ajkr Differential Revision: D35236248 Pulled By: pdillinger fbshipit-source-id: 493dfe9ad7e27524bf7c6c1af8a4b8c31bc6ef5a |
|
Hui Xiao | 49623f9c8e |
Account memory of big memory users in BlockBasedTable in global memory limit (#9748)
Summary: **Context:** Through heap profiling, we discovered that `BlockBasedTableReader` objects can accumulate and lead to high memory usage (e.g, `max_open_file = -1`). These memories are currently not saved, not tracked, not constrained and not cache evict-able. As a first step to improve this, similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track an estimate of `BlockBasedTableReader` object's memory in block cache and fail future creation if the memory usage exceeds the available space of cache at the time of creation. **Summary:** - Approximate big memory users (`BlockBasedTable::Rep` and `TableProperties` )' memory usage in addition to the existing estimated ones (filter block/index block/un-compression dictionary) - Charge all of these memory usages to block cache on `BlockBasedTable::Open()` and release them on `~BlockBasedTable()` as there is no memory usage fluctuation of concern in between - Refactor on CacheReservationManager (and its call-sites) to add concurrent support for BlockBasedTable used in this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9748 Test Plan: - New unit tests - db bench: `OpenDb` : **-0.52% in ms** - Setup `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -write_buffer_size=1048576` - Repeated run with pre-change w/o feature and post-change with feature, benchmark `OpenDb`: `./db_bench -benchmarks=readrandom -use_existing_db=1 -db=/dev/shm/testdb -reserve_table_reader_memory=true (remove this when running w/o feature) -file_opening_threads=3 -open_files=-1 -report_open_timing=true| egrep 'OpenDb:'` #-run | (feature-off) avg milliseconds | std milliseconds | (feature-on) avg milliseconds | std milliseconds | change (%) -- | -- | -- | -- | -- | -- 10 | 11.4018 | 5.95173 | 9.47788 | 1.57538 | -16.87382694 20 | 9.23746 | 0.841053 | 9.32377 | 1.14074 | 0.9343477536 40 | 9.0876 | 0.671129 | 9.35053 | 1.11713 | 2.893283155 80 | 9.72514 | 2.28459 | 9.52013 | 1.0894 | -2.108041632 160 | 9.74677 | 0.991234 | 9.84743 | 1.73396 | 1.032752389 320 | 10.7297 | 5.11555 | 10.547 | 1.97692 | **-1.70275031** 640 | 11.7092 | 2.36565 | 11.7869 | 2.69377 | **0.6635807741** - db bench on write with cost to cache in WriteBufferManager (just in case this PR's CRM refactoring accidentally slows down anything in WBM) : `fillseq` : **+0.54% in micros/op** `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -cost_write_buffer_to_cache=true -write_buffer_size=10000000000 | egrep 'fillseq'` #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) avg micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 6.15 | 0.260187 | 6.289 | 0.371192 | 2.260162602 20 | 7.28025 | 0.465402 | 7.37255 | 0.451256 | 1.267813605 40 | 7.06312 | 0.490654 | 7.13803 | 0.478676 | **1.060579461** 80 | 7.14035 | 0.972831 | 7.14196 | 0.92971 | **0.02254791432** - filter bench: `bloom filter`: **-0.78% in ms/key** - ` ./filter_bench -impl=2 -quick -reserve_table_builder_memory=true | grep 'Build avg'` #-run | (pre-PR) avg ns/key | std ns/key | (post-PR) ns/key | std ns/key | change (%) -- | -- | -- | -- | -- | -- 10 | 26.4369 | 0.442182 | 26.3273 | 0.422919 | **-0.4145720565** 20 | 26.4451 | 0.592787 | 26.1419 | 0.62451 | **-1.1465262** - Crash test `python3 tools/db_crashtest.py blackbox --reserve_table_reader_memory=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D35136549 Pulled By: hx235 fbshipit-source-id: 146978858d0f900f43f4eb09bfd3e83195e3be28 |
|
Akanksha Mahajan | 36bc3da97f |
Fix segfault in FilePrefetchBuffer with async_io enabled (#9777)
Summary: If FilePrefetchBuffer object is destroyed and then later Poll() calls callback on object which has been destroyed, it gives segfault on accessing destroyed object. It was caught after adding unit tests that tests Posix implementation of ReadAsync and Poll APIs. This PR also updates and fixes existing IOURing tests which were not running locally because RocksDbIOUringEnable function wasn't defined and IOUring was disabled for those tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/9777 Test Plan: Added new unit test Reviewed By: anand1976 Differential Revision: D35254002 Pulled By: akankshamahajan15 fbshipit-source-id: 68e80054ffb14ae25c255920ebc6548ca5f130a1 |
|
Alan Paxton | b6ad0d958f |
Fb 9718 verify checksums is ignored (#9767)
Summary: Fixes https://github.com/facebook/rocksdb/issues/9718 The verify_checksums flag of read_options should be passed to the read options used by the BlockFetcher in a couple of cases where it is not at present. It will now happen (but did not, previously) on iteration and on [multi]get, where a fetcher is created as part of the iterate/get call. This may result in much better performance in a few workloads where the client chooses to remove verification. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9767 Reviewed By: mrambacher Differential Revision: D35218986 Pulled By: jay-zhuang fbshipit-source-id: 329d29764bb70fbc7f2673440bc46c107a813bc8 |
|
myasuka | 98130c5a26 |
Enable READ_BLOCK_COMPACTION_MICROS to track stats (#9722)
Summary: After commit [ |
|
Peter Dillinger | 91687d70ea |
Fix a major performance bug in 7.0 re: filter compatibility (#9736)
Summary: Bloom filters generated by pre-7.0 releases are not read by 7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name() in https://github.com/facebook/rocksdb/issues/9590. This can severely impact read performance and read I/O on upgrade or downgrade with existing DB, but not data correctness. To fix, we go back using the old, unified name in SST metadata but (for a while anyway) recognize the aliases that could be generated by early 7.0.x releases. This unfortunately requires a public API change to avoid interfering with all the good changes from https://github.com/facebook/rocksdb/issues/9590, but the API change only affects users with custom FilterPolicy, which should be very few. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9736 Test Plan: manual Generate DBs with ``` ./db_bench.7.0 -db=/dev/shm/rocksdb.7.0 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 ``` and similar. Compare with ``` for IMPL in 6.29 7.0 fixed; do for DB in 6.29 7.0 fixed; do echo "Testing $IMPL on $DB:"; ./db_bench.$IMPL -db=/dev/shm/rocksdb.$DB -use_existing_db -readonly -bloom_bits=10 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=10 2>&1 | grep micros/op; done; done ``` Results: ``` Testing 6.29 on 6.29: readrandom : 34.381 micros/op 29085 ops/sec; 3.2 MB/s (291999 of 291999 found) Testing 6.29 on 7.0: readrandom : 190.443 micros/op 5249 ops/sec; 0.6 MB/s (52999 of 52999 found) Testing 6.29 on fixed: readrandom : 40.148 micros/op 24907 ops/sec; 2.8 MB/s (249999 of 249999 found) Testing 7.0 on 6.29: readrandom : 229.430 micros/op 4357 ops/sec; 0.5 MB/s (43999 of 43999 found) Testing 7.0 on 7.0: readrandom : 33.348 micros/op 29986 ops/sec; 3.3 MB/s (299999 of 299999 found) Testing 7.0 on fixed: readrandom : 152.734 micros/op 6546 ops/sec; 0.7 MB/s (65999 of 65999 found) Testing fixed on 6.29: readrandom : 32.024 micros/op 31224 ops/sec; 3.5 MB/s (312999 of 312999 found) Testing fixed on 7.0: readrandom : 33.990 micros/op 29390 ops/sec; 3.3 MB/s (294999 of 294999 found) Testing fixed on fixed: readrandom : 28.714 micros/op 34825 ops/sec; 3.9 MB/s (348999 of 348999 found) ``` Just paying attention to order of magnitude of ops/sec (short test durations, lots of noise), it's clear that with the fix we can read <= 6.29 & >= 7.0 at full speed, where neither 6.29 nor 7.0 can on both. And 6.29 release can properly read fixed DB at full speed. Reviewed By: siying, ajkr Differential Revision: D35057844 Pulled By: pdillinger fbshipit-source-id: a46893a6af4bf084375ebe4728066d00eb08f050 |
|
gitbw95 | 8102690a52 |
Update Cache::Release param from force_erase to erase_if_last_ref (#9728)
Summary: The param name force_erase may be misleading, since the handle is erased only if it has last reference even if the param is set true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9728 Reviewed By: pdillinger Differential Revision: D35038673 Pulled By: gitbw95 fbshipit-source-id: 0d16d1e8fed17b97eba7fb53207119332f659a5f |
|
Akanksha Mahajan | 49a10feb21 |
Provide implementation to prefetch data asynchronously in FilePrefetchBuffer (#9674)
Summary: In FilePrefetchBuffer if reads are sequential, after prefetching call ReadAsync API to prefetch data asynchronously so that in next prefetching data will be available. Data prefetched asynchronously will be readahead_size/2. It uses two buffers, one for synchronous prefetching and one for asynchronous. In case, the data is overlapping, the data is copied from both buffers to third buffer to make it continuous. This feature is under ReadOptions::async_io and is under experimental. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9674 Test Plan: 1. Add new unit tests 2. Run **db_stress** to make sure nothing crashes. - Normal prefetch without `async_io` ran successfully: ``` export CRASH_TEST_EXT_ARGS=" --async_io=0" make crash_test -j ``` 3. **Run Regressions**. i) Main branch without any change for normal prefetching with async_io disabled: ``` ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 - use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216 ``` ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.0 Date: Thu Mar 17 13:11:34 2022 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main] seekrandom : 483618.390 micros/op 2 ops/sec; 338.9 MB/s (249 of 249 found) ``` ii) normal prefetching after changes with async_io disable: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_withchange -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.0 Date: Thu Mar 17 14:11:31 2022 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_withchange] seekrandom : 471347.227 micros/op 2 ops/sec; 348.1 MB/s (255 of 255 found) ``` Reviewed By: anand1976 Differential Revision: D34731543 Pulled By: akankshamahajan15 fbshipit-source-id: 8e23aa93453d5fe3c672b9231ad582f60207937f |
|
Peter Dillinger | cff0d1e8e6 |
New backup meta schema, with file temperatures (#9660)
Summary: The primary goal of this change is to add support for backing up and restoring (applying on restore) file temperature metadata, without committing to either the DB manifest or the FS reported "current" temperatures being exclusive "source of truth". To achieve this goal, we need to add temperature information to backup metadata, which requires updated backup meta schema. Fortunately I prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version 6.19.0 for this kind of schema update. (Previously, backup meta schema was not extensible! Making this schema update public will allow some other "nice to have" features like taking backups with hard links, and avoiding crc32c checksum computation when another checksum is already available.) While schema version 2 is newly public, the default schema version is still 1. Until we change the default, users will need to set to 2 to enable features like temperature data backup+restore. New metadata like temperature information will be ignored with a warning in versions before this change and since 6.19.0. The metadata is considered ignorable because a functioning DB can be restored without it. Some detail: * Some renaming because "future schema" is now just public schema 2. * Initialize some atomics in TestFs (linter reported) * Add temperature hint support to SstFileDumper (used by BackupEngine) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660 Test Plan: related unit test majorly updated for the new functionality, including some shared testing support for tracking temperatures in a FS. Some other tests and testing hooks into production code also updated for making the backup meta schema change public. Reviewed By: ajkr Differential Revision: D34686968 Pulled By: pdillinger fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a |
|
ehds@qq.com | 60422f1676 |
Replace GetUserKey with ExtractUserKey (#9664)
Summary: Replace `GetUserKey` with `ExtractUserKey` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9664 Reviewed By: jay-zhuang Differential Revision: D34673571 Pulled By: ajkr fbshipit-source-id: 5acf7d0d1a45efce0ccbe505991acf76c9cdc461 |
|
Jay Zhuang | 36aec94d85 |
`compression_per_level` should be used for flush and changeable (#9658)
Summary: - Make `compression_per_level` dynamical changeable with `SetOptions`; - Fix a bug that `compression_per_level` is not used for flush; Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658 Test Plan: CI Reviewed By: ajkr Differential Revision: D34700749 Pulled By: jay-zhuang fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c |
|
Hui Xiao | 4a776d81cc |
Dynamic toggling of BlockBasedTableOptions::detect_filter_construct_corruption (#9654)
Summary: **Context/Summary:** As requested, `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions` after this PR Pull Request resolved: https://github.com/facebook/rocksdb/pull/9654 Test Plan: - New unit test Reviewed By: pdillinger Differential Revision: D34622609 Pulled By: hx235 fbshipit-source-id: c06773ef3d029e6bf1724d3a72dffd37a8ec66d9 |
|
sdong | 33742c2a9f |
Remove BlockBasedTableOptions.hash_index_allow_collision (#9454)
Summary: BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D33805827 fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d |
|
Andrew Kryczka | 0a89cea5f5 |
Handle failures in block-based table size/offset approximation (#9615)
Summary: In crash test with fault injection, we were seeing stack traces like the following: ``` https://github.com/facebook/rocksdb/issues/3 0x00007f75f763c533 in __GI___assert_fail (assertion=assertion@entry=0x1c5b2a0 "end_offset >= start_offset", file=file@entry=0x1c580a0 "table/block_based/block_based_table_reader.cc", line=line@entry=3245, function=function@entry=0x1c60e60 "virtual uint64_t rocksdb::BlockBasedTable::ApproximateSize(const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::TableReaderCaller)") at assert.c:101 https://github.com/facebook/rocksdb/issues/4 0x00000000010ea9b4 in rocksdb::BlockBasedTable::ApproximateSize (this=<optimized out>, start=..., end=..., caller=<optimized out>) at table/block_based/block_based_table_reader.cc:3224 https://github.com/facebook/rocksdb/issues/5 0x0000000000be61fb in rocksdb::TableCache::ApproximateSize (this=0x60f0000161b0, start=..., end=..., fd=..., caller=caller@entry=rocksdb::kCompaction, internal_comparator=..., prefix_extractor=...) at db/table_cache.cc:719 https://github.com/facebook/rocksdb/issues/6 0x0000000000c3eaec in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, v=<optimized out>, f=..., start=..., end=..., caller=<optimized out>) at ./db/version_set.h:850 https://github.com/facebook/rocksdb/issues/7 0x0000000000c6ebc3 in rocksdb::VersionSet::ApproximateSize (this=<optimized out>, options=..., v=v@entry=0x621000047500, start=..., end=..., start_level=start_level@entry=0, end_level=<optimized out>, caller=<optimized out>) at db/version_set.cc:5657 https://github.com/facebook/rocksdb/issues/8 0x000000000166e894 in rocksdb::CompactionJob::GenSubcompactionBoundaries (this=<optimized out>) at ./include/rocksdb/options.h:1869 https://github.com/facebook/rocksdb/issues/9 0x000000000168c526 in rocksdb::CompactionJob::Prepare (this=this@entry=0x7f75f3ffcf00) at db/compaction/compaction_job.cc:546 ``` The problem occurred in `ApproximateSize()` when the index `Seek()` for the first `ApproximateDataOffsetOf()` encountered an I/O error, while the second `Seek()` did not. In the old code that scenario caused `start_offset == data_size` , thus it was easy to trip the assertion that `end_offset >= start_offset`. The fix is to set `start_offset == 0` when the first index `Seek()` fails, and `end_offset == data_size` when the second index `Seek()` fails. I doubt these give an "on average correct" answer for how this function is used, but I/O errors in index seeks are hopefully rare, it looked consistent with what was already there, and it was easier to calculate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9615 Test Plan: run the repro command for a while and stopped seeing coredumps - ``` $ while ! ./db_stress --block_size=128 --cache_size=32768 --clear_column_family_one_in=0 --column_families=1 --continuous_verification_interval=0 --db=/dev/shm/rocksdb_crashtest --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --index_type=2 --iterpercent=10 --kill_random_test=18887 --max_key=1000000 --max_bytes_for_level_base=2048576 --nooverwritepercent=1 --open_files=-1 --open_read_fault_one_in=32 --ops_per_thread=1000000 --prefixpercent=5 --read_fault_one_in=0 --readpercent=45 --reopen=0 --skip_verifydb=1 --subcompactions=2 --target_file_size_base=524288 --test_batches_snapshots=0 --value_size_mult=32 --write_buffer_size=524288 --writepercent=35 ; do : ; done ``` Reviewed By: pdillinger Differential Revision: D34383069 Pulled By: ajkr fbshipit-source-id: fac26c3b20ea962e75387515ba5f2724dc48719f |
|
mrambacher | 30b08878d8 |
Make FilterPolicy Customizable (#9590)
Summary: Make FilterPolicy into a Customizable class. Allow new FilterPolicy to be discovered through the ObjectRegistry Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590 Reviewed By: pdillinger Differential Revision: D34327367 Pulled By: mrambacher fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190 |
|
Andrew Kryczka | 54fb2a8975 |
Change type of cache buffer passed to `Cache::CreateCallback()` to `const void*` (#9595)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9595 Reviewed By: riversand963 Differential Revision: D34329906 Pulled By: ajkr fbshipit-source-id: 508601856fa9bee4d40f4a68d14d333ef2143d40 |
|
Peter Dillinger | 725833a424 |
Hide FilterBits{Builder,Reader} from public API (#9592)
Summary: We don't have any evidence of people using these to build custom filters. The recommended way of customizing filter handling is to defer to various built-in policies based on FilterBuildingContext (e.g. to build Monkey filtering policy). With old API, we have evidence of people modifying keys going into filter, but most cases of that can be handled with prefix_extractor. Having FilterBitsBuilder+Reader in the public API is an ogoing hinderance to code evolution (e.g. recent new Finish and MaybePostVerify), and so this change removes them from the public API for 7.0. Maybe they will come back in some form later, but lacking evidence of them providing value in the public API, we want to take back more freedom to evolve these. With this moved to internal-only, there is no rush to clean up the complex Finish signatures, or add memory allocator support, but doing so is much easier with them out of public API, for example to use CacheAllocationPtr without exposing it in the public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9592 Test Plan: cosmetic changes only Reviewed By: hx235 Differential Revision: D34315470 Pulled By: pdillinger fbshipit-source-id: 03e03bb66a72c73df2c464d2dbbbae906dd8f99b |
|
Andrew Kryczka | babe56ddba |
Add rate limiter priority to ReadOptions (#9424)
Summary: Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working. `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`. There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads). The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424 Test Plan: - new unit tests - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart. - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true` - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true` - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10` Reviewed By: hx235 Differential Revision: D33747386 Pulled By: ajkr fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c |
|
Peter Dillinger | 1e403a0c6c |
Fix assertion failure in FastLocalBloomBitsBuilder (#9585)
Summary: As in ``` db_stress: table/block_based/filter_policy.cc:316: rocksdb::{anonymous}::FastLocalBloomBitsBuilder::FastLocalBloomBitsBuilder(int, std::atomic<long int>*, std::shared_ptr<rocksdb::CacheReservationManager>, bool): Assertion `millibits_per_key >= 1000' failed. ``` This assertion failure was actually happening with our RibbonFilterPolicy which falls back to Bloom for some cases, often for flush, but was missing new special logic to skip generating filter for 0 bits per key case. Fixed by adding the logic in other builtin FilterPolicy implementations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9585 Test Plan: Updated db_bloom_filter_test to do more integration testing of the RibbonFilterPolicy ("auto Ribbon") class, incl regression test this with SkipFilterOnEssentiallyZeroBpk Reviewed By: ajkr Differential Revision: D34295101 Pulled By: pdillinger fbshipit-source-id: 3488eb207fc1d67bbbd1301313714aa1b6406e6e |
|
Peter Dillinger | 8c681087c7 |
Refactor FilterPolicies toward Customizable (#9567)
Summary: Some changes to make it easier to make FilterPolicy customizable. Especially, create distinct classes for the different testing-only and user-facing built-in FilterPolicy modes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9567 Test Plan: tests updated, with no intended difference in functionality tested. No difference in test performance seen as a result of moving to string-based filter type configuration. Reviewed By: mrambacher Differential Revision: D34234694 Pulled By: pdillinger fbshipit-source-id: 8a94931a9e04c3bcca863a4f524cfd064aaf0122 |
|
Peter Dillinger | 479eb1aad6 |
Hide deprecated, inefficient block-based filter from public API (#9535)
Summary: This change removes the ability to configure the deprecated, inefficient block-based filter in the public API. Options that would have enabled it now use "full" (and optionally partitioned) filters. Existing block-based filters can still be read and used, and a "back door" way to build them still exists, for testing and in case of trouble. About the only way this removal would cause an issue for users is if temporary memory for filter construction greatly increases. In HISTORY.md we suggest a few possible mitigations: partitioned filters, smaller SST files, or setting reserve_table_builder_memory=true. Or users who have customized a FilterPolicy using the CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade their code. (It's long past time for people to move to the new builder/reader customization interface.) This change also introduces some internal-use-only configuration strings for testing specific filter implementations while bypassing some compatibility / intelligence logic. This is intended to hint at a path toward making FilterPolicy Customizable, but it also gives us a "back door" way to configure block-based filter. Aside: updated db_bench so that -readonly implies -use_existing_db Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535 Test Plan: Unit tests updated. Specifically, * BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back door configuration interface and ignoring of `use_block_based_builder`. * BlockBasedTableTest.TracingGetTest is migrated from testing block-based filter access pattern to full filter access patter, by re-ordering some things. * Options test (pretty self-explanatory) Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30` Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB No consistent difference with fillrandom Reviewed By: jay-zhuang Differential Revision: D34153871 Pulled By: pdillinger fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738 |
|
Peter Dillinger | 68a9c186d0 |
FilterPolicy API changes for 7.0 (#9501)
Summary: * Inefficient block-based filter is no longer customizable in the public API, though (for now) can still be enabled. * Removed deprecated FilterPolicy::CreateFilter() and FilterPolicy::KeyMayMatch() * Removed `rocksdb_filterpolicy_create()` from C API * Change meaning of nullptr return from GetBuilderWithContext() from "use block-based filter" to "generate no filter in this case." This is a cleaner solution to the proposal in https://github.com/facebook/rocksdb/issues/8250. * Also, when user specifies bits_per_key < 0.5, we now round this down to "no filter" because we expect a filter with >= 80% FP rate is unlikely to be worth the CPU cost of accessing it (esp with cache_index_and_filter_blocks=1 or partition_filters=1). * bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP rate) * This also gives us some support for configuring filters from OPTIONS file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`. Opening from such an options file will enable reading filters (an improvement) but not writing new ones. (See Customizable follow-up below.) * Also removed deprecated functions * FilterBitsBuilder::CalculateNumEntry() * FilterPolicy::GetFilterBitsBuilder() * NewExperimentalRibbonFilterPolicy() * Remove default implementations of * FilterBitsBuilder::EstimateEntriesAdded() * FilterBitsBuilder::ApproximateNumEntries() * FilterPolicy::GetBuilderWithContext() * Remove support for "filter_policy=experimental_ribbon" configuration string. * Allow "filter_policy=bloomfilter:n" without bool to discourage use of block-based filter. Some pieces for https://github.com/facebook/rocksdb/issues/9389 Likely follow-up (later PRs): * Refactoring toward FilterPolicy Customizable, so that we can generate filters with same configuration as before when configuring from options file. * Remove support for user enabling block-based filter (ignore `bool use_block_based_builder`) * Some months after this change, we could even remove read support for block-based filter, because it is not critical to DB data preservation. * Make FilterBitsBuilder::FinishV2 to avoid `using FilterBitsBuilder::Finish` mess and add support for specifying a MemoryAllocator (for cache warming) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9501 Test Plan: A number of obsolete tests deleted and new tests or test cases added or updated. Reviewed By: hx235 Differential Revision: D34008011 Pulled By: pdillinger fbshipit-source-id: a39a720457c354e00d5b59166b686f7f59e392aa |
|
satyajanga | 036bbab6f7 |
Use the comparator from the sst file table properties in sst_dump_tool (#9491)
Summary: We introduced a new Comparator for timestamp in user keys. In the sst_dump_tool by default we use BytewiseComparator to read sst files. This change allows us to read comparator_name from table properties in meta data block and use it to read. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9491 Test Plan: added unittests for new functionality. make check ![image](https://user-images.githubusercontent.com/4923556/152915444-28b88a1f-7b4e-47d0-815f-7011552bd9a2.png) ![image](https://user-images.githubusercontent.com/4923556/152916196-bea3d2a1-a3d5-4362-b911-036131b83e8d.png) Reviewed By: riversand963 Differential Revision: D33993614 Pulled By: satyajanga fbshipit-source-id: 4b5cf938e6d2cb3931d763bef5baccc900b8c536 |
|
Hui Xiao | 920386f2b7 |
Detect (new) Bloom/Ribbon Filter construction corruption (#9342)
Summary: Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393 **Context:** (Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following: a) set of keys to add to filter b) set of hashes to add to filter (64-bit hash applied to each key) c) set of Bloom indices to set in filter, with duplicates d) set of Bloom indices to set in filter, deduplicated e) final filter and its checksum This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level. - b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`) - c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO. Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default. **Summary:** - Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()` - Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design - Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries` - Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()` - When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342 Test Plan: - Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption` - Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl - For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing break. - Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()` - FastLocalBloom - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s** - After change: - `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)** - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)** - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)** - Standard128Ribbon - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s** - After change: - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)** - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)** - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)** - Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true` - Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status. Reviewed By: pdillinger Differential Revision: D33746928 Pulled By: hx235 fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57 |
|
Peter Dillinger | f6d7ec1d02 |
Ignore `total_order_seek` in DB::Get (#9427)
Summary: Apparently setting total_order_seek=true for DB::Get was intended to allow accurate read semantics if the current prefix extractor doesn't match what was used to generate SST files on disk. But since prefix_extractor was made a mutable option in 5.14.0, we have been able to detect this case and provide the correct semantics regardless of the total_order_seek option. Since that time, the option has only made Get() slower in a reasonably common case: prefix_extractor unchanged and whole_key_filtering=false. So this change primarily removes unnecessary effect of total_order_seek on Get. Also cleans up some related comments. Also adds a -total_order_seek option to db_bench and canonicalizes handling of ReadOptions in db_bench so that command line options have the expected association with library features. (There is potential for change in regression test behavior, but the old behavior is likely indefensible, or some other inconsistency would need to be fixed.) TODO in follow-up work: there should be no reason for Get() to depend on current prefix extractor at all. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427 Test Plan: Unit tests updated. Performance (using db_bench update) Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0` Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Before this change, total_order_seek=false: 25188 ops/sec Before this change, total_order_seek=true: 1222 ops/sec (~20x slower) After this change, total_order_seek=false: 24570 ops/sec After this change, total_order_seek=true: 25012 ops/sec (indistinguishable) Reviewed By: siying Differential Revision: D33753458 Pulled By: pdillinger fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014 |
|
Peter Dillinger | ea89c77f27 |
Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453)
Summary: MemTable::MultiGet was not considering range tombstones before querying Bloom filter. This means range tombstones would be skipped for keys (or prefixes) with no other entries in the memtable. This could cause old values for a key (in SST files) to still show up until the range tombstone covering it has been flushed. This is fixed by essentially disabling the memtable Bloom filter when there are any range tombstones. (This could be better optimized in the future, but good enough for now.) Did some other cleanup/optimization in the same code to (more than) offset the cost of checking on range tombstones in more cases. There is now notable improvement when memtable_whole_key_filtering and prefix_extractor are used together (unusual), and this makes MultiGet closer to the Get implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453 Test Plan: new unit test added. Added memtable Bloom to crash test. Performance testing -------------------- Build WAL-only DB (recovers to memtable): ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000 ``` Query test command, to maximize sensitivity to the changed code: ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS ``` (Note -num here is 10x larger for mostly memtable misses) Before & after run simultaneously, average over 10 iterations per data point, ops/sec. MWKF=0 PXS=0 (Bloom disabled) Before: 5724844 After: 6722066 MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful) Before: 9981319 After: 10237990 MWKF=0 PXS=8 (prefixes unique; Bloom useful) Before: 12081715 After: 12117603 MWKF=1 PXS=0 (whole key Bloom useful) Before: 11944354 After: 12096085 MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version) Before: 9444299 After: 11826029 MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version) Before: 11784465 After: 11778591 Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster. Reviewed By: ajkr Differential Revision: D33805025 Pulled By: pdillinger fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf |
|
Hui Xiao | 11d7329503 |
Clarify status-handling logic in BlockBasedTableBuilder::WriteRawBlock (#9393)
Summary: **Context:** Inside `BlockBasedTableBuilder::WriteRawBlock`, there are multiple places that change local variables `io_s` and `s` while depend on them. This PR attempts to clarify the relevant logics so that it's easier to read and add places of changing these local variables later (like https://github.com/facebook/rocksdb/pull/9342.) without changing the current behavior. **Summary:** - Shorten the lifetime of local var `io_s` and `s` as much as possible to avoid if-else branches by early return **Test** - Reasoned against original behavior to verify new changes do not break existing behaviors. - Rely on CI tests since we are not changing current behavior. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9393 Reviewed By: pdillinger Differential Revision: D33626095 Pulled By: hx235 fbshipit-source-id: 6184d1e1d85d2650d16617c449971988d062ed3f |
|
sdong | 5d10a53b42 |
Not try to finish index builder after errors (#9426)
Summary: Right now, when error happens in block based table reader, we still call index_builder->Finish(), this causes one assertion in one stress test: db_stress: table/block_based/index_builder.cc:202: virtual rocksdb::Status rocksdb::PartitionedIndexBuilder::Finish(rocksdb::IndexBuilder::IndexBlocks*, const rocksdb::BlockHandle&): Assertion `sub_index_builder_ == nullptr' failed. This unlikely causes any corruption as we would finally abandon the file, but the code is confusing and it is hard to understand what would happen. Changing the behavior. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9426 Test Plan: Run existing tests Reviewed By: pdillinger Differential Revision: D33751929 fbshipit-source-id: 3c916b9444a4171010fc53df40496570bef5ae7a |
|
Peter Dillinger | fc9d4071f0 |
Fast path for detecting unchanged prefix_extractor (#9407)
Summary: Fixes a major performance regression in 6.26, where extra CPU is spent in SliceTransform::AsString when reads involve a prefix_extractor (Get, MultiGet, Seek). Common case performance is now better than 6.25. This change creates a "fast path" for verifying that the current prefix extractor is unchanged and compatible with what was used to generate a table file. This fast path detects the common case by pointer comparison on the current prefix_extractor and a "known good" prefix extractor (if applicable) that is saved at the time the table reader is opened. The "known good" prefix extractor is saved as another shared_ptr copy (in an existing field, however) to ensure the pointer is not recycled. When the prefix_extractor has changed to a different instance but same compatible configuration (rare, odd), performance is still a regression compared to 6.25, but this is likely acceptable because of the oddity of such a case. The performance of incompatible prefix_extractor is essentially unchanged. Also fixed a minor case (ForwardIterator) where a prefix_extractor could be used via a raw pointer after being freed as a shared_ptr, if replaced via SetOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407 Test Plan: ## Performance Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load) v6.26: 4833 vs. 6698 (<- major regression!) v6.27: 4737 vs. 6397 (still) New: 6704 vs. 6461 (better than baseline in common case) Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible) Changed prefix size (no usable filter) in new: 787 vs. 5927 Changed prefix size (no usable filter) in new & baseline: 773 vs. 784 Reviewed By: mrambacher Differential Revision: D33677812 Pulled By: pdillinger fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76 |
|
Peter Dillinger | ffe1e4b820 |
Make some FilterPolicy deprecations more clear (#9403)
Summary: The old block-based filter has been deprecated for years, but this makes that more clear by marking the functions specific to it and logging a warning when the feature is used. It is deprecated because of performance. In that old design, you have to binary search through the full SST index before a bloom filter query, which is much more expensive than a bloom query itself. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9403 Test Plan: Used db_bench with and without -use_block_based_filter, running at the same time TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom,readrandom -num=10000000 -duration=20 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 No significant difference in construction time but 3x slower readrandom with -use_block_based_filter: readrandom : 100.517 micros/op 9948 ops/sec; 1.1 MB/s vs. readrandom : 33.368 micros/op 29968 ops/sec; 3.3 MB/s Also saw deprecation message (just once) in LOG only with -use_block_based_filter Reviewed By: ajkr Differential Revision: D33673202 Pulled By: pdillinger fbshipit-source-id: 99f6f0eff619408d9e5f7ef546954ed0be6c7a5b |
|
Si Ke | 93b1de4f45 |
Enable db_test running in Centos 32 bit OS and Alpine 32 bit OS (#9294)
Summary: Closes https://github.com/facebook/rocksdb/issues/9271 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9294 Reviewed By: riversand963, hx235 Differential Revision: D33586002 Pulled By: pdillinger fbshipit-source-id: 3d1a2fa71023e108613ff03dbd37a5f954fc4920 |
|
mrambacher | 1973fcba11 |
Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362)
Summary: In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions. The PatternEntry methods were left in place but renamed to AddFactory. The goal is to allow for the deprecation of the original regex Registry method in an upcoming release. Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362 Reviewed By: pdillinger Differential Revision: D33432562 Pulled By: mrambacher fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02 |
|
Hui Xiao | 9110685e8c |
Release cache reservation of hash entries of the fall-back Ribbon Filter earlier (#9345)
Summary: Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, as part of https://github.com/facebook/rocksdb/pull/9342 **Context:** https://github.com/facebook/rocksdb/pull/9073 charged the hash entries' memory in block cache with `CacheReservationHandle`. However, in the edge case where Ribbon Filter falls back to Bloom Filter and swaps its hash entries to the embedded bloom filter object, the handles associated with those entries are not swapped and thus not released as soon as those entries are cleared during Bloom Filter's finish process. Although this is a minor issue since RocksDB internal calls `FilterBitsBuilder->Reset()` right after `FilterBitsBuilder->Finish()` on the main path, which releases all the cache reservation related to both the Ribbon Filter and its embedded Bloom Filter, it still worths this fix to avoid confusion. **Summary:** - Swapped the `CacheReservationHandle` associated with the hash entries on Ribbon Filter's fallback Pull Request resolved: https://github.com/facebook/rocksdb/pull/9345 Test Plan: - Added a unit test to verify the number of cache reservation after clearing hash entries, which failed before the change and now succeeds Reviewed By: pdillinger Differential Revision: D33377225 Pulled By: hx235 fbshipit-source-id: 7487f4c40dfb6ee7928232021f93ef2c5329cffa |
|
Hui Xiao | fb0a76a9e2 |
Always check previous conditionally unchecked status due to shortcut evaluation in BlockBasedTableBuilder::WriteIndexBlock (#9349)
Summary: Note: part of https://github.com/facebook/rocksdb/pull/9342 **Context/Summary:** Due to shortcut evaluation in `ok() && s.IsIncomplete()`, status `s` remains unchecked if `ok()==false`, which is the case in https://app.circleci.com/pipelines/github/facebook/rocksdb/10718/workflows/429f7ad4-6b9a-446b-b9b3-710d51b90409/jobs/265508 revealed by the change in the corresponding PR https://github.com/facebook/rocksdb/pull/9342. As suggested by reviewers, separation and clarification of status checking for partitioned index building from general table building status is added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9349 Test Plan: - The newly added if-else code is an equivalent translation of the existing logic plus always checking the conditionally unchecked status so relying on existing tests should be fine - https://github.com/facebook/rocksdb/pull/9342's `[build-linux-shared_lib-alt_namespace-status_checked](https://app.circleci.com/pipelines/github/facebook/rocksdb/10721/workflows/a200efe0-d545-4075-8c42-26dd3dc00f27/jobs/265625)` test should now pass after rebasing on this change Reviewed By: pdillinger Differential Revision: D33377223 Pulled By: hx235 fbshipit-source-id: cb81da9709ae9185e9cea89776e3012e915d6ef9 |
|
mrambacher | 1c39b7952b |
Remove/Reduce use of Regex in ObjectRegistry/Library (#9264)
Summary: Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches. Added tests for the new derivative classes and got all unit tests to pass. Resolves https://github.com/facebook/rocksdb/issues/9225. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264 Reviewed By: pdillinger Differential Revision: D33062001 Pulled By: mrambacher fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03 |
|
mrambacher | 9a116ab4b4 |
Add NewMetaDataIterator method (#8692)
Summary: Fixes a problem where the iterator for metadata was being treated as a non-user key when in fact it was a user key. This led to a problem where the property keys could not be searched for correctly. The main exposure of this problem was that the HashIndexReader could not get the "prefixes" property correctly, resulting in the failure of retrieval/creation of the BlockPrefixIndex. Added BlockBasedTableTest.SeekMetaBlocks test to validate this condition. Fixing this condition exposed two other tests (SeekWithPrefixLongerThanKey, MultiGetPrefixFilter) that passed incorrectly previously and now failed. Updated those two tests to pass. Not sure if the tests are functionally correct/still appropriate, but made them pass... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8692 Reviewed By: riversand963 Differential Revision: D33119539 Pulled By: mrambacher fbshipit-source-id: 658969fe9265f73dc184dab97cc3f4eaed2d881a |
|
mrambacher | 423538a816 |
Make MemoryAllocator into a Customizable class (#8980)
Summary: - Make MemoryAllocator and its implementations into a Customizable class. - Added a "DefaultMemoryAllocator" which uses new and delete - Added a "CountedMemoryAllocator" that counts the number of allocs and free - Updated the existing tests to use these new allocators - Changed the memkind allocator test into a generic test that can test the various allocators. - Added tests for creating all of the allocators - Added tests to verify/create the JemallocNodumpAllocator using its options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980 Reviewed By: zhichao-cao Differential Revision: D32990403 Pulled By: mrambacher fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76 |
|
Peter Dillinger | 0050a73a4f |
New stable, fixed-length cache keys (#9126)
Summary: This change standardizes on a new 16-byte cache key format for block cache (incl compressed and secondary) and persistent cache (but not table cache and row cache). The goal is a really fast cache key with practically ideal stability and uniqueness properties without external dependencies (e.g. from FileSystem). A fixed key size of 16 bytes should enable future optimizations to the concurrent hash table for block cache, which is a heavy CPU user / bottleneck, but there appears to be measurable performance improvement even with no changes to LRUCache. This change replaces a lot of disjointed and ugly code handling cache keys with calls to a simple, clean new internal API (cache_key.h). (Preserving the old cache key logic under an option would be very ugly and likely negate the performance gain of the new approach. Complete replacement carries some inherent risk, but I think that's acceptable with sufficient analysis and testing.) The scheme for encoding new cache keys is complicated but explained in cache_key.cc. Also: EndianSwapValue is moved to math.h to be next to other bit operations. (Explains some new include "math.h".) ReverseBits operation added and unit tests added to hash_test for both. Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126 Test Plan: ### Basic correctness Several tests needed updates to work with the new functionality, mostly because we are no longer relying on filesystem for stable cache keys so table builders & readers need more context info to agree on cache keys. This functionality is so core, a huge number of existing tests exercise the cache key functionality. ### Performance Create db with `TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters` And test performance with `TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4` using DEBUG_LEVEL=0 and simultaneous before & after runs. Before ops/sec, avg over 100 runs: 121924 After ops/sec, avg over 100 runs: 125385 (+2.8%) ### Collision probability I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity over many months, by making some pessimistic simplifying assumptions: * Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys) * All of every file is cached for its entire lifetime We use a simple table with skewed address assignment and replacement on address collision to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output with `./cache_bench -stress_cache_key -sck_keep_bits=40`: ``` Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached) ``` These come from default settings of 2.5M files per day of 32 MB each, and `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality. More default assumptions, relatively pessimistic: * 100 DBs in same process (doesn't matter much) * Re-open DB in same process (new session ID related to old session ID) on average every 100 files generated * Restart process (all new session IDs unrelated to old) 24 times per day After enough data, we get a result at the end: ``` (keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected) ``` If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data: ``` (keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected) (keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected) ``` The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases: ``` 197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected) ``` I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data. Reviewed By: zhichao-cao Differential Revision: D33171746 Pulled By: pdillinger fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f |
|
Akanksha Mahajan | 96d0773a11 |
Update prepopulate_block_cache logic to support block-based filter (#9300)
Summary: Update prepopulate_block_cache logic to support block-based filter during insertion in block cache Pull Request resolved: https://github.com/facebook/rocksdb/pull/9300 Test Plan: CircleCI tests, make crash_test -j64 Reviewed By: pdillinger Differential Revision: D33132018 Pulled By: akankshamahajan15 fbshipit-source-id: 241deabab8645bda704728e572d6de6354df18b2 |
|
Peter Dillinger | e92a0ed040 |
Optimize & clean up footer code (#9280)
Summary: Again, ahead of planned changes in https://github.com/facebook/rocksdb/issues/9058. This change improves performance (vs. pre-https://github.com/facebook/rocksdb/issues/9240 baseline) by separating a FooterBuilder from Footer, where FooterBuilder includes (inline owns) the serialized data so that it can be stack allocated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9280 Test Plan: existing tests + performance testing below Extreme case performance testing as in https://github.com/facebook/rocksdb/issues/9240 with TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness) Pre-https://github.com/facebook/rocksdb/issues/9240 baseline ( |
|
Peter Dillinger | 653c392e47 |
More refactoring ahead of footer & meta changes (#9240)
Summary: I'm working on a new format_version=6 to support context checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test updates to support that change. Test coverage data and manual inspection agree on dead code in block_based_table_reader.cc (removed). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240 Test Plan: tests enhanced to cover more cases etc. Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible. TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true} (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness) Before w/ wal: 454512 After w/ wal: 444820 (-2.1%) Before w/o wal: 1004560 After w/o wal: 998897 (-0.6%) Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case. This regression will be corrected in a follow-up PR. Reviewed By: ajkr Differential Revision: D32813769 Pulled By: pdillinger fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f |
|
Akanksha Mahajan | 9e4d56f2c9 |
Fix segmentation fault in table_options.prepopulate_block_cache when used with partition_filters (#9263)
Summary: When table_options.prepopulate_block_cache is set to BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and table_options.partition_filters is also set true, then there is segmentation failure when top level filter is fetched because its entered with wrong type in cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263 Test Plan: Updated unit tests; Ran db_stress: make crash_test -j32 Reviewed By: pdillinger Differential Revision: D32936566 Pulled By: akankshamahajan15 fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426 |
|
Hui Xiao | 9daf07305c |
Replace TableProperties::properties_offsets map with external_sst_file_global_seqno_offset (#9212)
Summary: **Context:** Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offeset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties. Note: - See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets` - See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence. **Summary:** - Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212 Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested. Reviewed By: ajkr Differential Revision: D32665941 Pulled By: hx235 fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795 |
|
Akanksha Mahajan | 04b2c16f9b |
Fix bug in rocksdb internal automatic prefetching (#9234)
Summary: After introducing adaptive_readahead, the original flow got broken. Readahead size was set to 0 because of which rocksdb wasn't be able to do automatic prefetching which it enables after seeing sequential reads. This PR fixes it. ---------------------------------------------------------------------------------------------------- Before this patch: b_bench -use_existing_db=true -db=/tmp/prefix_scan -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 6.27 Date: Tue Nov 30 11:56:50 2021 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/prefix_scan] seekrandom : 5356367.174 micros/op 0 ops/sec; 29.4 MB/s (23 of 23 found) ---------------------------------------------------------------------------------------------------- After the patch: ./db_bench -use_existing_db=true -db=/tmp/prefix_scan -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 6.27 Date: Tue Nov 30 14:38:33 2021 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/prefix_scan] seekrandom : 456504.277 micros/op 2 ops/sec; 359.8 MB/s (264 of 264 found) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9234 Test Plan: Ran ./db_bench -db=/data/mysql/rocksdb/prefix_scan -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_d irect_io_for_flush_and_compaction=true -target_file_size_base=16777216 and then ./db_bench -use_existing_db=true -db=/data/mysql/rocksdb/prefix_scan -benchmarks="seekrandom" -key_size=32 -value_siz e=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 and compared the results. Reviewed By: anand1976 Differential Revision: D32743965 Pulled By: akankshamahajan15 fbshipit-source-id: b950fba68c91963b7deb5c20acdf471bc60251f5 |
|
Levi Tamasi | dc5de45af8 |
Support readahead during compaction for blob files (#9187)
Summary: The patch adds a new BlobDB configuration option `blob_compaction_readahead_size` that can be used to enable prefetching data from blob files during compaction. This is important when using storage with higher latencies like HDDs or remote filesystems. If enabled, prefetching is used for all cases when blobs are read during compaction, namely garbage collection, compaction filters (when the existing value has to be read from a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187 Test Plan: Ran `make check` and the stress/crash test. Reviewed By: riversand963 Differential Revision: D32565512 Pulled By: ltamasi fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d |
|
Peter Dillinger | cd4ea675e3 |
Fix backward compatibility with 2.5 through 2.7 (#9189)
Summary: A bug in https://github.com/facebook/rocksdb/issues/9163 can cause checksum verification to fail if parsing a properties block fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9189 Test Plan: check_format_compatible.sh (never quite works locally but this particular case seems fixed using variants of SHORT_TEST=1). And added new unit test case. Reviewed By: ajkr Differential Revision: D32574626 Pulled By: pdillinger fbshipit-source-id: 6fa5c8595737b71a3c3d011a52daf6d6c08715d7 |
|
slk | e12753eb71 |
Track each SST's timestamp information as user properties (#9093)
Summary: Track each SST's timestamp information as user properties https://github.com/facebook/rocksdb/issues/8959 Rockdb has supported user-defined timestamp feature. Application can specify a timestamp when writing each k-v pair. When data flush from memory to disk file called SST files. Each SST files consist of multiple data blocks and several metadata blocks. Among the metadata blocks, there is one called Properties block that tracks some pre-defined properties of this SST file. This PR is for collecting the properties of min and max timestamps of all keys in the file. With those properties the SST file is more convenient to tell whether the keys in the SST have timestamps or not. The changes involved are as follows: 1) Add a class TimestampTablePropertiesCollector to collect min/max timestamp when add keys to table, The way TimestampTablePropertiesCollector use to compare timestamp of key should defined by user by implementing the Comparator::CompareTimestamp function in the user defined comparator. 2) Add corresponding unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9093 Reviewed By: ltamasi Differential Revision: D32406927 Pulled By: riversand963 fbshipit-source-id: 25922971b7e67bacf4d53a1fb67c4c5ddaa61573 |
|
Peter Dillinger | 230660be73 |
Improve / clean up meta block code & integrity (#9163)
Summary: * Checksums are now checked on meta blocks unless specifically suppressed or not applicable (e.g. plain table). (Was other way around.) This means a number of cases that were not checking checksums now are, including direct read TableProperties in Version::GetTableProperties (fixed in meta_blocks ReadTableProperties), reading any block from PersistentCache (fixed in BlockFetcher), read TableProperties in SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open, maybe more. * For that to work, I moved the global_seqno+TableProperties checksum logic to the shared table/ code, because that is used by many utilies such as SstFileDumper. * Also for that to work, we have to know when we're dealing with a block that has a checksum (trailer), so added that capability to Footer based on magic number, and from there BlockFetcher. * Knowledge of trailer presence has also fixed a problem where other table formats were reading blocks including bytes for a non-existant trailer--and awkwardly kind-of not using them, e.g. no shared code checking checksums. (BlockFetcher compression type was populated incorrectly.) Now we only read what is needed. * Minimized code duplication and differing/incompatible/awkward abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block without parsing block handle) * Moved some meta block handling code from table_properties*.* * Moved some code specific to block-based table from shared table/ code to BlockBasedTable class. The checksum stuff means we can't completely separate it, but things that don't need to be in shared table/ code should not be. * Use unique_ptr rather than raw ptr in more places. (Note: you can std::move from unique_ptr to shared_ptr.) Without enhancements to GetPropertiesOfAllTablesTest (see below), net reduction of roughly 100 lines of code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163 Test Plan: existing tests and * Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that checksums are now checked on direct read of table properties by TableCache (new test would fail before this change) * Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test putting table properties under old meta name * Also generally enhanced that same test to actually test what it was supposed to be testing already, by kicking things out of table cache when we don't want them there. Reviewed By: ajkr, mrambacher Differential Revision: D32514757 Pulled By: pdillinger fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9 |
|
Hui Xiao | 74544d582f |
Account Bloom/Ribbon filter construction memory in global memory limit (#9073)
Summary: Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge. **Context:** Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`. The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option. **Summary:** - Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction: - hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance), - banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`), - final filter (i.e, `mutable_buf`'s size). - Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table. - Added option fo run `filter_bench` with this memory reservation feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073 Test Plan: - Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory` - To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally - Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter - Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below: - FastLocalBloom - baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`: - **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0) - new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`: - **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` : - **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0) - Ribbon128 - baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`: - **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0) - new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `: - **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` : - **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0) - And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console. Reviewed By: pdillinger Differential Revision: D31991348 Pulled By: hx235 fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e |
|
Peter Dillinger | f8c685c4fc |
Check for and disallow shared key space in block caches (#9172)
Summary: We have three layers of block cache that often use the same key but map to different physical data: * BlockBasedTableOptions::block_cache * BlockBasedTableOptions::block_cache_compressed * BlockBasedTableOptions::persistent_cache If any two of these happen to share an underlying implementation and key space (insertion into one shows up in another), then memory safety is broken. The simplest case is block_cache == block_cache_compressed. (Credit mrambacher for asking about this case in a review.) With this change, we explicitly check for overlap and preemptively and safely fail with a Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9172 Test Plan: test added. Crashes without new check Reviewed By: anand1976 Differential Revision: D32465659 Pulled By: pdillinger fbshipit-source-id: 3876b45b6dce6167e5a7a642725ddc86b96f8e40 |
|
Akanksha Mahajan | 17ce1ca48b |
Reuse internal auto readhead_size at each Level (expect L0) for Iterations (#9056)
Summary: RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB. This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block. 1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB. 2. It maintains readahead_size for L1 - Ln levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056 Test Plan: Added new unit tests Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled. Reviewed By: anand1976 Differential Revision: D31773640 Pulled By: akankshamahajan15 fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98 |
|
anand76 | dddb791c18 |
Enable a few unit tests to use custom Env objects (#9087)
Summary: Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087 Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc. Reviewed By: riversand963 Differential Revision: D32007222 Pulled By: anand1976 fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792 |
|
Hui Xiao | 3018a3e27e |
Minor improvement to CacheReservationManager/WriteBufferManager/CompressionDictBuilding (#9139)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9139 Reviewed By: zhichao-cao Differential Revision: D32211415 Pulled By: hx235 fbshipit-source-id: 39ce036ba34e1fb4a1992a33ac6904a4a943301d |
|
Hui Xiao | 1ababeb76a |
Deallocate payload of BlockBasedTableBuilder::Rep::FilterBlockBuilder earlier for Full/PartitionedFilter (#9070)
Summary: Note: This PR is the 1st part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073). Context: Previously, the payload (i.e, filter data) within `BlockBasedTableBuilder::Rep::FilterBlockBuilder` object is not deallocated until `BlockBasedTableBuilder` is deallocated, despite it is no longer useful after its related `filter_content` being written. - Transferred the payload (i.e, the filter data) out of `BlockBasedTableBuilder::Rep::FilterBlockBuilder` object - For PartitionedFilter: - Unified `filters` and `filter_gc` lists into one `std::deque<FilterEntry> filters` by adding a new field `last_filter_entry_key` and storing the `std::unique_ptr filter_data` with the `Slice filter` in the same entry - Reset `last_filter_data` in the case where `filters` is empty, which should be as by then we would've finish using all the `Slice filter` - Deallocated the payload by going out of scope as soon as we're done with using the `filter_content` associated with the payload - This is an internal interface change at the level of `FilterBlockBuilder::Finish()`, which leads to touching the inherited interface in `BlockBasedFilterBlockBuilder`. But for that, the payload transferring is ignored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9070 Test Plan: - The main focus is to catch segment fault error during `FilterBlockBuilder::Finish()` and `BlockBasedTableBuilder::Finish()` and interface mismatch. Relying on existing CI tests is enough as `assert(false)` was temporarily added to verify the new logic of transferring ownership indeed run Reviewed By: pdillinger Differential Revision: D31884933 Pulled By: hx235 fbshipit-source-id: f73ecfbea13788d4fc058013ace27230110b52f4 |
|
Peter Dillinger | dfedc74d82 |
Some checksum code refactoring (#9113)
Summary: To prepare for adding checksum to footer and "context aware" checksums. This also brings closely related code much closer together. Recently added `BlockBasedTableBuilder::ComputeBlockTrailer` for testing is made obsolete in the refactoring, as testing the checksums can happen at a lower level of abstraction. Also now checking for unrecognized checksum type on reading footer, rather than later on use. Also removed an obsolete function delcaration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9113 Test Plan: existing tests worked before refactoring to remove `ComputeBlockTrailer`. And then refactored+improved tests using it. Reviewed By: mrambacher Differential Revision: D32090149 Pulled By: pdillinger fbshipit-source-id: 2879da683c1498ea85a3b70dace9b6d9f6b47b6e |
|
hx235 | a5ec5e3ea0 |
Minor improvement to #8428 (Account for dictionary-building buffer in global memory limit) (#9032)
Summary: Summary/Context: - Renamed `cache_rev_mng` to `compression_dict_buffer_cache_res_mgr` - It is to distinguish with other potential `cache_res_mgr` in `BlockBasedTableBuilder` and to use correct short-hand for the words "reservation", "manager" - Added `table_options.block_cache == nullptr` in additional to `table_options.no_block_cache == true` to be conditions where we don't create a `CacheReservationManager` - Theoretically `table_options.no_block_cache == true` is equivalent to `table_options.block_cache == nullptr` by API. But since segment fault will be generated by passing `nullptr` into `CacheReservationManager`'s constructor, it does not hurt to directly verify `table_options.block_cache != nullptr` before passing in - Renamed `is_cache_full` to `exceeds_global_block_cache_limit` - It is to hide implementation detail of cache reservation and to emphasize on the concept/design intent of caping memory within global block cache limit Pull Request resolved: https://github.com/facebook/rocksdb/pull/9032 Test Plan: - Passing existing tests Reviewed By: ajkr Differential Revision: D32005807 Pulled By: hx235 fbshipit-source-id: 619fd17bb924199de3db5924d8ab7dae53b1efa2 |
|
Peter Dillinger | a7d4bea43a |
Implement XXH3 block checksum type (#9069)
Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet. Existing functionality change: * reject bad ChecksumType in options with InvalidArgument This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069 Test Plan: tests updated, and substantially expanded. Unit tests now check that we don't accidentally change the values generated by the checksum algorithms ("schema test") and that we properly handle invalid/unrecognized checksum types in options or in file footer. DBTestBase::ChangeOptions (etc.) updated from two to one configuration changing from default CRC32c ChecksumType. The point of this test code is to detect possible interactions among features, and the likelihood of some bad interaction being detected by including configurations other than XXH3 and CRC32c--and then not detected by stress/crash test--is extremely low. Stress/crash test also updated (manual run long enough to see it accepts new checksum type). db_bench also updated for microbenchmarking checksums. ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3 crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op) xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op) xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op) xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op) crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op) xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op) xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op) xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op) crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op) xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op) xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op) xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op) As you can see, especially once warmed up, xxh3 is fastest. ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) Test for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done Results (ops/sec) for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done results-0 252118 # kNoChecksum results-1 251588 # kCRC32c results-2 251863 # kxxHash results-3 252016 # kxxHash64 results-4 252038 # kXXH3 Reviewed By: mrambacher Differential Revision: D31905249 Pulled By: pdillinger fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1 |
|
Peter Dillinger | 5bf9a7d5ee |
Clarify caching behavior for index and filter partitions (#9068)
Summary: Somewhat confusingly, index and filter partition blocks are never owned by table readers, even with cache_index_and_filter_blocks=false. They still go into block cache (possibly pinned by table reader) if there is a block cache. If no block cache, they are only loaded transiently on demand. This PR primarily clarifies the options APIs and some internal code comments. Also, this closes a hypothetical data corruption vulnerability where some but not all index partitions are pinned. I haven't been able to reproduce a case where it can happen (the failure seems to propagate to abort table open) but it's worth patching nonetheless. Fixes https://github.com/facebook/rocksdb/issues/8979 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9068 Test Plan: existing tests :-/ I could cover the new code using sync points, but then I'd have to very carefully relax my `assert(false)` Reviewed By: ajkr Differential Revision: D31898284 Pulled By: pdillinger fbshipit-source-id: f2511a7d3a36bc04b627935d8e6cfea6422f98be |
|
Zhichao Cao | 6d93b87588 |
Add lowest_used_cache_tier to ImmutableDBOptions to enable or disable Secondary Cache (#9050)
Summary: Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050 Test Plan: added new tests Reviewed By: anand1976 Differential Revision: D31744769 Pulled By: zhichao-cao fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6 |
|
Peter Dillinger | b234a3f569 |
Improve data block construction performance (#9040)
Summary: ... by bypassing tracking of last_key in BlockBuilder when last_key is already known (for BlockBasedTableBuilder::data_block). I tried extracting a base class of BlockBuilder without the last_key tracking at all, but that became complicated by NewFlushBlockPolicy() in the public API referencing BlockBuilder, which would need to be the base class, and I don't want to replace nearly all the internal references to BlockBuilder. Possible follow-up: * Investigate / consider using AddWithLastKey in more places This improvement should stack with https://github.com/facebook/rocksdb/issues/9039 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9040 Test Plan: TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000 Compiled with DEBUG_LEVEL=0 Test vs. control runs simulaneous for better accuracy, units = ops/sec Run 1: 278929 vs. 267799 (+4.2%) Run 2: 281836 vs. 267432 (+5.4%) Run 3: 278279 vs. 270454 (+2.9%) (This benchmark is chosen to have detectable signal-to-noise, not to represent expected improvement percent on real workloads.) Reviewed By: mrambacher Differential Revision: D31706033 Pulled By: pdillinger fbshipit-source-id: 8a50fe6fefdd67b6d7665ffa687bbdcf5ad0d5ec |
|
Peter Dillinger | ad5325a736 |
Experimental support for SST unique IDs (#8990)
Summary: * New public header unique_id.h and function GetUniqueIdFromTableProperties which computes a universally unique identifier based on table properties of table files from recent RocksDB versions. * Generation of DB session IDs is refactored so that they are guaranteed unique in the lifetime of a process running RocksDB. (SemiStructuredUniqueIdGen, new test included.) Along with file numbers, this enables SST unique IDs to be guaranteed unique among SSTs generated in a single process, and "better than random" between processes. See https://github.com/pdillinger/unique_id * In addition to public API producing 'external' unique IDs, there is a function for producing 'internal' unique IDs, with functions for converting between the two. In short, the external ID is "safe" for things people might do with it, and the internal ID enables more "power user" features for the future. Specifically, the external ID goes through a hashing layer so that any subset of bits in the external ID can be used as a hash of the full ID, while also preserving uniqueness guarantees in the first 128 bits (bijective both on first 128 bits and on full 192 bits). Intended follow-up: * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into the third 64-bit value of the unique ID.) * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990 Test Plan: Unit tests added, and checking of unique ids in stress test. NOTE in stress test we do not generate nearly enough files to thoroughly stress uniqueness, but the test trims off pieces of the ID to check for uniqueness so that we can infer (with some assumptions) stronger properties in the aggregate. Reviewed By: zhichao-cao, mrambacher Differential Revision: D31582865 Pulled By: pdillinger fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243 |
|
Jay Zhuang | b4326b5273 |
Fix gcc-11 compile error (#9043)
Summary: gcc11 added new static check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9043 Test Plan: Added CI for gcc11 build Reviewed By: zhichao-cao Differential Revision: D31716005 Pulled By: jay-zhuang fbshipit-source-id: 9f53be6f2f9e58e39b83359f6bbf66f945d57429 |
|
Peter Dillinger | 9d66d6d13e |
Two performance improvements in BlockBuilder (#9039)
Summary: Primarily, this change reserves space in the std::string for building the next block once a block is finished, using `block_size` as reservation size. Note: also tried reusing same std::string in the common "unbuffered" path but that showed no benefit or regression. Secondarily, this slightly reduces the work in resetting `restarts_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9039 Test Plan: TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000 Compiled with DEBUG_LEVEL=0 Test vs. control runs simulaneous for better accuracy, units = ops/sec Run 1, Primary change only: 292697 vs. 280267 (+4.4%) Run 2, Primary change only: 288763 vs. 279621 (+3.3%) Run 1, Secondary change only: 260065 vs. 254232 (+2.3%) Run 2, Secondary change only: 275925 vs. 272248 (+1.4%) Run 1, Both changes: 284890 vs. 270372 (+5.3%) Run 2, Both changes: 263511 vs. 258188 (+2.0%) Reviewed By: zhichao-cao Differential Revision: D31701253 Pulled By: pdillinger fbshipit-source-id: 7e40810afbb98e6b6446955e77bda59e69b19ffd |
|
leipeng | 4c277ab201 |
MergingIterator: rearrange fields to reduce paddings (#9024)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9024 Reviewed By: pdillinger Differential Revision: D31614752 Pulled By: ajkr fbshipit-source-id: ef19ae243127f992e982a5a3b8ddefe7946246f8 |
|
Andrew Kryczka | c0ec58ecb9 |
stop populating unused/invalid MergingIterator heaps (#8975)
Summary: I was looking at https://github.com/facebook/rocksdb/issues/2636 and got very confused that `MergingIterator::AddIterator()` is populating `min_heap_` with dangling pointers. There is justification in the comments that `min_heap_` will be cleared before it's used, but it'd be cleaner to not populate it with dangling pointers in the first place. Also made similar change in the constructor for consistency, although the pointers there would not be dangling, just unused. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8975 Test Plan: rely on existing tests Reviewed By: pdillinger, hx235 Differential Revision: D31273767 Pulled By: ajkr fbshipit-source-id: 127ca9dd1f82f77f55dd0c3f19511de3282fc229 |
|
Zhichao Cao | 699f45049d |
Introduce a mechanism to dump out blocks from block cache and re-insert to secondary cache (#8912)
Summary: Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB. Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912 Test Plan: add new tests to lru_cache_test and pass make check. Reviewed By: pdillinger Differential Revision: D31452871 Pulled By: zhichao-cao fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48 |
|
mrambacher | 53e595d1f3 |
Cleanup multiple implementations of VectorIterator (#8901)
Summary: There were three implementations of VectorIterator (util/vector_iterator, test_util/testutil.h and LoggingForwardVectorIterator). Merged them into one class to increase code coverage/testing and reduce duplication. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8901 Reviewed By: pdillinger Differential Revision: D31022673 Pulled By: mrambacher fbshipit-source-id: 8e3acbd2dfd60b4df609d02cc72846de2389d531 |
|
mrambacher | 13ae16c315 |
Cleanup includes in dbformat.h (#8930)
Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d |
|
Hui Xiao | d6bd1a0291 |
Support "level_at_creation" in TablePropertiesCollectorFactory::Context (#8919)
Summary: Context: Exposing the level of the sst file (i.e, table) where it is created in `TablePropertiesCollectorFactory::Context` allows users of `TablePropertiesCollectorFactory` to customize some implementation details of `TablePropertiesCollectorFactory` and `TablePropertiesCollector` based on the level of creation. For example, `TablePropertiesCollector::NeedCompact()` can return different values based on level of creation. - Declared an extra field `level_at_creation` in `TablePropertiesCollectorFactory::Context` - Allowed `level_at_creation` to be passed in as an argument in `IntTblPropCollectorFactory::CreateIntTblPropCollector()` and `UserKeyTablePropertiesCollectorFactory::CreateIntTblPropCollector()`, the latter of which is an internal wrapper of user's passed-in `TablePropertiesCollectorFactory::CreateTablePropertiesCollector()` used in table-building process - Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` passed into both `BlockBasedTableBuilder` and `PlainTableBuilder` - `PlainTableBuilder` previously did not capture `level_at_creation` from `TableBuilderOptions` in `PlainTableFactory`. In order for it to call the method with this parameter, this PR also made `PlainTableBuilder` capture `level_at_creation` as a required parameter - Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` its overridden functions in its derived classes, including `RegularKeysStartWithAFactory::CreateIntTblPropCollector()` in `table_properties_collector_test.cc`, `SstFileWriterPropertiesCollectorFactory::CreateIntTblPropCollector()` in `sst_file_writer_collectors.h` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8919 Test Plan: - Passed the added assertion for `context.level_at_creation` - Passed existing tests - Run `Make` to make sure adding a required parameter to `PlainTableBuilder`'s constructor does not break anything Reviewed By: anand1976 Differential Revision: D30951729 Pulled By: hx235 fbshipit-source-id: c4a0173b0d9344a4cf47e1b987d759c1c73cb474 |
|
ricky | b59b7570cf |
More clear error message on uncompressing block (#8934)
Summary: The origin error message of uncompressing block is confusing, which may result from either build support or data corruption. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8934 Reviewed By: ltamasi Differential Revision: D31112588 Pulled By: pdillinger fbshipit-source-id: 1cbf2d4fbcb0ef376cf942246d06f48cb603f852 |
|
mrambacher | e0f697d2bd |
Make SliceTransform into a Customizable class (#8641)
Summary: Made SliceTransform into a Customizable class. Would be nice to write a test that stored and used a custom transform in an SST table. There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter. Is this expected? Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641 Reviewed By: zhichao-cao Differential Revision: D31142793 Pulled By: mrambacher fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1 |
|
Yanqin Jin | b512f4bc76 |
Batch blob read IO for MultiGet (#8699)
Summary: In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()` to read the blobs instead of issuing multiple `Read()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8699 Test Plan: ``` make check ``` Reviewed By: ltamasi Differential Revision: D31030861 Pulled By: riversand963 fbshipit-source-id: a0df6060cbfd54cff9515a4eee08807b1dbcb0c8 |
|
Peter Dillinger | 2819c7840e |
Fix PrepopulateBlockCache::kFlushOnly (#8750)
Summary: kFlushOnly currently means "always" except in the case of remote compaction. This makes it flushes only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8750 Test Plan: test updated Reviewed By: akankshamahajan15 Differential Revision: D30968034 Pulled By: pdillinger fbshipit-source-id: 5dbd24dde18852a0e937a540995fba9bfbe89037 |
|
anand76 | 7743f033b1 |
More robust checking of IO uring completion data (#8894)
Summary: Potential bugs in the IO uring implementation can cause bad data to be returned in the completion queue. Add some checks in the PosixRandomAccessFile::MultiRead completion handling code to catch such errors and fail the entire MultiRead. Also log some diagnostic messages and stack trace. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8894 Reviewed By: siying, pdillinger Differential Revision: D30826982 Pulled By: anand1976 fbshipit-source-id: af91815ac760e095d6cc0466cf8bd5c10167fd15 |
|
Peter Dillinger | bda8d93ba9 |
Fix and detect headers with missing dependencies (#8893)
Summary: It's always annoying to find a header does not include its own dependencies and only works when included after other includes. This change adds `make check-headers` which validates that each header can be included at the top of a file. Some headers are excluded e.g. because of platform or external dependencies. rocksdb_namespace.h had to be re-worked slightly to enable checking for failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.) Fixes mostly involve adding and cleaning up #includes, but for FileTraceWriter, a constructor was out-of-lined to make a forward declaration sufficient. This check is not currently run with `make check` but is added to CircleCI build-linux-unity since that one is already relatively fast. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893 Test Plan: existing tests and resolving issues detected by new check Reviewed By: mrambacher Differential Revision: D30823300 Pulled By: pdillinger fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572 |
|
Levi Tamasi | 7e78d7c540 |
Support timestamps in SstFileWriter (#8899)
Summary: As a first step of supporting user-defined timestamps with ingestion, the patch adds timestamp support to `SstFileWriter`; namely, it adds new versions of the `Put` and `Delete` APIs that take timestamps. (`Merge` and `DeleteRange` are currently not supported with user-defined timestamps in general but once those features are implemented, we can handle them in `SstFileWriter` in a similar fashion.) The new APIs validate the size of the timestamp provided by the client. Similarly, calls to the pre-existing timestamp-less APIs are now disallowed when user-defined timestamps are in use according to the comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8899 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D30850699 Pulled By: ltamasi fbshipit-source-id: 779154373618f19b8f0797976bb7286783c57b67 |
|
Hui Xiao | 91b95cadee |
Account for dictionary-building buffer in global memory limit (#8428)
Summary: Context: Some data blocks are temporarily buffered in memory in BlockBasedTableBuilder for building compression dictionary used in data block compression. Currently this memory usage is not counted toward our global memory usage utilizing block cache capacity. To improve that, this PR charges that memory usage into the block cache to achieve better memory tracking and limiting. - Reserve memory in block cache for buffered data blocks that are used to build a compression dictionary - Release all the memory associated with buffering the data blocks mentioned above in EnterUnbuffered(), which is called when (a) buffer limit is exceeded after buffering OR (b) the block cache becomes full after reservation OR (c) BlockBasedTableBuilder calls Finish() Pull Request resolved: https://github.com/facebook/rocksdb/pull/8428 Test Plan: - Passing existing unit tests - Passing new unit tests Reviewed By: ajkr Differential Revision: D30755305 Pulled By: hx235 fbshipit-source-id: 6e66665020b775154a94c4c5e0f2adaeaff13981 |