mirror of https://github.com/facebook/rocksdb.git
371 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Peter Dillinger | e01073252b |
Tests verifying non-zero checksums of zero bytes (#11260)
Summary: Adds unit tests verifying that a block payload and checksum of all zeros is not falsely considered valid data. The test exhaustively checks that for blocks up to some length (default 20K, more exhaustively 10M) of all zeros do not produce a block checksum of all zeros. Also small refactoring of an existing checksum test to use parameterized test. (Suggest hiding whitespace changes for review.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11260 Test Plan: this is the test, manual run with `ROCKSDB_THOROUGH_CHECKSUM_TEST=1` to verify up to 10M. Reviewed By: hx235 Differential Revision: D43706192 Pulled By: pdillinger fbshipit-source-id: 95e721c320ca928e7fa2400c2570fb359cc30b1f |
|
sdong | 4720ba4391 |
Remove RocksDB LITE (#11147)
Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2 |
|
Peter Dillinger | 433d7e4594 |
Improve error messages for SST footer and size errors (#11009)
Summary: Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure: * Magic number error is reported first if we attempt to open an SST file and the footer is completely bad. * Footer errors are reported with affected file. * If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009 Test Plan: unit tests added, some manual Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed). Reviewed By: siying Differential Revision: D41656272 Pulled By: pdillinger fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df |
|
anand76 | 727bad78b8 |
Format files under table/ by clang-format (#10852)
Summary: Run clang-format on files under the `table` directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10852 Reviewed By: ajkr Differential Revision: D40650732 Pulled By: anand1976 fbshipit-source-id: 2023a958e37fd6274040c5181130284600c9e0ef |
|
Peter Dillinger | 27c9705ac4 |
Use kXXH3 as default checksum (CPU efficiency) (#10778)
Summary: Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware. A current DB performance comparison using buck+clang build: ``` TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal ``` kXXH3 (+0.2% DB write throughput): `fillseq [AVG 1000 runs] : 822149 (± 1004) ops/sec; 91.0 (± 0.1) MB/sec` kCRC32c: `fillseq [AVG 1000 runs] : 820484 (± 1203) ops/sec; 90.8 (± 0.1) MB/sec` Micro benchmark comparison: ``` ./db_bench --benchmarks=xxh3[-X20],crc32c[-X20] ``` Machine 1, buck+clang build: `xxh3 [AVG 20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec` `crc32c [AVG 20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec` Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0: `xxh3 [AVG 20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec` `crc32c [AVG 20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778 Test Plan: make check, unit tests updated Reviewed By: ajkr Differential Revision: D40112510 Pulled By: pdillinger fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37 |
|
akankshamahajan | 0e7b27bfcf |
Refactor block cache tracing APIs (#10811)
Summary: Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter. `DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`. New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer. This same philosophy can be applied to KV and IO tracing as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811 Test Plan: existing unit tests Old API DB::StartBlockTrace checked with db_bench tool create database ``` ./db_bench --benchmarks="fillseq" \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 ``` To trace block cache accesses when running readrandom benchmark: ``` ./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \ --threads=16 \ -block_cache_trace_file="/tmp/binary_trace_test_example" \ -block_cache_trace_max_trace_file_size_in_bytes=1073741824 \ -block_cache_trace_sampling_frequency=1 ``` Reviewed By: anand1976 Differential Revision: D40435289 Pulled By: akankshamahajan15 fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282 |
|
Peter Dillinger | 6de7081cf3 |
Always verify SST unique IDs on SST file open (#10532)
Summary: Although we've been tracking SST unique IDs in the DB manifest unconditionally, checking has been opt-in and with an extra pass at DB::Open time. This changes the behavior of `verify_sst_unique_id_in_manifest` to check unique ID against manifest every time an SST file is opened through table cache (normal DB operations), replacing the explicit pass over files at DB::Open time. This change also enables the option by default and removes the "EXPERIMENTAL" designation. One possible criticism is that the option no longer ensures the integrity of a DB at Open time. This is far from an all-or-nothing issue. Verifying the IDs of all SST files hardly ensures all the data in the DB is readable. (VerifyChecksum is supposed to do that.) Also, with max_open_files=-1 (default, extremely common), all SST files are opened at DB::Open time anyway. Implementation details: * `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass that is now removed. * Unit tests that manipulate/corrupt table properties have to opt out of this check, because that corrupts the "actual" unique id. (And even for testing we don't currently have a mechanism to set "no unique id" in the in-memory file metadata for new files.) * A lot of other unit test churn relates to (a) default checking on, and (b) checking on SST open even without DB::Open (e.g. on flush) * Use `FileMetaData` for more `TableCache` operations (in place of `FileDescriptor`) so that we have access to the unique_id whenever we might need to open an SST file. **There is the possibility of performance impact because we can no longer use the more localized `fd` part of an `FdWithKeyRange` but instead follow the `file_metadata` pointer. However, this change (possible regression) is only done for `GetMemoryUsageByTableReaders`.** * Removed a completely unnecessary constructor overload of `TableReaderOptions` Possible follow-up: * Verification only happens when opening through table cache. Are there more places where this should happen? * Improve error message when there is a file size mismatch vs. manifest (FIXME added in the appropriate place). * I'm not sure there's a justification for `FileDescriptor` to be distinct from `FileMetaData`. * I'm skeptical that `FdWithKeyRange` really still makes sense for optimizing some data locality by duplicating some data in memory, but I could be wrong. * An unnecessary overload of NewTableReader was recently added, in the public API nonetheless (though unusable there). It should be cleaned up to put most things under `TableReaderOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532 Test Plan: updated unit tests Performance test showing no significant difference (just noise I think): `./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000` Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec Reviewed By: jay-zhuang Differential Revision: D38765551 Pulled By: pdillinger fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2 |
|
Levi Tamasi | 81388b36e0 |
Add support for wide-column point lookups (#10540)
Summary: The patch adds a new API `GetEntity` that can be used to perform wide-column point lookups. It also extends the `Get` code path and the `MemTable` / `MemTableList` and `Version` / `GetContext` logic accordingly so that wide-column entities can be served from both memtables and SSTs. If the result of a lookup is a wide-column entity (`kTypeWideColumnEntity`), it is passed to the application in deserialized form; if it is a plain old key-value (`kTypeValue`), it is presented as a wide-column entity with a single default (anonymous) column. (In contrast, regular `Get` returns plain old key-values as-is, and returns the value of the default column for wide-column entities, see https://github.com/facebook/rocksdb/issues/10483 .) The result of `GetEntity` is a self-contained `PinnableWideColumns` object. `PinnableWideColumns` contains a `PinnableSlice`, which either stores the underlying data in its own buffer or holds on to a cache handle. It also contains a `WideColumns` instance, which indexes the contents of the `PinnableSlice`, so applications can access the values of columns efficiently. There are several pieces of functionality which are currently not supported for wide-column entities: there is currently no `MultiGetEntity` or wide-column iterator; also, `Merge` and `GetMergeOperands` are not supported, and there is no `GetEntity` implementation for read-only and secondary instances. We plan to implement these in future PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D38847474 Pulled By: ltamasi fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b |
|
Changyu Bi | 9d77bf8f7b |
Fragment memtable range tombstone in the write path (#10380)
Summary: - Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact. - db_bench is updated to print out the number of range deletions executed if there is any. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380 Test Plan: - CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed. - Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable. ``` single thread: ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100 multi_thread ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100 ``` Commit |
|
sdong | 252bea405e |
Improve SubCompaction Partitioning (#10393)
Summary: Unit tests still haven't been fixed. Also need to add more tests. But I ran some simple fillrandom db_bench and the partitioning feels reasonable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393 Test Plan: 1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable; 2. Add a new unit test to ApproximateKeyAnchors() 3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly. Reviewed By: jay-zhuang Differential Revision: D38043783 fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f |
|
Bo Wang | c073ed7601 |
Fix typo in comments and code (#10233)
Summary: Fix typo in comments and code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233 Test Plan: Existing unit tests should pass. Reviewed By: jay-zhuang, anand1976 Differential Revision: D37356702 Pulled By: gitbw95 fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51 |
|
Peter Dillinger | 126c223714 |
Remove deprecated block-based filter (#10184)
Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80 |
|
Jay Zhuang | a96a4a2f7b |
Fix ApproximateOffsetOfCompressed test (#10048)
Summary: https://github.com/facebook/rocksdb/issues/9857 introduced new an option `use_zstd_dict_trainer`, which is stored in SST as text, e.g.: ``` ... zstd_max_train_bytes=0; enabled=0;... ``` it increased the sst size a little bit and cause `ApproximateOffsetOfCompressed` test to fail: ``` Value 7053 is not in range [4000, 7050] table/table_test.cc:4019: Failure Value of: Between(c.ApproximateOffsetOf("xyz"), 4000, 7050) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10048 Test Plan: verified the test pass after the change Reviewed By: cbi42 Differential Revision: D36643688 Pulled By: jay-zhuang fbshipit-source-id: bf12d211f6ae71937259ef21b1226bd06e8da717 |
|
Peter Dillinger | 0070680cfd |
Adjust public APIs to prefer 128-bit SST unique ID (#10009)
Summary: 128 bits should suffice almost always and for tracking in manifest. Note that this changes the output of sst_dump --show_properties to only show 128 bits. Also introduces InternalUniqueIdToHumanString for presenting internal IDs for debugging purposes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10009 Test Plan: unit tests updated Reviewed By: jay-zhuang Differential Revision: D36458189 Pulled By: pdillinger fbshipit-source-id: 93ebc4a3b6f9c73ee154383a1f8b291a5d6bbef5 |
|
Hui Xiao | 3573558ec5 |
Rewrite memory-charging feature's option API (#9926)
Summary: **Context:** Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`. Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags. **Summary:** - Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that - Added missing db bench/stress test for some memory charging features - Renamed related test suite to indicate they are under the same theme of memory charging - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926 Test Plan: - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)` - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)` - CI - db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465** 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078** - db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36054712 Pulled By: hx235 fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af |
|
sdong | 736a7b5433 |
Remove own ToString() (#9955)
Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471 |
|
sdong | 33742c2a9f |
Remove BlockBasedTableOptions.hash_index_allow_collision (#9454)
Summary: BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D33805827 fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d |
|
mrambacher | 30b08878d8 |
Make FilterPolicy Customizable (#9590)
Summary: Make FilterPolicy into a Customizable class. Allow new FilterPolicy to be discovered through the ObjectRegistry Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590 Reviewed By: pdillinger Differential Revision: D34327367 Pulled By: mrambacher fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190 |
|
Andrew Kryczka | babe56ddba |
Add rate limiter priority to ReadOptions (#9424)
Summary: Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working. `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`. There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads). The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424 Test Plan: - new unit tests - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart. - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true` - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true` - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10` Reviewed By: hx235 Differential Revision: D33747386 Pulled By: ajkr fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c |
|
Peter Dillinger | 479eb1aad6 |
Hide deprecated, inefficient block-based filter from public API (#9535)
Summary: This change removes the ability to configure the deprecated, inefficient block-based filter in the public API. Options that would have enabled it now use "full" (and optionally partitioned) filters. Existing block-based filters can still be read and used, and a "back door" way to build them still exists, for testing and in case of trouble. About the only way this removal would cause an issue for users is if temporary memory for filter construction greatly increases. In HISTORY.md we suggest a few possible mitigations: partitioned filters, smaller SST files, or setting reserve_table_builder_memory=true. Or users who have customized a FilterPolicy using the CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade their code. (It's long past time for people to move to the new builder/reader customization interface.) This change also introduces some internal-use-only configuration strings for testing specific filter implementations while bypassing some compatibility / intelligence logic. This is intended to hint at a path toward making FilterPolicy Customizable, but it also gives us a "back door" way to configure block-based filter. Aside: updated db_bench so that -readonly implies -use_existing_db Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535 Test Plan: Unit tests updated. Specifically, * BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back door configuration interface and ignoring of `use_block_based_builder`. * BlockBasedTableTest.TracingGetTest is migrated from testing block-based filter access pattern to full filter access patter, by re-ordering some things. * Options test (pretty self-explanatory) Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30` Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB No consistent difference with fillrandom Reviewed By: jay-zhuang Differential Revision: D34153871 Pulled By: pdillinger fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738 |
|
Peter Dillinger | fc9d4071f0 |
Fast path for detecting unchanged prefix_extractor (#9407)
Summary: Fixes a major performance regression in 6.26, where extra CPU is spent in SliceTransform::AsString when reads involve a prefix_extractor (Get, MultiGet, Seek). Common case performance is now better than 6.25. This change creates a "fast path" for verifying that the current prefix extractor is unchanged and compatible with what was used to generate a table file. This fast path detects the common case by pointer comparison on the current prefix_extractor and a "known good" prefix extractor (if applicable) that is saved at the time the table reader is opened. The "known good" prefix extractor is saved as another shared_ptr copy (in an existing field, however) to ensure the pointer is not recycled. When the prefix_extractor has changed to a different instance but same compatible configuration (rare, odd), performance is still a regression compared to 6.25, but this is likely acceptable because of the oddity of such a case. The performance of incompatible prefix_extractor is essentially unchanged. Also fixed a minor case (ForwardIterator) where a prefix_extractor could be used via a raw pointer after being freed as a shared_ptr, if replaced via SetOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407 Test Plan: ## Performance Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load) v6.26: 4833 vs. 6698 (<- major regression!) v6.27: 4737 vs. 6397 (still) New: 6704 vs. 6461 (better than baseline in common case) Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible) Changed prefix size (no usable filter) in new: 787 vs. 5927 Changed prefix size (no usable filter) in new & baseline: 773 vs. 784 Reviewed By: mrambacher Differential Revision: D33677812 Pulled By: pdillinger fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76 |
|
mrambacher | 9a116ab4b4 |
Add NewMetaDataIterator method (#8692)
Summary: Fixes a problem where the iterator for metadata was being treated as a non-user key when in fact it was a user key. This led to a problem where the property keys could not be searched for correctly. The main exposure of this problem was that the HashIndexReader could not get the "prefixes" property correctly, resulting in the failure of retrieval/creation of the BlockPrefixIndex. Added BlockBasedTableTest.SeekMetaBlocks test to validate this condition. Fixing this condition exposed two other tests (SeekWithPrefixLongerThanKey, MultiGetPrefixFilter) that passed incorrectly previously and now failed. Updated those two tests to pass. Not sure if the tests are functionally correct/still appropriate, but made them pass... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8692 Reviewed By: riversand963 Differential Revision: D33119539 Pulled By: mrambacher fbshipit-source-id: 658969fe9265f73dc184dab97cc3f4eaed2d881a |
|
mrambacher | 423538a816 |
Make MemoryAllocator into a Customizable class (#8980)
Summary: - Make MemoryAllocator and its implementations into a Customizable class. - Added a "DefaultMemoryAllocator" which uses new and delete - Added a "CountedMemoryAllocator" that counts the number of allocs and free - Updated the existing tests to use these new allocators - Changed the memkind allocator test into a generic test that can test the various allocators. - Added tests for creating all of the allocators - Added tests to verify/create the JemallocNodumpAllocator using its options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980 Reviewed By: zhichao-cao Differential Revision: D32990403 Pulled By: mrambacher fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76 |
|
Peter Dillinger | 0050a73a4f |
New stable, fixed-length cache keys (#9126)
Summary: This change standardizes on a new 16-byte cache key format for block cache (incl compressed and secondary) and persistent cache (but not table cache and row cache). The goal is a really fast cache key with practically ideal stability and uniqueness properties without external dependencies (e.g. from FileSystem). A fixed key size of 16 bytes should enable future optimizations to the concurrent hash table for block cache, which is a heavy CPU user / bottleneck, but there appears to be measurable performance improvement even with no changes to LRUCache. This change replaces a lot of disjointed and ugly code handling cache keys with calls to a simple, clean new internal API (cache_key.h). (Preserving the old cache key logic under an option would be very ugly and likely negate the performance gain of the new approach. Complete replacement carries some inherent risk, but I think that's acceptable with sufficient analysis and testing.) The scheme for encoding new cache keys is complicated but explained in cache_key.cc. Also: EndianSwapValue is moved to math.h to be next to other bit operations. (Explains some new include "math.h".) ReverseBits operation added and unit tests added to hash_test for both. Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126 Test Plan: ### Basic correctness Several tests needed updates to work with the new functionality, mostly because we are no longer relying on filesystem for stable cache keys so table builders & readers need more context info to agree on cache keys. This functionality is so core, a huge number of existing tests exercise the cache key functionality. ### Performance Create db with `TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters` And test performance with `TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4` using DEBUG_LEVEL=0 and simultaneous before & after runs. Before ops/sec, avg over 100 runs: 121924 After ops/sec, avg over 100 runs: 125385 (+2.8%) ### Collision probability I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity over many months, by making some pessimistic simplifying assumptions: * Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys) * All of every file is cached for its entire lifetime We use a simple table with skewed address assignment and replacement on address collision to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output with `./cache_bench -stress_cache_key -sck_keep_bits=40`: ``` Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached) ``` These come from default settings of 2.5M files per day of 32 MB each, and `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality. More default assumptions, relatively pessimistic: * 100 DBs in same process (doesn't matter much) * Re-open DB in same process (new session ID related to old session ID) on average every 100 files generated * Restart process (all new session IDs unrelated to old) 24 times per day After enough data, we get a result at the end: ``` (keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected) ``` If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data: ``` (keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected) (keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected) ``` The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases: ``` 197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected) ``` I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data. Reviewed By: zhichao-cao Differential Revision: D33171746 Pulled By: pdillinger fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f |
|
Peter Dillinger | e92a0ed040 |
Optimize & clean up footer code (#9280)
Summary: Again, ahead of planned changes in https://github.com/facebook/rocksdb/issues/9058. This change improves performance (vs. pre-https://github.com/facebook/rocksdb/issues/9240 baseline) by separating a FooterBuilder from Footer, where FooterBuilder includes (inline owns) the serialized data so that it can be stack allocated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9280 Test Plan: existing tests + performance testing below Extreme case performance testing as in https://github.com/facebook/rocksdb/issues/9240 with TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness) Pre-https://github.com/facebook/rocksdb/issues/9240 baseline ( |
|
Peter Dillinger | 653c392e47 |
More refactoring ahead of footer & meta changes (#9240)
Summary: I'm working on a new format_version=6 to support context checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test updates to support that change. Test coverage data and manual inspection agree on dead code in block_based_table_reader.cc (removed). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240 Test Plan: tests enhanced to cover more cases etc. Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible. TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true} (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness) Before w/ wal: 454512 After w/ wal: 444820 (-2.1%) Before w/o wal: 1004560 After w/o wal: 998897 (-0.6%) Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case. This regression will be corrected in a follow-up PR. Reviewed By: ajkr Differential Revision: D32813769 Pulled By: pdillinger fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f |
|
Hui Xiao | 9daf07305c |
Replace TableProperties::properties_offsets map with external_sst_file_global_seqno_offset (#9212)
Summary: **Context:** Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offeset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties. Note: - See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets` - See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence. **Summary:** - Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212 Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested. Reviewed By: ajkr Differential Revision: D32665941 Pulled By: hx235 fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795 |
|
Levi Tamasi | dc5de45af8 |
Support readahead during compaction for blob files (#9187)
Summary: The patch adds a new BlobDB configuration option `blob_compaction_readahead_size` that can be used to enable prefetching data from blob files during compaction. This is important when using storage with higher latencies like HDDs or remote filesystems. If enabled, prefetching is used for all cases when blobs are read during compaction, namely garbage collection, compaction filters (when the existing value has to be read from a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187 Test Plan: Ran `make check` and the stress/crash test. Reviewed By: riversand963 Differential Revision: D32565512 Pulled By: ltamasi fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d |
|
Peter Dillinger | 230660be73 |
Improve / clean up meta block code & integrity (#9163)
Summary: * Checksums are now checked on meta blocks unless specifically suppressed or not applicable (e.g. plain table). (Was other way around.) This means a number of cases that were not checking checksums now are, including direct read TableProperties in Version::GetTableProperties (fixed in meta_blocks ReadTableProperties), reading any block from PersistentCache (fixed in BlockFetcher), read TableProperties in SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open, maybe more. * For that to work, I moved the global_seqno+TableProperties checksum logic to the shared table/ code, because that is used by many utilies such as SstFileDumper. * Also for that to work, we have to know when we're dealing with a block that has a checksum (trailer), so added that capability to Footer based on magic number, and from there BlockFetcher. * Knowledge of trailer presence has also fixed a problem where other table formats were reading blocks including bytes for a non-existant trailer--and awkwardly kind-of not using them, e.g. no shared code checking checksums. (BlockFetcher compression type was populated incorrectly.) Now we only read what is needed. * Minimized code duplication and differing/incompatible/awkward abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block without parsing block handle) * Moved some meta block handling code from table_properties*.* * Moved some code specific to block-based table from shared table/ code to BlockBasedTable class. The checksum stuff means we can't completely separate it, but things that don't need to be in shared table/ code should not be. * Use unique_ptr rather than raw ptr in more places. (Note: you can std::move from unique_ptr to shared_ptr.) Without enhancements to GetPropertiesOfAllTablesTest (see below), net reduction of roughly 100 lines of code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163 Test Plan: existing tests and * Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that checksums are now checked on direct read of table properties by TableCache (new test would fail before this change) * Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test putting table properties under old meta name * Also generally enhanced that same test to actually test what it was supposed to be testing already, by kicking things out of table cache when we don't want them there. Reviewed By: ajkr, mrambacher Differential Revision: D32514757 Pulled By: pdillinger fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9 |
|
Peter Dillinger | dfedc74d82 |
Some checksum code refactoring (#9113)
Summary: To prepare for adding checksum to footer and "context aware" checksums. This also brings closely related code much closer together. Recently added `BlockBasedTableBuilder::ComputeBlockTrailer` for testing is made obsolete in the refactoring, as testing the checksums can happen at a lower level of abstraction. Also now checking for unrecognized checksum type on reading footer, rather than later on use. Also removed an obsolete function delcaration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9113 Test Plan: existing tests worked before refactoring to remove `ComputeBlockTrailer`. And then refactored+improved tests using it. Reviewed By: mrambacher Differential Revision: D32090149 Pulled By: pdillinger fbshipit-source-id: 2879da683c1498ea85a3b70dace9b6d9f6b47b6e |
|
Peter Dillinger | a7d4bea43a |
Implement XXH3 block checksum type (#9069)
Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet. Existing functionality change: * reject bad ChecksumType in options with InvalidArgument This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069 Test Plan: tests updated, and substantially expanded. Unit tests now check that we don't accidentally change the values generated by the checksum algorithms ("schema test") and that we properly handle invalid/unrecognized checksum types in options or in file footer. DBTestBase::ChangeOptions (etc.) updated from two to one configuration changing from default CRC32c ChecksumType. The point of this test code is to detect possible interactions among features, and the likelihood of some bad interaction being detected by including configurations other than XXH3 and CRC32c--and then not detected by stress/crash test--is extremely low. Stress/crash test also updated (manual run long enough to see it accepts new checksum type). db_bench also updated for microbenchmarking checksums. ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3 crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op) xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op) xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op) xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op) crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op) xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op) xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op) xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op) crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op) xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op) xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op) xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op) As you can see, especially once warmed up, xxh3 is fastest. ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) Test for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done Results (ops/sec) for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done results-0 252118 # kNoChecksum results-1 251588 # kCRC32c results-2 251863 # kxxHash results-3 252016 # kxxHash64 results-4 252038 # kXXH3 Reviewed By: mrambacher Differential Revision: D31905249 Pulled By: pdillinger fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1 |
|
Peter Dillinger | ad5325a736 |
Experimental support for SST unique IDs (#8990)
Summary: * New public header unique_id.h and function GetUniqueIdFromTableProperties which computes a universally unique identifier based on table properties of table files from recent RocksDB versions. * Generation of DB session IDs is refactored so that they are guaranteed unique in the lifetime of a process running RocksDB. (SemiStructuredUniqueIdGen, new test included.) Along with file numbers, this enables SST unique IDs to be guaranteed unique among SSTs generated in a single process, and "better than random" between processes. See https://github.com/pdillinger/unique_id * In addition to public API producing 'external' unique IDs, there is a function for producing 'internal' unique IDs, with functions for converting between the two. In short, the external ID is "safe" for things people might do with it, and the internal ID enables more "power user" features for the future. Specifically, the external ID goes through a hashing layer so that any subset of bits in the external ID can be used as a hash of the full ID, while also preserving uniqueness guarantees in the first 128 bits (bijective both on first 128 bits and on full 192 bits). Intended follow-up: * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into the third 64-bit value of the unique ID.) * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990 Test Plan: Unit tests added, and checking of unique ids in stress test. NOTE in stress test we do not generate nearly enough files to thoroughly stress uniqueness, but the test trims off pieces of the ID to check for uniqueness so that we can infer (with some assumptions) stronger properties in the aggregate. Reviewed By: zhichao-cao, mrambacher Differential Revision: D31582865 Pulled By: pdillinger fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243 |
|
Jay Zhuang | b4326b5273 |
Fix gcc-11 compile error (#9043)
Summary: gcc11 added new static check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9043 Test Plan: Added CI for gcc11 build Reviewed By: zhichao-cao Differential Revision: D31716005 Pulled By: jay-zhuang fbshipit-source-id: 9f53be6f2f9e58e39b83359f6bbf66f945d57429 |
|
mrambacher | 13ae16c315 |
Cleanup includes in dbformat.h (#8930)
Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d |
|
Hui Xiao | 91b95cadee |
Account for dictionary-building buffer in global memory limit (#8428)
Summary: Context: Some data blocks are temporarily buffered in memory in BlockBasedTableBuilder for building compression dictionary used in data block compression. Currently this memory usage is not counted toward our global memory usage utilizing block cache capacity. To improve that, this PR charges that memory usage into the block cache to achieve better memory tracking and limiting. - Reserve memory in block cache for buffered data blocks that are used to build a compression dictionary - Release all the memory associated with buffering the data blocks mentioned above in EnterUnbuffered(), which is called when (a) buffer limit is exceeded after buffering OR (b) the block cache becomes full after reservation OR (c) BlockBasedTableBuilder calls Finish() Pull Request resolved: https://github.com/facebook/rocksdb/pull/8428 Test Plan: - Passing existing unit tests - Passing new unit tests Reviewed By: ajkr Differential Revision: D30755305 Pulled By: hx235 fbshipit-source-id: 6e66665020b775154a94c4c5e0f2adaeaff13981 |
|
Merlin Mao | f58d276764 |
Make TraceRecord and Replayer public (#8611)
Summary: New public interfaces: `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h". `Replayer`, available in `rocksdb/utilities/replayer.h`. User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file. Unit tests: - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes. - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611 Reviewed By: ajkr Differential Revision: D30266329 Pulled By: autopear fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8 |
|
Andrew Kryczka | ed8eb436db |
Move slow valgrind tests behind -DROCKSDB_FULL_VALGRIND_RUN (#8475)
Summary: Various tests had disabled valgrind due to it slowing down and timing out (as is the case right now) the CI runs. Where a test was disabled with no comment, I assumed slowness was the cause. For these tests that were slow under valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475 Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly Reviewed By: jay-zhuang Differential Revision: D29504843 Pulled By: ajkr fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7 |
|
mrambacher | b788e3f497 |
Increase range for GeneralTableTest.ApproximateOffsetOfCompressed (#8387)
Summary: Newer versions of Snappy (1.1 patch 8) were failing this test because the offsets were outside of the expected range. In some experiments: - On a RH machine with 1.1.0, the offset of "k04" and "xyy" were 3331 and 6665. - On an Ubuntu machine with 1.1.8, the same keys were at 3501 and 7004. - On a Mac with 1.1.8, the offsets were 3499 and 7001. AFAICT, the test environments are either using an older version of Snappy or no Snappy at all. This change increases the range to allow the tests to pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8387 Reviewed By: pdillinger Differential Revision: D29064475 Pulled By: mrambacher fbshipit-source-id: fac01927576765b8aff9f57e08a63a2ae210855f |
|
Zhichao Cao | f44e69c64a |
Use DbSessionId as cache key prefix when secondary cache is enabled (#8360)
Summary: Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360 Test Plan: add the test to lru_cache_test Reviewed By: pdillinger Differential Revision: D29006215 Pulled By: zhichao-cao fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814 |
|
Levi Tamasi | d83542ca83 |
Make it possible to apply only a subrange of table property collectors (#8298)
Summary: This patch does two things: 1) Introduces some aliases in order to eliminate/prevent long-winded type names w/r/t the internal table property collectors (see e.g. `std::vector<std::unique_ptr<IntTblPropCollectorFactory>>`). 2) Makes it possible to apply only a subrange of table property collectors during table building by turning `TableBuilderOptions::int_tbl_prop_collector_factories` from a pointer to a `vector` into a range (i.e. a pair of iterators). Rationale: I plan to introduce a BlobDB related table property collector, which should only be applied during table creation if blob storage is enabled at the moment (which can be changed dynamically). This change will make it possible to include/ exclude the BlobDB related collector as needed without having to introduce a second `vector` of collectors in `ColumnFamilyData` with pretty much the same contents. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8298 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D28430910 Pulled By: ltamasi fbshipit-source-id: a81d28f2c59495865300f43deb2257d2e6977c8e |
|
mrambacher | 8948dc8524 |
Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262)
Summary: The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form). Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes. Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262 Reviewed By: pdillinger Differential Revision: D28226540 Pulled By: mrambacher fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf |
|
Peter Dillinger | d2ca04e3ed |
Add more LSM info to FilterBuildingContext (#8246)
Summary: Add `num_levels`, `is_bottommost`, and table file creation `reason` to `FilterBuildingContext`, in anticipation of more powerful Bloom-like filter support. To support this, added `is_bottommost` and `reason` to `TableBuilderOptions`, which allowed removing `reason` parameter from `rocksdb::BuildTable`. I attempted to remove `skip_filters` from `TableBuilderOptions`, because filter construction decisions should arise from options, not one-off parameters. I could not completely remove it because the public API for SstFileWriter takes a `skip_filters` parameter, and translating this into an option change would mean awkwardly replacing the table_factory if it is BlockBasedTableFactory with new filter_policy=nullptr option. I marked this public skip_filters option as deprecated because of this oddity. (skip_filters on the read side probably makes sense.) At least `skip_filters` is now largely hidden for users of `TableBuilderOptions` and is no longer used for implementing the optimize_filters_for_hits option. Bringing the logic for that option closer to handling of FilterBuildingContext makes it more obvious that hese two are using the same notion of "bottommost." (Planned: configuration options for Bloom-like filters that generalize `optimize_filters_for_hits`) Recommended follow-up: Try to get away from "bottommost level" naming of things, which is inaccurate (see VersionStorageInfo::RangeMightExistAfterSortedRun), and move to "bottommost run" or just "bottommost." Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246 Test Plan: extended an existing unit test to exercise and check various filter building contexts. Also, existing tests for optimize_filters_for_hits validate some of the "bottommost" handling, which is now closely connected to FilterBuildingContext::is_bottommost through TableBuilderOptions::is_bottommost Reviewed By: mrambacher Differential Revision: D28099346 Pulled By: pdillinger fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a |
|
Peter Dillinger | 85becd94c1 |
Refactor: use TableBuilderOptions to reduce parameter lists (#8240)
Summary: Greatly reduced the not-quite-copy-paste giant parameter lists of rocksdb::NewTableBuilder, rocksdb::BuildTable, BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor. Moved weird separate parameter `uint32_t column_family_id` of TableFactory::NewTableBuilder into TableBuilderOptions. Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t target_file_size` is not randomly placed between uint64_t timestamps (was easy to mix up). Replaced a couple of fields of BlockBasedTableBuilder::Rep with a FilterBuildingContext. The motivation for this change is making it easier to pass along more data into new fields in FilterBuildingContext (follow-up PR). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240 Test Plan: ASAN make check Reviewed By: mrambacher Differential Revision: D28075891 Pulled By: pdillinger fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f |
|
mrambacher | 0ca6d6297f |
Rename variables in ImmutableCFOptions to avoid conflicts with ImmutableDBOptions (#8227)
Summary: Renaming ImmutableCFOptions::info_log and statistics to logger and stats. This is stage 2 in creating an ImmutableOptions class. It is necessary because the names match those in ImmutableOptions and have different types. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8227 Reviewed By: jay-zhuang Differential Revision: D28000967 Pulled By: mrambacher fbshipit-source-id: 3bf2aa04e8f1e8724d825b7deacf41080c14420b |
|
Yanqin Jin | 09528f9fa1 |
Fix a bug for SeekForPrev with partitioned filter and prefix (#8137)
Summary: According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last key in the previous partition" so that SeekForPrev() in prefix mode can return correct result. The prefix of the last key in the previous partition does not necessarily have the same prefix as the first key in the current partition. Regardless of the first key in current partition, the prefix of the last key in the previous partition should be added. The existing code, however, does not follow this. Furthermore, there is another issue: when finishing current filter partition, `FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding the bloom of the prefix of the last key of the previous partition. Prefix extractor is FixedLength.2. ``` [ filter part 1 ] [ filter part 2 ] abc d ``` When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc" but smaller than "d". If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build incorrect iterator tree in non-total-order mode. Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion error when checking on an arbitrary key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137 Test Plan: ``` make check ``` Without this fix, the following command will fail pretty soon. ``` ./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \ --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \ --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \ --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \ --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \ --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \ --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \ --db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \ --enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \ --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \ --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \ --log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \ --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \ --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \ --max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \ --nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \ --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --read_fault_one_in=0 --read_only=0 \ --readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \ --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \ --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \ --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \ --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \ --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \ --use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \ --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \ --write_dbid_to_manifest=1 --writepercent=35 ``` Reviewed By: pdillinger Differential Revision: D27553054 Pulled By: riversand963 fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98 |
|
Andrew Kryczka | c20a7cd6c7 |
Apply `sample_for_compression` to all block-based tables (#8105)
Summary: Previously it only applied to block-based tables generated by flush. This restriction was undocumented and blocked a new use case. Now compression sampling applies to all block-based tables we generate when it is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105 Test Plan: new unit test Reviewed By: riversand963 Differential Revision: D27317275 Pulled By: ajkr fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0 |
|
mrambacher | 3dff28cf9b |
Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67 |
|
Peter Dillinger | 0028e3398b |
Make format_version=5 new default (#8017)
Summary: Haven't seen any production issues with new Bloom filter and it's now > 1 year old (added in 6.6.0). Updated check_format_compatible.sh and HISTORY.md Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017 Test Plan: tests updated (or prior bugs fixed) Reviewed By: ajkr Differential Revision: D26762197 Pulled By: pdillinger fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2 |
|
Andrew Kryczka | 78ee8564ad |
Integrity protection for live updates to WriteBatch (#7748)
Summary: This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.). The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer. When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748 Test Plan: - an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught - add to stress/crash test to verify it works in variety of configs/operations without intentional corruption - [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc. Reviewed By: pdillinger Differential Revision: D25754492 Pulled By: ajkr fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866 |
|
mrambacher | c1a65a4de4 |
Make StringEnv, StringSink, StringSource use FS classes (#7786)
Summary: Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes. Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786 Reviewed By: jay-zhuang Differential Revision: D25761460 Pulled By: anand1976 fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f |