mirror of https://github.com/facebook/rocksdb.git
1554 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Peter Dillinger | b515a5db3f |
Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470)
Summary: ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want. So this adds ScopedArenaPtr<T> to replace those uses. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470 Test Plan: CI (including ASAN/UBSAN) Reviewed By: jowlyzhang Differential Revision: D55254362 Pulled By: pdillinger fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18 |
|
anand76 | 3b736c4aa3 |
Fix heap use after free error on retry after checksum mismatch (#12464)
Summary: Fix the heap use after free bug caused by freeing the file system IO buffer in `BlockFetcher::ReadBlock()` instead of the caller. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12464 Test Plan: Update the `DBIOCorruptionTest` tests Reviewed By: akankshamahajan15 Differential Revision: D55206920 Pulled By: anand1976 fbshipit-source-id: fd6b608a61cd229b20c1e5f348ff3cc92328de0f |
|
Andrew Kryczka | bf98dcf9a8 |
Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462)
Summary: The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file. There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462 Test Plan: A repro command ``` db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp ./db_stress \ --clear_column_family_one_in=0 \ --test_batches_snapshots=0 \ --write_fault_one_in=0 \ --use_put_entity_one_in=0 \ --prefixpercent=0 \ --read_fault_one_in=0 \ --readpercent=0 \ --reopen=0 \ --set_options_one_in=10000 \ --delpercent=0 \ --delrangepercent=0 \ --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 \ --open_write_fault_one_in=0 \ --destroy_db_initially=0 \ --ingest_external_file_one_in=0 \ --iterpercent=0 \ --nooverwritepercent=0 \ --db=$db \ --enable_blob_files=1 \ --expected_values_dir=$exp \ --max_background_compactions=20 \ --max_bytes_for_level_base=2097152 \ --max_key=100000 \ --min_blob_size=0 \ --open_files=-1 \ --ops_per_thread=100000000 \ --prefix_size=-1 \ --target_file_size_base=524288 \ --use_merge=1 \ --value_size_mult=32 \ --write_buffer_size=524288 \ --writepercent=100 ``` It used to fail like: ``` ... frame https://github.com/facebook/rocksdb/issues/9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3 frame https://github.com/facebook/rocksdb/issues/10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, rocksdb::PinnedIteratorsManager*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33 frame https://github.com/facebook/rocksdb/issues/11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492 frame https://github.com/facebook/rocksdb/issues/12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408 frame https://github.com/facebook/rocksdb/issues/13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109 frame https://github.com/facebook/rocksdb/issues/14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656 frame https://github.com/facebook/rocksdb/issues/15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607 frame https://github.com/facebook/rocksdb/issues/16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505 frame https://github.com/facebook/rocksdb/issues/17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084 frame https://github.com/facebook/rocksdb/issues/18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807 frame https://github.com/facebook/rocksdb/issues/19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312 frame https://github.com/facebook/rocksdb/issues/20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157 frame https://github.com/facebook/rocksdb/issues/21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544 frame https://github.com/facebook/rocksdb/issues/22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72 frame https://github.com/facebook/rocksdb/issues/23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317 frame https://github.com/facebook/rocksdb/issues/24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148 ... ``` Reviewed By: ltamasi Differential Revision: D55157795 Pulled By: ajkr fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764 |
|
anand76 | 98d8a85624 |
New PerfContext counters for block cache bytes read (#12459)
Summary: Add PerfContext counters for measuring the cumulative size of blocks found in the block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12459 Reviewed By: ajkr Differential Revision: D55170694 Pulled By: anand1976 fbshipit-source-id: 8cbba76eece116cefce7f00e2fc9d74757661d25 |
|
Yu Zhang | 13e1c32a18 |
Follow ups for TimedPut and write time property (#12455)
Summary: This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including: 1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry. 2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries. 3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob currently copy the DB's seqno to time mapping while holding db mutex and only copies the part of interest, a.k.a, the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to preferred seqno as much as possible. Test plans: Added unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455 Reviewed By: pdillinger Differential Revision: D55165422 Pulled By: jowlyzhang fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47 |
|
Kshitij Wadhwa | 4ce1dc930c |
don't run ZSTD_TrainDictionary in BlockBasedTableBuilder if there isn't compression needed (#12453)
Summary: fixes https://github.com/facebook/rocksdb/issues/12409 ### Issue ZSTD_TrainDictionary [[link]( |
|
anand76 | 4868c10b44 |
Retry block reads on checksum mismatch (#12427)
Summary: On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`. Tests: Add new unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427 Reviewed By: ajkr Differential Revision: D55025941 Pulled By: anand1976 fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e |
|
Yu Zhang | f2546b6623 |
Support returning write unix time in iterator property (#12428)
Summary: This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is: 1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned. 2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown. Needed follow up: 1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it. 2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion. 3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428 Test Plan: Added unit test Reviewed By: pdillinger Differential Revision: D54967067 Pulled By: jowlyzhang fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a |
|
Andrew Kryczka | 3f5bd46a07 |
Add `ContinueCallback` to `GetMergeOperands()` (#12438)
Summary: The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because: - Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge - The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438 Reviewed By: hx235 Differential Revision: D54914669 Pulled By: ajkr fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a |
|
Yu Zhang | 1104eaa35e |
Add initial support for TimedPut API (#12419)
Summary: This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API. The life cycle of such an entry on the write/flush/compaction paths are: 1) It is initially added to memtable as: `<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}` 2) When it's flushed to L0 sst files, it's converted to: `<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}` when we have easy access to the seqno to time mapping. 3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become: `<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap. On the read path: A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry, the unix_write_time/preferred seqno part packed in value is completely ignored. Needed follow ups: 1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable. 2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types. 3) Stress test coverage for the feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D54920296 Pulled By: jowlyzhang fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e |
|
Changyu Bi | 36c1b0aded |
Allow SstFileReader to verify number of entries in SST files (#12418)
Summary: Add `SstFileReader::VerifyNumEntries()` for this purpose. I added the same functionality to `sst_dump` in https://github.com/facebook/rocksdb/issues/12322. Since sst_file_reader.h is exposed to users while sst_dump.h is not, it seems more appropriate to add SST files related APIs here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12418 Test Plan: `./sst_file_reader_test --gtest_filter="*VerifyNumEntries*"` Reviewed By: jowlyzhang Differential Revision: D54764271 Pulled By: cbi42 fbshipit-source-id: 22ebfe04bbb0b152762cee13d4210b147b36d3e9 |
|
Peter Dillinger | a53ed91691 |
Fix/improve temperature handling for file ingestion (#12402)
Summary: Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388 In terms of public API: * Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre. * Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general). * Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature. * Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost" In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed. Eventual suggested follow-up: * Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare). * More temperature handling for CreateColumnFamilyWithImport and Checkpoints. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402 Test Plan: Deeply revamped ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer. Used macros instead of functions for better tracing to critical source location on test failures. Some enhancements to FileTemperatureTestFS in the process of developing the revamped test. Reviewed By: jowlyzhang Differential Revision: D54442794 Pulled By: pdillinger fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995 |
|
yuzhangyu@fb.com | 1cfdece85d |
Run internal cpp modernizer on RocksDB repo (#12398)
Summary: When internal cpp modernizer attempts to format rocksdb code, it will replace macro `ROCKSDB_NAMESPACE` with its default definition `rocksdb` when collapsing nested namespace. We filed a feedback for the tool T180254030 and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested us to run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains: Running `xplat/scripts/codemod_service/cpp_modernizer.sh` in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that has a non meta copyright comment) without swapping out the namespace macro `ROCKSDB_NAMESPACE` Followed by RocksDB's own `make format` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398 Test Plan: Auto tests Reviewed By: hx235 Differential Revision: D54382532 Pulled By: jowlyzhang fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa |
|
Richard Barnes | d7b8756976 |
Remove extra semi colon from internal_repo_rocksdb/repo/db/table_cache_sync_and_async.h
Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D54362208 fbshipit-source-id: a47acd4c794c899fccb65285b116b50d9566ea12 |
|
Jay Huh | c00c16855d |
Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395)
Summary: In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: https://github.com/facebook/rocksdb/issues/11925 #11943, and https://github.com/facebook/rocksdb/issues/11977. To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395 Test Plan: # Summary In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR #12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: #11925 #11943, and #11977. To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`. # Test Plan There should be no behavior changes. Existing tests and CI for the correctness tests. **Test for Perf Regression** Build ``` $> make -j64 release ``` Setup ``` $> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: pdillinger Differential Revision: D54332713 Pulled By: jaykorean fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5 |
|
Yu Zhang | 2940acac00 |
Persist table options use_delta_encoding in options file (#11987)
Summary: This option is used for encoding keys in block based table files. It has been having a default true value since its introduction. Users may not notice this option is not persisted in options file unless they are explicitly setting it to false. If the users expect `Iterator::GetProperty("rocksdb.iterator.is-key-pinned")` to return 1 when setting `ReadOptions.pin_data = true`, they should have noticed loading options file won't work and have work around for this by always explicitly set this option to false for opening DB. This change won't impact those users except that now they can remove their work around. If the users are not relying on key pinning behavior at all and as a result didn't notice the option is not persisted, this change shouldn't have any visible behavior impact either. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11987 Reviewed By: hx235 Differential Revision: D54093238 Pulled By: jowlyzhang fbshipit-source-id: 256a3348c44cf91349034d1f6e242c437b32b9a5 |
|
anand76 | d227276147 |
Deprecate some variants of Get and MultiGet (#12327)
Summary: A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327 Reviewed By: pdillinger Differential Revision: D53828151 Pulled By: anand1976 fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050 |
|
Peter Dillinger | bfd00bba9c |
Use format_version=6 by default (#12352)
Summary: It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352 Test Plan: tests updated, including format capatibility test table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id. I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability. Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow. Reviewed By: anand1976 Differential Revision: D53786884 Pulled By: pdillinger fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51 |
|
Yu Zhang | f405e55cfa |
Add support in SstFileWriter to not persist user defined timestamps (#12348)
Summary: This PR adds support in `SstFileWriter` to create SST files without persisting timestamps when the column family has enabled UDTs in Memtable only feature. The sst files created from flush and compaction do not contain timestamps, we want to make the sst files created by `SstFileWriter` to follow the same pattern and not persist timestamps. This is to prepare for ingesting external SST files for this type of column family. There are timestamp-aware APIs and non timestamp-aware APIs in `SstFileWriter`. The former are exclusively used for when the column family's comparator is timestamp-aware, a.k.a `Comparator::timestamp_size() > 0`, while the latter are exclusively used for the column family's comparator is non timestamp-aware, a.k.a `Comparator::timestamp_size() == 0`. There are sanity checks to make sure these APIs are correctly used. In this PR, the APIs usage continue with above enforcement, where even though timestamps are not eventually persisted, users are still asked to use only the timestamp-aware APIs. But because data points will logically all have minimum timestamps, we don't allow multiple versions of the same user key (without timestamp) to be added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12348 Test Plan: Added unit tests Manual inspection of generated sst files with `sst_dump` Reviewed By: ltamasi Differential Revision: D53732667 Pulled By: jowlyzhang fbshipit-source-id: e43beba0d3a1736b94ee5c617163a6280efd65b7 |
|
Peter Dillinger | 54cb9c77d9 |
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary: The following are risks associated with pointer-to-pointer reinterpret_cast: * Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do. * Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally. I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement: * Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic. * Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance. With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain. A couple of related interventions included here: * Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle. * Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse). Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work. I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308 Test Plan: existing tests, CI Reviewed By: ltamasi Differential Revision: D53204947 Pulled By: pdillinger fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2 |
|
Peter Dillinger | 1d6dbfb8b7 |
Rename IntTblPropCollector -> InternalTblPropColl (#12320)
Summary: I've always found this name difficult to read, because it sounds like it's for collecting int(eger) table properties. I'm fixing this now to set up for a change that I have stubbed out in the public API (table_properties.h): a new adapter function `TablePropertiesCollector::AsInternal()` that allows RocksDB-provided TablePropertiesCollectors (such as CompactOnDeletionCollector) to implement the easier-to-upgrade internal interface while still (superficially) implementing the public interface. In addition to added flexibility, this should be a performance improvement as the adapter class UserKeyTablePropertiesCollector can be avoided for such cases where a RocksDB-provided collector is used (AsInternal() returns non-nullptr). table_properties.h is the only file with changes that aren't simple find-replace renaming. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12320 Test Plan: existing tests, CI Reviewed By: ajkr Differential Revision: D53336945 Pulled By: pdillinger fbshipit-source-id: 02535bcb30bbfb00e29e8478af62e5dad50a63b8 |
|
Changyu Bi | c6b1f6d182 |
Augment sst_dump tool to verify num_entries in table property (#12322)
Summary: sst_dump --command=check can now compare number of keys in a file with num_entries in table property and reports corruption is there is a mismatch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12322 Test Plan: - new unit test for API `SstFileDumper::ReadSequential` - ran sst_dump on a good and a bad file: ``` sst_dump --file=./32316112.sst options.env is 0x7f68bfcb5000 Process ./32316112.sst Sst file format: block-based from [] to [] sst_dump --file=./32316115.sst options.env is 0x7f6d0d2b5000 Process ./32316115.sst Sst file format: block-based from [] to [] ./32316115.sst: Corruption: Table property has num_entries = 6050408 but scanning the table returns 6050406 records. ``` Reviewed By: jowlyzhang Differential Revision: D53320481 Pulled By: cbi42 fbshipit-source-id: d84c996346a9575a5a2ea5f5fb09a9d3ee672cd6 |
|
Peter Dillinger | 76c834e441 |
Remove 'virtual' when implied by 'override' (#12319)
Summary: ... to follow modern C++ style / idioms. Used this hack: ``` for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319 Test Plan: existing tests, CI Reviewed By: jaykorean Differential Revision: D53275303 Pulled By: pdillinger fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377 |
|
Peter Dillinger | 61ed0de600 |
Add more detail to some statuses (#12307)
Summary: and also fix comment/label on some MacOS CI jobs. Motivated by a crash test failure missing a definitive indicator of the genesis of the status: ``` file ingestion error: Operation failed. Try again.: ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12307 Test Plan: just cosmetic changes. These statuses should not arise frequently enough to be a performance issue (copying messages). Reviewed By: jaykorean Differential Revision: D53199529 Pulled By: pdillinger fbshipit-source-id: ad83daaa5d80f75c9f81158e90fb6d9ecca33fe3 |
|
akankshamahajan | b9cb7b9644 |
Provide support for FSBuffer for point lookups (#12266)
Summary: Provide support for FSBuffer for point lookups It also add support for compaction and scan reads that goes through BlockFetcher when readahead/prefetching is not enabled. Some of the compaction/Scan reads goes through FilePrefetchBuffer and some through BlockFetcher. This PR add support to use underlying file system scratch buffer for reads that go through BlockFetcher as for FilePrefetch reads, design is complicated to support this feature. Design - In order to use underlying FileSystem provided scratch for Reads, it uses MultiRead with 1 request instead of Read API which required API change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12266 Test Plan: Stress test using underlying file system scratch buffer internally. Reviewed By: anand1976 Differential Revision: D53019089 Pulled By: akankshamahajan15 fbshipit-source-id: 4fe3d090d77363320e4b67186fd4d51c005c0961 |
|
Yu Zhang | 071a146fa0 |
Add support for range deletion when user timestamps are not persisted (#12254)
Summary:
For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes:
1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block.
2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`.
3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end.
|
|
Peter Dillinger | 4e60663b31 |
Remove unnecessary, confusing 'extern' (#12300)
Summary: In C++, `extern` is redundant in a number of cases: * "Global" function declarations and definitions * "Global" variable definitions when already declared `extern` For consistency and simplicity, I've removed these in code that *we own*. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h) as standard best practice would prescribe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300 Test Plan: no functional changes, CI Reviewed By: ajkr Differential Revision: D53148629 Pulled By: pdillinger fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886 |
|
Changyu Bi | 11d4ac87ea |
Add some missing status checks in SstFileWriter (#12281)
Summary: Add some missing status checks and documentation for SstFileWriter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12281 Test Plan: existing unit tests. Reviewed By: jaykorean, ajkr Differential Revision: D53064036 Pulled By: cbi42 fbshipit-source-id: 686d90e24c18c8a4ee81668663a7780a69a45d4c |
|
Peter Dillinger | cb08a682d4 |
Fix/cleanup SeqnoToTimeMapping (#12253)
Summary: The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios. * Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc. * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.) * Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior. * Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.) * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries. * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.` * Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior. * Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file. Intended follow-up (me or others): * Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense. * There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL. * The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253 Test Plan: unit tests updated Reviewed By: jowlyzhang Differential Revision: D52913733 Pulled By: pdillinger fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593 |
|
akankshamahajan | cad76a2e1e |
Fix bug in auto_readahead_size that returned wrong key (#12229)
Summary: IndexType::kBinarySearchWithFirstKey + BlockCacheLookupForReadAheadSize enabled => FindNextUserEntryInternal assertion fails or iterator lands at a wrong key because BlockCacheLookupForReadAheadSize moves the index_iter_ and in internal_wrapper.h, result_.key didn't update and pointed to wrong key. Also ikey_ was also pointing to iter_.key() instead of copying the key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12229 Test Plan: ``` rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 mkdir /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 ./db_stress -threads=1 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=0 --backup_one_in=0 --batch_protection_bytes_per_key=0 --blob_cache_size=0 --blob_compaction_readahead_size=0 --blob_compression_type=lz4 --blob_file_size=0 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0 --blob_garbage_collection_force_threshold=0 --block_protection_bytes_per_key=0 --block_size=2048 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 --db_write_buffer_size=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=3 --ingest_external_file_one_in=10 --initial_auto_readahead_size=0 --iterpercent=55 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=8 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=10000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=1 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --snapshot_hold_ops=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=1 --verify_sst_unique_id_in_manifest=0 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=0 > repro.out Verification failed. Expected state has key 0000000000000077000000000000004178, iterator is at key 0000000000000077000000000000008A78 Column family: default, op_logs: S 0000000000000077000000000000003D7878787878 NNNN No writes or ops? Verification failed :( ``` Reviewed By: ajkr Differential Revision: D52710655 Pulled By: akankshamahajan15 fbshipit-source-id: 9d2e684e190fb0832bdce3337bce1c6548cd054d |
|
Chdy | 21d5a8f54f |
Fix a bug in sst_dump when parsing PlainTable (#12223)
Summary: ### Summary: The sst_dump tool occur IO Error when reading data in PlainTable, as shown in the follow ```bash ❯ ./sst_dump --file=/tmp/write_example --command=scan --show_properties --verify_checksum options.env is 0x60000282dc00 Process /tmp/write_example/001630.sst Sst file format: plain table /tmp/filepicker_example/001630.sst: IO error: While pread offset 0 len 758: /tmp/filepicker_example/001630.sst: Bad address Process /tmp/filepicker_example/001624.sst ``` #### Reason The root cause is that `fopts.use_mmap_reads` is false, `NewRandomAccessFile` will produce an `PosixRandomAccessFile` file. but `soptions_.use_mmap_reads` is true, This will result in unexpected calls in the `MmapDataIfNeeded` function. ```c++ Status SstFileDumper::GetTableReader(const std::string& file_path) { ... if (s.ok()) { if (magic_number == kPlainTableMagicNumber || magic_number == kLegacyPlainTableMagicNumber || magic_number == kCuckooTableMagicNumber) { soptions_.use_mmap_reads = true; ... // WARN: fopts.use_mmap_reads is false fs->NewRandomAccessFile(file_path, fopts, &file, nullptr); file_.reset(new RandomAccessFileReader(std::move(file), file_path)); } ... } if (s.ok()) { // soptions_.use_mmap_reads is true s = NewTableReader(ioptions_, soptions_, internal_comparator_, file_size, &table_reader_); } return s; } ``` The following read logic was executed on a `PosixRandomAccessFile` file, Eventually, `PosixRandomAccessFile::Read` will be called with a `nullptr` `scratch` ```c++ Status PlainTableReader::MmapDataIfNeeded() { if (file_info_.is_mmap_mode) { // Get mmapped memory. // Executing the following logic on the PosixRandomAccessFile file is incorrect return file_info_.file->Read( IOOptions(), 0, static_cast<size_t>(file_size_), &file_info_.file_data, nullptr, nullptr, Env::IO_TOTAL /* rate_limiter_priority */); } return Status::OK(); } ``` #### Fix: When parsing PlainTable, set the variable `fopts.use_mmap_reads` equal `soptions_.use_mmap_reads`, When the `soptions_.use_mmap_reads` is true, `NewRandomAccessFile` will produce an `PosixMmapReadableFile` file. This will work correctly in the `MmapDataIfNeeded` function ``` ❯ ./sst_dump --file=/tmp/write_example --command=scan --show_properties --verify_checksum options.env is 0x6000009323e0 Process /tmp/write_example/001630.sst Sst file format: plain table from [] to [] 'keys496' seq:0, type:1 => values1496 'keys497' seq:0, type:1 => values1497 'keys498' seq:0, type:1 => values1498 Table Properties: ------------------------------ # data blocks: 1 # entries: 3 # deletions: 0 # merge operands: 0 # range deletions: 0 raw key size: 45 raw average key size: 15.000000 raw value size: 42 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12223 Reviewed By: cbi42 Differential Revision: D52706238 Pulled By: ajkr fbshipit-source-id: 2f9f518ec81d1cbde00bd65ab6bd304796836c0a |
|
Peter Dillinger | 5da900f28a |
Fix a case of ignored corruption in creating backups (#12200)
Summary: We often need to read the table properties of an SST file when taking a backup. However, we currently do not check checksums for this step, and even with that enabled, we ignore failures. This change ensures we fail creating a backup if corruption is detected in that step of reading table properties. To get this working properly (with existing unit tests), we also add some temperature handling logic like already exists in BackupEngineImpl::ReadFileAndComputeChecksum and elsewhere in BackupEngine. Also, SstFileDumper needed a fix to its error handling logic. This was originally intended to help diagnose some mysterious failures (apparent corruptions) seen in taking backups in the crash test, though that is now fixed in https://github.com/facebook/rocksdb/pull/12206 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12200 Test Plan: unit test added that corrupts table properties, along with existing tests Reviewed By: ajkr Differential Revision: D52520674 Pulled By: pdillinger fbshipit-source-id: 032cfc0791428f3b8147d34c7d424ab128e28f42 |
|
akankshamahajan | 5cb2d09d47 |
Refactor FilePrefetchBuffer code (#12097)
Summary: Summary - Refactor FilePrefetchBuffer code - Implementation: FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_. If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data. If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer. If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed. If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to the caller. - Merged Sync and Async code flow into one in FilePrefetchBuffer. Test Plan - - Crash test passed - Unit tests - Pending - Benchmarks Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097 Reviewed By: ajkr Differential Revision: D51759552 Pulled By: akankshamahajan15 fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5 |
|
Peter Dillinger | ed46981bea |
Fix and defend against FilePrefetchBuffer combined with mmap reads (#12206)
Summary: FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200. In this change, * Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption. * Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior. * Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer * Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues. Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206 Test Plan: * Manually verified that the new check aids in debugging. * Unit test added, that fails if either fix is missed. * Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200 Reviewed By: akankshamahajan15 Differential Revision: D52551701 Pulled By: pdillinger fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc |
|
git-hulk | f11a0237b6 |
sst_dump: display metaindex_handle and the index_handle's offset and size in footer information (#12204)
Summary: Before applying this PR, the footer details: ``` Footer Details: -------------------------------------- metaindex handle: B0E499405C index handle: 8AC49940CD17 table_magic_number: 9863518390377041911 format version: 5 ``` and after ``` Footer Details: -------------------------------------- metaindex handle: B0E499405C offset: 134640176 size: 92 index handle: 8AC49940CD17 offset: 134636042 size: 3021 table_magic_number: 9863518390377041911 format version: 5 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12204 Reviewed By: cbi42 Differential Revision: D52547832 Pulled By: ajkr fbshipit-source-id: 5ff58ed347f9caf919bbdc6b242e3306d2525653 |
|
Hui Xiao | 06e593376c |
Group SST write in flush, compaction and db open with new stats (#11910)
Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff |
|
Jason Volk | 83e38c0a58 |
Fix SystemClock not passed from environment to PERF_CPU_TIMER_GUARD. (#12180)
Summary: The hardcoded nullptr argument for SystemClock to PERF_CPU_TIMER_GUARD ignored any SystemClock instance provided by the env; this was probably an oversight. In practice, the defaulting SystemClock could lead to excessive `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` syscalls if `report_bg_io_stats=true` which cannot be mitigated by the embedder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12180 Reviewed By: hx235 Differential Revision: D52421750 Pulled By: ajkr fbshipit-source-id: 92f8a93cebe9f8030ea5f6c3bf35398078e6bdfe |
|
akankshamahajan | e7c6259447 |
Make auto_readahead_size default true (#12080)
Summary: Make auto_readahead_size option default true Pull Request resolved: https://github.com/facebook/rocksdb/pull/12080 Test Plan: benchmarks and exisiting tests Reviewed By: anand1976 Differential Revision: D52152132 Pulled By: akankshamahajan15 fbshipit-source-id: f1515563564e77df457dff2e865e4ede8c3ddf44 |
|
akankshamahajan | d926593df5 |
Fix stress tests failure for auto_readahead_size (#12131)
Summary: When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter so that - BlockBasedTableIterator can point index_iter_ to the right block. - disable readahead_cache_lookup. However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case. So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131 Test Plan: - Tested on db_stress command that was failing - `./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35` - make crash_test -j32 Reviewed By: anand1976 Differential Revision: D51986326 Pulled By: akankshamahajan15 fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9 |
|
Peter Dillinger | c96d9a0fbb |
Allow TablePropertiesCollectorFactory to return null collector (#12129)
Summary: As part of building another feature, I wanted this: * Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases. * Polished, clarified some related API comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129 Test Plan: unit test added Reviewed By: ltamasi Differential Revision: D51966667 Pulled By: pdillinger fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770 |
|
akankshamahajan | c77b50a4fd |
Add AsyncIO support for tuning readahead_size by block cache lookup (#11936)
Summary: Add support for tuning of readahead_size by block cache lookup for async_io. **Design/ Implementation** - **BlockBasedTableIterator.cc** - `BlockCacheLookupForReadAheadSize` callback API lookups in the block cache and tries to reduce the start and end offset passed. This function looks into the block cache for the blocks between `start_offset` and `end_offset` and add all the handles in the queue. It then iterates from the end in the handles to find first miss block and update the end offset to that block. It also iterates from the start and find first miss block and update the start offset to that block. ``` _read_curr_block_ argument : True if this call was due to miss in the cache and caller wants to read that block synchronously. False if current call is to prefetch additional data in extra buffers (due to ReadAsync call in FilePrefetchBuffer) ``` In case there is no data to be read in that callback (because of upper_bound or all blocks are in cache), it updates start and end offset to be equal and that `FilePrefetchBuffer` interprets that as 0 length to be read. **FilePrefetchBuffer.cc** - FilePrefetchBuffer calls the callback - `ReadAheadSizeTuning` and pass the start and end offset to that callback to get updated start and end offset to read based on cache hits/misses. 1. In case of Read calls (when offset passed to FilePrefetchBuffer is on cache miss and that data needs to be read), _read_curr_block_ is passed true. 2. In case of ReadAsync calls, when buffer is all consumed and can go for additional prefetching, the start offset passed is the initial end offset of prev buffer (without any updated offset based on cache hit/miss). Foreg. if following are the data blocks with cache hit/miss and start offset and Read API found miss on DB1 and based on readahead_size (50) it passes end offset to be 50. [DB1 - miss- 0 ] [DB2 - hit -10] [DB3 - miss -20] [DB4 - miss-30] [DB5 - hit-40] [DB6 - hit-50] [DB7 - miss-60] [DB8 - miss - 70] [DB9 - hit - 80] [DB6 - hit 90] - For Read call - updated start offset remains 0 but end offset updates to DB4, as DB5 is in cache. - Read calls saves initial end offset 50 as that was meant to be prefetched. - Now for next ReadAsync call - the start offset will be 50 (previous buffer initial end offset) and based on readahead_size, end offset will be 100 - On callback, because of cache hits - callback will update the start offset to 60 and end offset to 80 to read only 2 data blocks (DB7 and DB8). - And for that ReadAsync call - initial end offset will be set to 100 which will again used by next ReadAsync call as start offset. - `initial_end_offset_` in `BufferInfo` is used to save the initial end offset of that buffer. - If let's say DB5 and DB6 overlaps in 2 buffers (because of alignment), `prev_buf_end_offset` is passed to make sure already prefetched data is not prefetched again in second buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11936 Test Plan: - Ran crash_test several times. - New unit tests added. Reviewed By: anand1976 Differential Revision: D50906217 Pulled By: akankshamahajan15 fbshipit-source-id: 0d75d3c98274e98aa34901b201b8fb05232139cf |
|
Levi Tamasi | 0ebe1614cb |
Eliminate some code duplication in MergeHelper (#12121)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12121 The patch eliminates some code duplication by unifying the two sets of `MergeHelper::TimedFullMerge` overloads using variadic templates. It also brings the order of parameters into sync when it comes to the various `TimedFullMerge*` methods. Reviewed By: jaykorean Differential Revision: D51862483 fbshipit-source-id: e3f832a6ff89ba34591451655cf11025d0a0d018 |
|
Andrew Kryczka | 06dc32ef25 |
internal_repo_rocksdb (435146444452818992) (#12115)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12115 Reviewed By: jowlyzhang Differential Revision: D51745742 Pulled By: ajkr fbshipit-source-id: 67000d07783b413924798dd9c1751da27e119d53 |
|
Andrew Kryczka | be3bc36811 |
internal_repo_rocksdb (-8794174668376270091) (#12114)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12114 Reviewed By: jowlyzhang Differential Revision: D51745613 Pulled By: ajkr fbshipit-source-id: 27ca4bda275cab057d3a3ec99f0f92cdb9be5177 |
|
raffertyyu | a7779458bd |
sst_dump support cuckoo table (#12098)
Summary: https://github.com/facebook/rocksdb/issues/11446 Support Cuckoo Table format in sst_dump. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12098 Reviewed By: jowlyzhang Differential Revision: D51594094 Pulled By: ajkr fbshipit-source-id: ba9092818bc3cc0f207b000391aa21d564570df2 |
|
cz2h | 324453e579 |
Fix rowcache get returning incorrect timestamp (#11952)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7930.
When there is a timestamp associated with stored records, get from row cache will return the timestamp provided in query instead of the timestamp associated with the stored record.
## Cause of error:
Currently a row_handle is fetched using row_cache_key(contains a timestamp provided by user query) and the row_handle itself does not persist timestamp associated with the object. Hence the [GetContext::SaveValue()
](
|
|
Yu Zhang | 84a54e1e28 |
Fix some bugs in index builder and reader for the UDT in memtable only feature (#12062)
Summary: These bugs surfaced while I was trying to add the stress test for the feature: Bug 1) On the index building path: the optimization to use user key instead of internal key as separator needed a bit tweak for when user defined timestamps can be removed. Because even though the user key look different now and eligible to be used as separator, when their user-defined timestamps are removed, they could be equal and that invariant no longer stands. Bug 2) On the index reading path: one path that builds the second level index iterator for `PartitionedIndexReader` are not passing the corresponding `user_defined_timestamps_persisted` flag. As a result, the default `true` value be used leading to no minimum timestamps padded when they should be. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12062 Test Plan: For bug 1): added separate unit test `BlockBasedTableReaderTest::Get` to exercise the `Get` API. It's a different code path from `MultiGet` so worth having its own test. Also in order to cover the bug, the test is modified to generate key values with the same user provided key, different timestamps and different sequence numbers. The test reads back different versions of the same user provided key. `MultiGet` takes one `ReadOptions` with one read timestamp so we cannot test retrieving different versions of the same key easily. For bug 2): simply added options `BlockBasedTableOptions.metadata_cache_options.partition_pinning = PinningTier::kAll` to exercise all the index iterator creating paths. Reviewed By: ltamasi Differential Revision: D51508280 Pulled By: jowlyzhang fbshipit-source-id: 8b174d3d70373c0599266ac1f467f2bd4d7ea6e5 |
|
Hui Xiao | f337533b6f |
Ensure and clarify how RocksDB calls TablePropertiesCollector's functions (#12053)
Summary: **Context/Summary:** It's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internal by the word "finish". However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand. This PR avoids that by moving that populating to where we first call `Finish()` (i.e, `NotifyCollectTableCollectorsOnFinish`) Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053 Test Plan: - Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` only called once. - Existing test particularly `db_properties_test, table_properties_collector_test` verify the functionality `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` are not broken by this change. Reviewed By: ajkr Differential Revision: D51095434 Pulled By: hx235 fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973 |
|
anand76 | e81393e81e |
Add some stats to observe the usefulness of scan prefetching (#11981)
Summary: Add stats for better observability of scan prefetching. Its only implemented for sync scan right now. These stats can help inform future improvements in scan prefetching. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11981 Test Plan: Add a new unit test Reviewed By: akankshamahajan15 Differential Revision: D50516505 Pulled By: anand1976 fbshipit-source-id: cb1cc6cf02df8295930a49c62b11870020df3f97 |
|
Changyu Bi | d5bc30befa |
Enforce status checking after Valid() returns false for IteratorWrapper (#11975)
Summary: ... when compiled with ASSERT_STATUS_CHECKED = 1. The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782. Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975 Test Plan: * `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check` * I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task. Reviewed By: ajkr Differential Revision: D50383790 Pulled By: cbi42 fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce |