mirror of https://github.com/facebook/rocksdb.git
12608 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
anand76 | 63a105a481 |
Enable recycle_log_file_num option for point in time recovery (#12403)
Summary: This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL. This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403 Test Plan: 1. Add new unit tests to verify behavior upon corruption 2. Re-enable disabled tests for verifying recycling behavior Reviewed By: ajkr Differential Revision: D54544824 Pulled By: anand1976 fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a |
|
anand76 | 98d8a85624 |
New PerfContext counters for block cache bytes read (#12459)
Summary: Add PerfContext counters for measuring the cumulative size of blocks found in the block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12459 Reviewed By: ajkr Differential Revision: D55170694 Pulled By: anand1976 fbshipit-source-id: 8cbba76eece116cefce7f00e2fc9d74757661d25 |
|
Yu Zhang | 13e1c32a18 |
Follow ups for TimedPut and write time property (#12455)
Summary: This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including: 1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry. 2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries. 3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob currently copy the DB's seqno to time mapping while holding db mutex and only copies the part of interest, a.k.a, the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to preferred seqno as much as possible. Test plans: Added unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455 Reviewed By: pdillinger Differential Revision: D55165422 Pulled By: jowlyzhang fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47 |
|
Richard Barnes | 6a1c2abe9d |
Remove extra semi colon from hbt/src/tagstack/tests/SlicerTest.cpp (#12461)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12461 X-link: https://github.com/facebookincubator/dynolog/pull/233 `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: rahku Differential Revision: D55087324 fbshipit-source-id: e8a03d33cad72a7d378e58f85eb550a03f6c2897 |
|
Kshitij Wadhwa | 4ce1dc930c |
don't run ZSTD_TrainDictionary in BlockBasedTableBuilder if there isn't compression needed (#12453)
Summary: fixes https://github.com/facebook/rocksdb/issues/12409 ### Issue ZSTD_TrainDictionary [[link]( |
|
Jay Huh | 3f3f4660bd |
wal_read_status check in RecoverLogFiles (#12460)
Summary: Fixing the not-checked status failure as in https://github.com/facebook/rocksdb/actions/runs/8334988399/job/22809612148. When the status is not ok() for any reason, we do not check the `wal_read_status` because it's not necessary. It's causing the test failure when running with `ASSERT_STATUS_CHECKED=1` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12460 Test Plan: Existing tests Reviewed By: ajkr Differential Revision: D55104844 Pulled By: jaykorean fbshipit-source-id: 919b1fddca835494f9087c51c4da6eabc9e8df2b |
|
Richard Barnes | 6ddfa5f061 |
Remove extra semi colon from internal_repo_rocksdb/repo/util/filelock_test.cc
Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D55087322 fbshipit-source-id: ca4db7285444306d6c91545cd2c33483dfe05385 |
|
Richard Barnes | fc40165614 |
Remove extra semi colon from internal_repo_rocksdb/repo/tools/ldb_cmd.cc
Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D54362227 fbshipit-source-id: ac634ba34f9351ba559c4ed96448f51d6ef33175 |
|
anand76 | 4868c10b44 |
Retry block reads on checksum mismatch (#12427)
Summary: On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`. Tests: Add new unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427 Reviewed By: ajkr Differential Revision: D55025941 Pulled By: anand1976 fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e |
|
Jay Huh | b4e9f5a400 |
Update Remote Compaction Tests to include more than one CF (#12430)
Summary: Update `compaction_service_test` to make sure remote compaction works with multiple column family set up. Minor refactor to get rid of duplicate code Fixing one quick bug in the existing test util: Test util's `FilesPerLevel` didn't honor `cf_id` properly) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12430 Test Plan: ``` ./compaction_service_test ``` Reviewed By: ajkr Differential Revision: D54883035 Pulled By: jaykorean fbshipit-source-id: 83b4f6f566fed5c4824bfef7de01074354a72b44 |
|
Hui Xiao | 2443ebf810 |
Don't write to WAL after previous WAL write error (#12448)
Summary:
**Context/Summary:**
WAL write can continue onto the the WAL file that has encountered error and thus crash at
|
|
Levi Tamasi | a93edea7e5 |
Deduplicate WriteBatchWithIndex::{Get,GetEntity}FromBatch (#12442)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12442 The patch deduplicates and unifies the logic of `WriteBatchWithIndex::{Get,GetEntity}FromBatch` using templates and makes some small code hygiene improvements, including consistently clearing the output value in the various non-success cases. Reviewed By: jaykorean Differential Revision: D54922935 fbshipit-source-id: c92e89f905a3c80cef57c2c840f49f806629238f |
|
Jay Huh | db1dea22b1 |
MultiCfIterator Implementations (#12422)
Summary: This PR continues https://github.com/facebook/rocksdb/issues/12153 by implementing the missing `Iterator` APIs - `Seek()`, `SeekForPrev()`, `SeekToLast()`, and `Prev`. A MaxHeap Implementation has been added to handle the reverse direction. The current implementation does not include upper/lower bounds yet. These will be added in subsequent PRs. The API is still marked as under construction and will be lifted after being added to the stress test. Please note that changing the iterator direction in the middle of iteration is expensive, as it requires seeking the element in each iterator again in the opposite direction and rebuilding the heap along the way. The first `Next()` after `SeekForPrev()` requires changing the direction under the current implementation. We may optimize this in later PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12422 Test Plan: The `multi_cf_iterator_test` has been extended to cover the API implementations. Reviewed By: pdillinger Differential Revision: D54820754 Pulled By: jaykorean fbshipit-source-id: 9eb741508df0f7bad598fb8e6bd5cdffc39e81d1 |
|
Changyu Bi | 3d5be596a5 |
Fix a bug in iterator with UDT + `ReadOptions::pin_data` (#12451)
Summary: with https://github.com/facebook/rocksdb/issues/12414 enabling `ReadOptions::pin_data`, this bug surfaced as corrupted per key-value checksum during crash test. `saved_key_.GetUserKey()` could be pinned user key, so DBIter should not overwrite it. In one case, it only surfaces when iterator skips many keys of the same user key. To stress that code path, this PR also added `max_sequential_skip_in_iterations` to crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12451 Test Plan: - Set ReadOptions::pin_data to true, the bug can be reproed quickly with `./db_stress --persist_user_defined_timestamps=1 --user_timestamp_size=8 --writepercent=35 --delpercent=4 --delrangepercent=1 --iterpercent=20 --nooverwritepercent=1 --prefix_size=8 --prefixpercent=10 --readpercent=30 --memtable_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --clear_column_family_one_in=0`. - Set max_sequential_skip_in_iterations to 1 for the other occurrence of the bug. Reviewed By: jowlyzhang Differential Revision: D55003766 Pulled By: cbi42 fbshipit-source-id: 23e1049129456684dafb028b6132b70e0afc07fb |
|
Yu Zhang | f2546b6623 |
Support returning write unix time in iterator property (#12428)
Summary: This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is: 1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned. 2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown. Needed follow up: 1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it. 2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion. 3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428 Test Plan: Added unit test Reviewed By: pdillinger Differential Revision: D54967067 Pulled By: jowlyzhang fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a |
|
Andrew Kryczka | 4d5ebad971 |
Fix kBlockCacheTier read with table cache miss (#12443)
Summary: Thanks ltamasi for pointing out this bug. We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443 Reviewed By: cbi42 Differential Revision: D54930128 Pulled By: ajkr fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d |
|
Andrew Kryczka | 3f5bd46a07 |
Add `ContinueCallback` to `GetMergeOperands()` (#12438)
Summary: The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because: - Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge - The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438 Reviewed By: hx235 Differential Revision: D54914669 Pulled By: ajkr fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a |
|
Peter Dillinger | c3c0cfc3a8 |
Create an UnownedPtr type (#12447)
Summary: ... that is more hygienic as an "optional reference" than a raw pointer, and likely more efficient than std::optional<std::reference_wrapper<T>>. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12447 Test Plan: unit test included (with manual verification that "must not compile" sections currently do not) Reviewed By: jowlyzhang Differential Revision: D54957917 Pulled By: pdillinger fbshipit-source-id: bbd89218df803617b1a170ebddc9e56c9b52bf93 |
|
Changyu Bi | 096fb9b67d |
Fix data race in WalManager (#12439)
Summary: Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439 Test Plan: - existing UT - not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests Reviewed By: anand1976 Differential Revision: D54920817 Pulled By: cbi42 fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6 |
|
Yu Zhang | 1104eaa35e |
Add initial support for TimedPut API (#12419)
Summary: This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API. The life cycle of such an entry on the write/flush/compaction paths are: 1) It is initially added to memtable as: `<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}` 2) When it's flushed to L0 sst files, it's converted to: `<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}` when we have easy access to the seqno to time mapping. 3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become: `<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap. On the read path: A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry, the unix_write_time/preferred seqno part packed in value is completely ignored. Needed follow ups: 1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable. 2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types. 3) Stress test coverage for the feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D54920296 Pulled By: jowlyzhang fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e |
|
Changyu Bi | f77b788545 |
Fix a bug in `LRUCacheShard::LRU_Insert` (#12429)
Summary: we saw crash test fail with ``` lru_cache.cc:249: void rocksdb::lru_cache::LRUCacheShard::LRU_Remove(rocksdb::lru_cache::LRUHandle *): Assertion `high_pri_pool_usage_ >= e->total_charge' failed. ``` One cause for this is that `lru_low_pri_` pointer is not updated in `LRU_insert()` before we try to balance high pri and low pri pool in `MaintainPoolSize();`. A repro unit test is provided. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12429 Test Plan: Not able to reproduce the failure with db_stress yet. `./lru_cache_test --gtest_filter="*InsertAfterReducingCapacity*`. It fails the assertion before this PR. Reviewed By: pdillinger Differential Revision: D54908919 Pulled By: cbi42 fbshipit-source-id: f485fdbc0ea61c8092a0be5fe561a59c15c78fd3 |
|
Hui Xiao | fa4978c566 |
Re-suppress tolerable manual compaction stress test failures (#12437)
Summary: **Context/Summary:** Previously manual compaction stress test failures won't terminate stress test. https://github.com/facebook/rocksdb/pull/12414 made more manual compaction failures terminate the stress test for signal boosting. A downside to that PR: some tolerable manual compaction stress test failures also unnecessarily terminate stress test. Ideally we should exclude exactly those tolerable errors (left as a TODO) from being able to terminate. For now we approximate those errors by Aborted(), InvalidArgument(), NotSupported() etc. It's still an improvement to pre-https://github.com/facebook/rocksdb/pull/12414 situation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12437 Test Plan: - No more tolerable manual compaction stress test failures terminating stress test. Reviewed By: cbi42 Differential Revision: D54913010 Pulled By: hx235 fbshipit-source-id: c43fa79d8f9c1c8b4f8786f8f46508b0ad619a9e |
|
Changyu Bi | e91263edb9 |
Fix data race in AutoRollLogger (#12436)
Summary: `logger_` can be destructed in `ResetLogger()` so we should access them under `mutex_`. Similarly `status_` can be updated only under `mutex_` or in constructor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12436 Test Plan: I tried running tsan crash test with log_file_time_to_roll = 2, but not able to repro yet. Will monitor internal crash tests. Reviewed By: hx235 Differential Revision: D54916371 Pulled By: cbi42 fbshipit-source-id: 4a3e3b40fbc2ae242598afdbd4bed5fb8ccf8d65 |
|
Peter Dillinger | dd24bda137 |
Fix windows build and CI (#12426)
Summary: Issue https://github.com/facebook/rocksdb/issues/12421 describes a regression in the migration from CircleCI to GitHub Actions in which failing build steps no longer fail Windows CI jobs. In GHA with pwsh (new preferred powershell command), only the last non-builtin command (or something like that) affects the overall success/failure result, and failures in external commands do not exit the script, even with `$ErrorActionPreference = 'Stop'` and `$PSNativeCommandErrorActionPreference = $true`. Switching to `powershell` causes some obscure failure (not seen in CircleCI) about the `-Lo` option to `curl`. Here we work around this using the only reasonable-but-ugly way known: explicitly check the result after every non-trivial build step. This leaves us highly susceptible to future regressions with unchecked build steps in the future, but a clean solution is not known. This change also fixes the build errors that were allowed to creep in because of the CI regression. Also decreased the unnecessarily long running time of DBWriteTest.WriteThreadWaitNanosCounter. For background, this problem explicitly contradicts GitHub's documentation, and GitHub has known about the problem for more than a year, with no evidence of caring or intending to fix. https://github.com/actions/runner-images/issues/6668 Somehow CircleCI doesn't have this problem. And even though cmd.exe and powershell have been perpetuating DOS-isms for decades, they still seem to be a somewhat active "hot mess" when it comes to sensible, consistent, and documented behavior. Fixes https://github.com/facebook/rocksdb/issues/12421 A history of some things I tried in development is here: https://github.com/facebook/rocksdb/compare/main...pdillinger:rocksdb:debug_windows_ci_orig Pull Request resolved: https://github.com/facebook/rocksdb/pull/12426 Test Plan: CI, including https://github.com/facebook/rocksdb/issues/12434 where I have temporarily enabled other Windows builds on PR with this change Reviewed By: cbi42 Differential Revision: D54903698 Pulled By: pdillinger fbshipit-source-id: 116bcbebbbf98f347c7cf7dfdeebeaaed7f76827 |
|
Levi Tamasi | 7c290f72b8 |
Implement WriteBatchWithIndex::GetEntityFromBatch (#12424)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12424 The PR adds a wide-column point lookup API `GetEntityFromBatch` to `WriteBatchWithIndex`. Similarly to APIs like `DB::GetEntity`, this new API returns wide-column entities as-is, and wraps plain values in an entity with a single column (the anonymous default column). Also, similarly to `WriteBatchWithIndex::GetFromBatch`, it only reads data from the batch itself. Reviewed By: jaykorean Differential Revision: D54826535 fbshipit-source-id: 92604f3ebd90fe1afbd36f2d2194b7dee0011efa |
|
Changyu Bi | ba022dd44c |
Disable `enable_checksum_handoff` in crash test (#12431)
Summary: since it been causing a few crash tests failures, I suspect it'll be easy to repro locally. Also fixed how to print its corruption message so it does not crash with output cannot be utf-8 decoded. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12431 Reviewed By: hx235 Differential Revision: D54881023 Pulled By: cbi42 fbshipit-source-id: 47208a637cd69b30d2545154849405e37db62ed3 |
|
Hui Xiao | 30243c6573 |
Add missing db crash options (#12414)
Summary: **Context/Summary:** We are doing a sweep in all public options, including but not limited to the `Options`, `Read/WriteOptions`, `IngestExternalFileOptions`, cache options.., to find and add the uncovered ones into db crash. The options included in this PR require minimum changes to db crash other than adding the options themselves. A bonus change: to surface new issues by improved coverage in stderror, we decided to fail/terminate crash test for manual compactions (CompactFiles, CompactRange()) on meaningful errors. See https://github.com/facebook/rocksdb/pull/12414/files#diff-5c4ced6afb6a90e27fec18ab03b2cd89e8f99db87791b4ecc6fa2694284d50c0R2528-R2532, https://github.com/facebook/rocksdb/pull/12414/files#diff-5c4ced6afb6a90e27fec18ab03b2cd89e8f99db87791b4ecc6fa2694284d50c0R2330-R2336 for more. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12414 Test Plan: - Run `python3 ./tools/db_crashtest.py --simple blackbox` for 10 minutes to ensure no trivial failure - Run `python3 tools/db_crashtest.py --simple blackbox --compact_files_one_in=1 --compact_range_one_in=1 --read_fault_one_in=1 --write_fault_one_in=1 --interval=50` for a while to ensure the bonus change does not result in trivial crash/termination of stress test Reviewed By: ajkr, jowlyzhang, cbi42 Differential Revision: D54691774 Pulled By: hx235 fbshipit-source-id: 50443dfb6aaabd8e24c79a2e42b68c6de877be88 |
|
Pavel Ferencz | 122510dc09 |
Workaround for an issue with Cmake builds that happens when cross-compiling using Cmake on macOS x86_64 CPUs with target CPU arm64 (#12240)
Summary: This is a temporary workaround for Cmake bug that only sets the correct CMAKE_SYSTEM_PROCESSOR for cross-compilation when target CMAKE_SYSTEM_NAME differs from CMAKE_HOST_NAME: https://gitlab.kitware.com/cmake/cmake/-/issues/25640 Fix cross-compilation on macOS x86_64 CPUs with target CPU arm64 by manually setting CMAKE_SYSTEM_PROCESSOR to the cross-compilation target whenever CMAKE_SYSTEM_PROCESSOR doesn't match CMAKE_OSX_ARCHITECTURES after project() call. This is probably a Cmake bug that happens on macOS. Closes https://github.com/facebook/rocksdb/issues/12239 The issue happens when RocksDB is built using the follwoing command: cmake -G "Unix Makefiles" -DCMAKE_SYSTEM_PROCESSOR=arm64 .. The build itself succeeds, but because Cmake wrongly sets CMAKE_SYSTEM_PROCESSOR to x86_64 instead of arm64 and causes crc32c_arm64.cc not to be compiled. This in turn makes the project fails any linking with RocksDB: ``` Undefined symbols for architecture arm64: "crc32c_arm64(unsigned int, unsigned char const*, unsigned long)", referenced from: rocksdb::crc32c::ExtendARMImpl(unsigned int, char const*, unsigned long) in librocksdb.a(crc32c.cc.o) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12240 Reviewed By: cbi42 Differential Revision: D54811365 Pulled By: pdillinger fbshipit-source-id: 0958e3092806dadd2f61d582b7251af13a5f3f06 |
|
Peter Dillinger | c0ae5be934 |
Disable flaky part of TransactionLogIteratorCheckWhenArchive (#12423)
Summary: https://github.com/facebook/rocksdb/issues/12397 attempted to make the test more honest about its failures, and they're really showing up in CI now (but not locally). Disable pending investigation Pull Request resolved: https://github.com/facebook/rocksdb/pull/12423 Test Plan: watch CI Reviewed By: ltamasi Differential Revision: D54817705 Pulled By: pdillinger fbshipit-source-id: 4721834c49b225ac52d1a28ecb06b9d05de977b3 |
|
Alan Paxton | d9a441113e |
JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344)
Summary: Implement RAII-based helpers for JNIGet() and multiGet() Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`. The model is to entirely do away with a single helper, instead a number of utility methods allow each separate JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`, `db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results. Roughly speaking: * get keys into C++ form * Call C++ Get() * get results and status into Java form We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes: ## Before: ``` Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 116.313 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s ``` ## After: ``` Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5191.074 ± 78.250 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5378.692 ± 260.682 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2590.183 ± 34.844 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1634.793 ± 34.022 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 786.455 ± 8.462 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 256 thrpt 25 5285.055 ± 11.676 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 1024 thrpt 25 5586.758 ± 213.008 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 4096 thrpt 25 2527.172 ± 17.106 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 16384 thrpt 25 1819.547 ± 12.958 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 65536 thrpt 25 803.861 ± 9.963 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 256 thrpt 25 5253.793 ± 28.020 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 1024 thrpt 25 5705.591 ± 20.556 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 4096 thrpt 25 2523.377 ± 15.415 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 16384 thrpt 25 1815.344 ± 11.309 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 65536 thrpt 25 820.792 ± 3.192 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 256 thrpt 25 5262.184 ± 20.477 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 1024 thrpt 25 5706.959 ± 23.123 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 4096 thrpt 25 2520.362 ± 9.170 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 16384 thrpt 25 1789.185 ± 14.239 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 65536 thrpt 25 818.401 ± 12.132 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6978.310 ± 14.084 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7664.242 ± 22.304 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2881.778 ± 81.054 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1599.826 ± 7.190 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 737.520 ± 6.809 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 256 thrpt 25 6974.376 ± 10.716 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 1024 thrpt 25 7637.440 ± 45.877 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 4096 thrpt 25 2820.472 ± 42.231 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 16384 thrpt 25 1716.663 ± 8.527 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 65536 thrpt 25 755.848 ± 7.514 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 256 thrpt 25 6943.651 ± 20.040 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 1024 thrpt 25 7679.415 ± 9.114 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 4096 thrpt 25 2844.564 ± 13.388 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 16384 thrpt 25 1729.545 ± 5.983 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 65536 thrpt 25 783.218 ± 1.530 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 256 thrpt 25 6944.276 ± 29.995 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 1024 thrpt 25 7670.301 ± 8.986 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 4096 thrpt 25 2839.828 ± 12.421 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 16384 thrpt 25 1730.005 ± 9.209 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 65536 thrpt 25 787.096 ± 1.977 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6896.944 ± 21.530 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7622.407 ± 12.824 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2927.538 ± 19.792 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1598.041 ± 4.312 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 744.564 ± 9.236 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 256 thrpt 25 6853.760 ± 78.041 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 1024 thrpt 25 7360.917 ± 355.365 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 4096 thrpt 25 2848.774 ± 13.409 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 16384 thrpt 25 1727.688 ± 3.329 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 65536 thrpt 25 776.088 ± 7.517 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 256 thrpt 25 6910.339 ± 14.366 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 1024 thrpt 25 7633.660 ± 10.830 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 4096 thrpt 25 2787.799 ± 81.775 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 16384 thrpt 25 1726.517 ± 6.830 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 65536 thrpt 25 787.597 ± 3.362 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 256 thrpt 25 6922.445 ± 10.493 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 1024 thrpt 25 7604.710 ± 48.043 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 4096 thrpt 25 2848.788 ± 15.783 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 16384 thrpt 25 1730.837 ± 6.497 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 65536 thrpt 25 794.557 ± 1.869 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6918.716 ± 15.766 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 7626.692 ± 9.394 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2871.382 ± 72.155 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1598.786 ± 4.819 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 748.469 ± 7.234 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 256 thrpt 25 6922.666 ± 17.131 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 1024 thrpt 25 7623.890 ± 8.805 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 4096 thrpt 25 2850.698 ± 18.004 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 16384 thrpt 25 1727.623 ± 4.868 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 65536 thrpt 25 774.534 ± 10.025 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 256 thrpt 25 5486.251 ± 13.582 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 1024 thrpt 25 4920.656 ± 44.557 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 4096 thrpt 25 3922.913 ± 25.686 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 16384 thrpt 25 2873.106 ± 4.336 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 65536 thrpt 25 802.404 ± 8.967 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 256 thrpt 25 4817.996 ± 18.042 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 1024 thrpt 25 4243.922 ± 13.929 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 4096 thrpt 25 3175.998 ± 7.773 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 16384 thrpt 25 2321.990 ± 12.501 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 65536 thrpt 25 1753.028 ± 7.130 ops/s ``` Closes https://github.com/facebook/rocksdb/issues/11518 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344 Reviewed By: cbi42 Differential Revision: D54809714 Pulled By: pdillinger fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb |
|
Changyu Bi | 36c1b0aded |
Allow SstFileReader to verify number of entries in SST files (#12418)
Summary: Add `SstFileReader::VerifyNumEntries()` for this purpose. I added the same functionality to `sst_dump` in https://github.com/facebook/rocksdb/issues/12322. Since sst_file_reader.h is exposed to users while sst_dump.h is not, it seems more appropriate to add SST files related APIs here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12418 Test Plan: `./sst_file_reader_test --gtest_filter="*VerifyNumEntries*"` Reviewed By: jowlyzhang Differential Revision: D54764271 Pulled By: cbi42 fbshipit-source-id: 22ebfe04bbb0b152762cee13d4210b147b36d3e9 |
|
Alan Paxton | c4d37da826 |
Java API - Fix handling of CF handles in DB subclasses (#12417)
Summary: The most general `open()` method for each of RocksDB, TtlDB, OptimisticTransactionDB and TransactionDB should - ensure the default CF is supplied in the list of descriptors - cache the default CF handle - store open CF handles for automatic close on DB close The `close()` method in each of these DB subclasses should `close()` all the owned CF handles. I can’t find a cleaner way to build some generalised open/close that does this for all DB subclasses, so it exists as cut and paste with variations in the 4 different DB subclasses. Added some slightly paranoid testing that CF handles explicitly referred to as default in a list of CF handles in the general open methods, and the simple open that doesn’t supply a CF, end up reading and writing to the same CF. Prompted by the fact that this code is a bit opaque; the first returned handle is the DB. As part of this, fix the bug where the Java side of `OptimisticsTransactionDB` was not setting up default column family; this was visible when setting up an iterator; add a test to validate that the iterator is OK. A single Java reference to the default column family was not being created in the OptimisticsTransactionDB RocksDB subclass; it should be created in all subclasses. The same problem had previously been fixed for TtlDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12417 Reviewed By: ajkr Differential Revision: D54807643 Pulled By: pdillinger fbshipit-source-id: 66f34e56a822a009a8f2018d401cf8940d91aa35 |
|
Peter Dillinger | 7622029101 |
Fix flaky TransactionLogIteratorCheckWhenArchive (#12397)
Summary: Seen in https://github.com/facebook/rocksdb/actions/runs/8086592802/job/22096691572?pr=12388 ``` [ RUN ] DBTestXactLogIterator.TransactionLogIteratorCheckWhenArchive db/db_log_iter_test.cc:173:23: runtime error: member call on address 0x0000023956f0 which does not point to an object of type 'rocksdb::DBTestXactLogIterator' 0x0000023956f0: note: object is of type 'rocksdb::DBTestBase' 00 00 00 00 98 ae f7 da 75 7f 00 00 a0 5d 39 02 00 00 00 00 80 ff 39 02 00 00 00 00 95 00 00 00 ^~~~~~~~~~~~~~~~~~~~~~~ vptr for 'rocksdb::DBTestBase' UndefinedBehaviorSanitizer: undefined-behavior db/db_log_iter_test.cc:173:23 in ``` This is almost certainly caused by the sync point callback happening on asynchronous file deletion in the DB while the end of the test is reached and the destruction of the `DBTestXactLogIterator` has reached `DBTestBase::~DBTestBase()`. Either closing the DB or disabling sync points before the end of the test should suffice to fix, and we'll do both. And assert that the sync point callback is actually hit each time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12397 Test Plan: unable to reproduce, but ran 1000 iterations of the test with UBSAN Reviewed By: ltamasi Differential Revision: D54326687 Pulled By: pdillinger fbshipit-source-id: cc09a4dcd2f237d5b45d910364d6aa56bbd46d50 |
|
anand76 | 179afd5bef |
Add a FS flag to detect and correct corruption (#12408)
Summary: Add a flag in `IOOptions` to request the file system to make best efforts to detect data corruption and reconstruct the data if possible. This will be used by RocksDB to retry a read if the previous read returns corrupt data (checksum mismatch). Add a new op to `FSSupportedOps` that, if supported, will trigger this behavior in RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12408 Reviewed By: akankshamahajan15 Differential Revision: D54564218 Pulled By: anand1976 fbshipit-source-id: bc401dcd22a7d320bf63b5183c41714acdce39f5 |
|
Yu Zhang | 210c8df820 |
Use do_validate flag to control timestamp based validation in WriteCommittedTxn::GetForUpdate (#12369)
Summary: When PR https://github.com/facebook/rocksdb/issues/9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it adds this usage mandate for API `GetForUpdate` when UDT is enabled. The `do_validate` flag has to be true, and user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is this: 1) with do_vaildate = true, `GetForUpdate` could verify this relationships: let's denote the user-defined timestamp in db for the key as `Ts_db` and the read timestamp user set via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT based validation will only pass if `Ts_db <= Ts_read`. |
|
Andrew Kryczka | 27a2473668 |
Best-effort recovery support for atomic flush (#12406)
Summary: This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the features are now allowed to both be enabled at once. The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits. Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery. The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios. The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary. - `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates. - `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406 Reviewed By: jaykorean, cbi42 Differential Revision: D54494904 Pulled By: ajkr fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52 |
|
Radek Hubner | 583fded565 |
Fix regression for Javadoc jar build (#12404)
Summary: https://github.com/facebook/rocksdb/issues/12371 Introduced regression not defining dependency between `create_javadoc` and `rocksdb_javadocs_jar` build targets. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12404 Reviewed By: pdillinger Differential Revision: D54516862 Pulled By: ajkr fbshipit-source-id: 785a99b2caf979395ae0de60e40e7d1b93059adb |
|
Peter Dillinger | a53ed91691 |
Fix/improve temperature handling for file ingestion (#12402)
Summary: Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388 In terms of public API: * Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre. * Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general). * Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature. * Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost" In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed. Eventual suggested follow-up: * Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare). * More temperature handling for CreateColumnFamilyWithImport and Checkpoints. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402 Test Plan: Deeply revamped ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer. Used macros instead of functions for better tracing to critical source location on test failures. Some enhancements to FileTemperatureTestFS in the process of developing the revamped test. Reviewed By: jowlyzhang Differential Revision: D54442794 Pulled By: pdillinger fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995 |
|
Jay Huh | 3412195367 |
Introduce MultiCfIterator (#12153)
Summary: This PR introduces a new implementation of `Iterator` via a new public API called `NewMultiCfIterator()`. The new API takes a vector of column family handles to build a cross-column-family iterator, which internally maintains multiple `DBIter`s as child iterators from a consistent database state. When a key exists in multiple column families, the iterator selects the value (and wide columns) from the first column family containing the key, following the order provided in the `column_families` parameter. Similar to the merging iterator, a min heap is used to iterate across the child iterators. Backward iteration and direction change functionalities will be implemented in future PRs. The comparator used to compare keys across different column families will be derived from the iterator of the first column family specified in `column_families`. This comparator will be checked against the comparators from all other column families that the iterator will traverse. If there's a mismatch with any of the comparators, the initialization of the iterator will fail. Please note that this PR is not enough for users to start using `MultiCfIterator`. The `MultiCfIterator` and related APIs are still marked as "**DO NOT USE - UNDER CONSTRUCTION**". This PR is just the first of many PRs that will follow soon. This PR includes the following: - Introduction and partial implementation of the `MultiCfIterator`, which implements the generic `Iterator` interface. The implementation includes the construction of the iterator, `SeekToFirst()`, `Next()`, `Valid()`, `key()`, `value()`, and `columns()`. - Unit tests to verify iteration across multiple column families in two distinct scenarios: (1) keys are unique across all column families, and (2) the same keys exist in multiple column families. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12153 Reviewed By: pdillinger Differential Revision: D52308697 Pulled By: jaykorean fbshipit-source-id: b03e69f13b40af5a8f0598d0f43a0bec01ef8294 |
|
jsteemann | 3fff57fa6a |
fix linking without thread status support (#12400)
Summary: When compiling with `-DNROCKSDB_THREAD_STATUS`, some functions in ThreadStatusUtil are declared but their definition is missing. Their definitions are only compiled when not defining `NROCKSDB_THREAD_STATUS`. This causes problems on linking, when the linker cannot find the definitions of - ThreadStatusUtil::GetThreadOperation - ThreadStatusUtil::SetEnableTracking This PR fixes it by adding stubs for these functions in case `NROCKSDB_THREAD_STATUS` is defined. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12400 Reviewed By: ajkr Differential Revision: D54510769 Pulled By: cbi42 fbshipit-source-id: e79e9257492d3dba59615e9e306df7e79838d73b |
|
yuzhangyu@fb.com | 1cfdece85d |
Run internal cpp modernizer on RocksDB repo (#12398)
Summary: When internal cpp modernizer attempts to format rocksdb code, it will replace macro `ROCKSDB_NAMESPACE` with its default definition `rocksdb` when collapsing nested namespace. We filed a feedback for the tool T180254030 and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested us to run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains: Running `xplat/scripts/codemod_service/cpp_modernizer.sh` in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that has a non meta copyright comment) without swapping out the namespace macro `ROCKSDB_NAMESPACE` Followed by RocksDB's own `make format` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398 Test Plan: Auto tests Reviewed By: hx235 Differential Revision: D54382532 Pulled By: jowlyzhang fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa |
|
Richard Barnes | d7b8756976 |
Remove extra semi colon from internal_repo_rocksdb/repo/db/table_cache_sync_and_async.h
Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D54362208 fbshipit-source-id: a47acd4c794c899fccb65285b116b50d9566ea12 |
|
Richard Barnes | ced333ee45 |
Remove extra semi colon from instagram/ranking/mezql/shots/parser/fast/Token.cpp
Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D54362213 fbshipit-source-id: 0bbc9e5fce917fc4f72423f0a4c8cb2c2b1759dd |
|
jsteemann | 965364972d |
fix compile warning (#12399)
Summary: Fix compile warning ``` monitoring/thread_status_util.cc: In static member function ‘static void rocksdb::ThreadStatusUtil::NewColumnFamilyInfo(const rocksdb::DB*, const rocksdb::ColumnFamilyData*, const std::string&, const rocksdb::Env*)’: monitoring/thread_status_util.cc:193:55: warning: unused parameter ‘env’ [-Wunused-parameter] 193 | const Env* env) {} | ~~~~~~~~~~~^~~ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12399 Reviewed By: jaykorean Differential Revision: D54424333 Pulled By: cbi42 fbshipit-source-id: 3dcb89f85d3a63b1b0d0d6a8b277f49ce03b6d1a |
|
Jay Huh | c00c16855d |
Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395)
Summary: In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: https://github.com/facebook/rocksdb/issues/11925 #11943, and https://github.com/facebook/rocksdb/issues/11977. To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395 Test Plan: # Summary In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR #12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: #11925 #11943, and #11977. To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`. # Test Plan There should be no behavior changes. Existing tests and CI for the correctness tests. **Test for Perf Regression** Build ``` $> make -j64 release ``` Setup ``` $> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: pdillinger Differential Revision: D54332713 Pulled By: jaykorean fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5 |
|
Jay Huh | 5bcc184975 |
Update APIs to support generic unique identifier format (#12384)
Summary: The current design proposes using a combination of `job_id`, `db_id`, and `db_session_id` to create a unique identifier for remote compaction jobs. However, this approach may not be suitable for users who prefer a different format for the unique identifier. At Meta, we are utilizing generic compute offload to offload compaction tasks to remote workers. The compute offload client generates a UUID for each task, which requires an update to the current RocksDB API for onboarding purposes. Users still have the option to create the unique identifier by combining `job_id`, `db_id`, and `db_session_id` if they prefer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12384 Test Plan: ``` $> ./compaction_service_test 13:29:35 [==========] Running 14 tests from 1 test case. [----------] Global test environment set-up. [----------] 14 tests from CompactionServiceTest [ RUN ] CompactionServiceTest.BasicCompactions [ OK ] CompactionServiceTest.BasicCompactions (2642 ms) [ RUN ] CompactionServiceTest.ManualCompaction [ OK ] CompactionServiceTest.ManualCompaction (454 ms) [ RUN ] CompactionServiceTest.CancelCompactionOnRemoteSide [ OK ] CompactionServiceTest.CancelCompactionOnRemoteSide (1643 ms) [ RUN ] CompactionServiceTest.FailedToStart [ OK ] CompactionServiceTest.FailedToStart (1332 ms) [ RUN ] CompactionServiceTest.InvalidResult [ OK ] CompactionServiceTest.InvalidResult (1516 ms) [ RUN ] CompactionServiceTest.SubCompaction [ OK ] CompactionServiceTest.SubCompaction (551 ms) [ RUN ] CompactionServiceTest.CompactionFilter [ OK ] CompactionServiceTest.CompactionFilter (563 ms) [ RUN ] CompactionServiceTest.Snapshot [ OK ] CompactionServiceTest.Snapshot (124 ms) [ RUN ] CompactionServiceTest.ConcurrentCompaction [ OK ] CompactionServiceTest.ConcurrentCompaction (660 ms) [ RUN ] CompactionServiceTest.CompactionInfo [ OK ] CompactionServiceTest.CompactionInfo (984 ms) [ RUN ] CompactionServiceTest.FallbackLocalAuto [ OK ] CompactionServiceTest.FallbackLocalAuto (343 ms) [ RUN ] CompactionServiceTest.FallbackLocalManual [ OK ] CompactionServiceTest.FallbackLocalManual (380 ms) [ RUN ] CompactionServiceTest.RemoteEventListener [ OK ] CompactionServiceTest.RemoteEventListener (491 ms) [ RUN ] CompactionServiceTest.TablePropertiesCollector [ OK ] CompactionServiceTest.TablePropertiesCollector (169 ms) [----------] 14 tests from CompactionServiceTest (11854 ms total) [----------] Global test environment tear-down [==========] 14 tests from 1 test case ran. (11855 ms total) [ PASSED ] 14 tests. ``` Reviewed By: hx235 Differential Revision: D54220339 Pulled By: jaykorean fbshipit-source-id: 5a9054f31933d1996adca02082eb37b6d5353224 |
|
Changyu Bi | 4aed229fa7 |
Add `write_memtable_time` to perf level `kEnableWait` (#12394)
Summary: .. so write time can be measured under the new perf level for single-threaded writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12394 Test Plan: * add a new UT `PerfContextTest.WriteMemtableTimePerfLevel` Reviewed By: anand1976 Differential Revision: D54326263 Pulled By: cbi42 fbshipit-source-id: d0e334d9581851ba6cf53c776c0bd876365d1e00 |
|
Peter Dillinger | 13ef21c22e |
default_write_temperature option (#12388)
Summary: Currently SST files that aren't applicable to last_level_temperature nor file_temperature_age_thresholds are written with temperature kUnknown, which is a little weird and doesn't support CF-based tiering. The default_temperature option only affects how kUnknown is interpreted for stats. This change adds a new per-CF option default_write_temperature that determines the temperature of new SST files when those other options do not apply. Also made a change to ignore last_level_temperature with FIFO compaction, because I found that could lead to an infinite loop in compaction. Needed follow-up: Fix temperature handling with external file ingestion Pull Request resolved: https://github.com/facebook/rocksdb/pull/12388 Test Plan: unit tests extended appropriately. (Ignore whitespace changes when reviewing.) Reviewed By: jowlyzhang Differential Revision: D54266574 Pulled By: pdillinger fbshipit-source-id: c9ec9a74dbf22be6e986f77f9689d05fea8ef0bb |
|
Adam Retter | 5458eda5f0 |
Pass build parallelism flag to Docker builds (#12392)
Summary: Passed the `-j` flag through to builds happening inside Docker containers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12392 Reviewed By: cbi42 Differential Revision: D54311937 Pulled By: ajkr fbshipit-source-id: 5cf1bfe4b9059cc2d078fb5331812f32cf9e89ab |
|
Greg Sadetsky | eab876bb49 |
fix out of date macos instructions in INSTALL.md (#12393)
Summary: closes https://github.com/facebook/rocksdb/issues/12349 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12393 Reviewed By: cbi42 Differential Revision: D54311983 Pulled By: ajkr fbshipit-source-id: 3109ad80bdd5d656756364d3d2a60dd15c339fcc |