Summary:
cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937
Reviewed By: cbi42
Differential Revision: D50105793
Pulled By: ajkr
fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.
This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in many more foreground operations, as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.
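For context, a minimal sketch of the two write modes this refers to, using the existing public `WriteOptions` (illustration only, not code from this PR):
```
#include "rocksdb/db.h"

// Illustration of the two write modes: a `no_slowdown` write refuses to block
// behind a stall/stop, while a default write waits -- and with this change it
// may also wait on an in-progress error recovery instead of failing right away.
void WriteExamples(rocksdb::DB* db) {
  rocksdb::WriteOptions blocking;  // no_slowdown == false by default
  // May wait for a stall or an ongoing recovery to clear before proceeding.
  rocksdb::Status s1 = db->Put(blocking, "key", "value");
  s1.PermitUncheckedError();

  rocksdb::WriteOptions non_blocking;
  non_blocking.no_slowdown = true;
  // Fails fast with Status::Incomplete() instead of waiting.
  rocksdb::Status s2 = db->Put(non_blocking, "key", "value");
  if (s2.IsIncomplete()) {
    // Caller decides whether to retry later.
  }
}
```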
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879
Reviewed By: anand1976
Differential Revision: D49571196
Pulled By: ajkr
fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11930
The patch cleans up and refactors the logic in/around `WriteBatchWithIndexInternal` a bit as groundwork for further changes. Specifically, the class is turned back into a stateless collection of static helpers (which is the way it was before PR 6851). Note that there were two apparent reasons for introducing this instance state in PR 6851: a) encapsulating `MergeContext` and b) resolving objects like `Logger` and `Statistics` based on a variety of handles. However, neither reason seems justified at this point. Regarding a), the `MultiGetFromBatchAndDB` logic passes in its own `MergeContext` objects via a second set of methods that do not use the member `MergeContext`. As for b), `Logger` and friends are only needed for Merge, which is only supported if a column family handle is provided; in turn, the column family handle enables us to resolve all the necessary objects without the need for any other handles like `DB` or `DBOptions`. In addition to the above, the patch changes the type of `BaseDeltaIterator::merge_result_` to `std::string` from `PinnableSlice` (since no pinning is ever done) and makes some other small code quality improvements.
Reviewed By: jaykorean
Differential Revision: D50038302
fbshipit-source-id: 5f34abe2e808bdaea0f3a8033b5764ebd446b85d
Summary:
Context/Summary: this option is experimental right now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11926
Test Plan: no code change
Reviewed By: jaykorean
Differential Revision: D49985000
Pulled By: hx235
fbshipit-source-id: a5b439ed35e3d6bb04c125f222ac29cd3842d1a1
Summary:
This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920):
* Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests.
* For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time.
Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.)
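As a rough illustration of the recommended setup (a sketch, not code from this PR; option values and CF names are arbitrary):
```
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Status OpenWithPreserveOptions(
    const std::string& path, std::vector<rocksdb::ColumnFamilyHandle*>* handles,
    rocksdb::DB** db) {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Create all initial CFs at Open so they benefit from the pre-populated
  // historical seqno-to-time mappings described above.
  options.create_missing_column_families = true;
  // Preserve/preclude settings present in the original options at Open time.
  options.preserve_internal_time_seconds = 3 * 60 * 60;
  options.preclude_last_level_data_seconds = 3 * 60 * 60;

  std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions(options)},
      {"hot_data", rocksdb::ColumnFamilyOptions(options)}};
  return rocksdb::DB::Open(rocksdb::DBOptions(options), path, cfs, handles, db);
}
```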
Recommended follow-up:
* Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect.
* Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.)
No release note for the minor bug fix because this is still an experimental feature with limited usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922
Test Plan: tests added / updated
Reviewed By: jowlyzhang
Differential Revision: D49956563
Pulled By: pdillinger
fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.
Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern (see the sketch after this list)
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places
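A minimal sketch of the combining pattern such a helper enables (a hypothetical free-function version, not the actual implementation added here; the step functions are stubs):
```
#include <utility>

#include "rocksdb/status.h"

// Hypothetical free-function version of the pattern: keep the first non-OK
// status while still executing (and checking) subsequent steps.
void UpdateIfOk(rocksdb::Status* base, rocksdb::Status&& latest) {
  if (base->ok()) {
    *base = std::move(latest);
  } else {
    // Mark the later status as checked so it doesn't trip status-checking
    // assertions in debug builds.
    latest.PermitUncheckedError();
  }
}

rocksdb::Status StepOneStub() { return rocksdb::Status::OK(); }
rocksdb::Status StepTwoStub() { return rocksdb::Status::IOError("disk full"); }

rocksdb::Status DoSeveralSteps() {
  rocksdb::Status s = StepOneStub();
  UpdateIfOk(&s, StepTwoStub());  // s keeps the first error it sees
  return s;
}
```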
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920
Test Plan: tests added / updated
Reviewed By: jaykorean
Differential Revision: D49919147
Pulled By: pdillinger
fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
Summary:
1. **Error** in TestIterateAgainstExpected API - `Assertion index < pre_read_expected_values.size() && index < post_read_expected_values.size() failed.`
**Fix** - The `Prev` op is not supported with `auto_readahead_size`, so support was added in db_iter to reseek if `Prev` is called. In BlockBasedTableIterator, index_iter_ has already moved forward, so there is no way to do `Prev` from BlockBasedTableIterator.
2. **Error** - `void rocksdb::BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(uint64_t, size_t, size_t&): Assertion index_iter_->value().handle.offset() == offset`
**Fix** - Stop using prefetch_buffer when the uncompressed dictionary is read.
3. **Error** in TestPrefixScan API - `db_stress: db/db_iter.cc:369: bool rocksdb::DBIter::FindNextUserEntryInternal(bool, const rocksdb::Slice*): Assertion !skipping_saved_key || CompareKeyForSkip(ikey_.user_key, saved_key_.GetUserKey()) > 0 failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
db_stress: table/merging_iterator.cc:1036: bool rocksdb::MergingIterator::SkipNextDeleted(): Assertion comparator_->Compare(range_tombstone_iters_[i]->start_key(), pik) <= 0 failed`
**Fix** - In some cases in db_iter.cc, SeekPrev also calls 1) SeekPrev, 2) Seek, and then 3) Prev, leading to failure of the Prev operation. These backward operations also call Seek. Added tracking of the direction to disable the lookup once the direction is backwards in BlockBasedTableIterator.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11884
Test Plan: Ran various flavors of crash tests locally for the whole duration
Reviewed By: anand1976
Differential Revision: D49834201
Pulled By: akankshamahajan15
fbshipit-source-id: 9a007b4d46a48002c43dc4623a400ecf47d997fe
Summary:
Since allowing a 24hr off-peak window by setting start_time = end_time is not so intuitive, we are not going to allow it (e.g. `00:00-00:00` doesn't look like a value that would cover 24hrs). Instead, we are going to compare at the minute level (i.e. dropping the seconds to the nearest minute) so that `00:00-23:59` will cover 24hrs. The entire minute from 23:59:00 to 23:59:59 will be covered with this change.
Minor fixes from previous PR
- release build error
- fixed random seed in test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11911
Test Plan:
`DBOptionsTest::OffPeakTimes`
`make -j64 static_lib` to test release build issue that was fixed
Reviewed By: pdillinger
Differential Revision: D49787795
Pulled By: jaykorean
fbshipit-source-id: e8d045b95f54f61d5dd5f1bb473579f8d55c18b3
Summary:
Recovery triggers flushes for very different scenarios:
(1) `FlushReason::kErrorRecoveryRetryFlush`: a flush failed
(2) `FlushReason::kErrorRecovery`: a WAL may be corrupted
(3) `FlushReason::kCatchUpAfterErrorRecovery`: immutable memtables may have accumulated
The old code called `FlushAllColumnFamilies()` in all cases, which uses the manual flush functions `AtomicFlushMemTables()` and `FlushMemTable()`. Forcing a flush of the latest data on all CFs was useful for (2) because it ensures all CFs move past the corrupted WAL.
However, those code paths were overkill for (1) and (3), where only already-immutable memtables need to be flushed. There were conditionals to exclude some of the extraneous logic but I found there was still too much happening. For example, both of the manual flush functions enter the write thread. Entering the write thread is inconvenient because then we can't allow stalled writes to wait on a retrying flush to finish.
Instead of continuing down the path of adding more conditionals to the manual flush functions, this PR introduces a dedicated function for cases (1) and (3): `RetryFlushesForErrorRecovery()`. Also I cleaned up the manual flush functions to remove existing conditionals for these cases as they're no longer needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11903
Reviewed By: cbi42
Differential Revision: D49693812
Pulled By: ajkr
fbshipit-source-id: 7630ac539b9d6c92052c13a3cdce53256134d990
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913
The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.
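For reference, a sketch of reading a wide-column base value with the public `GetEntity` API that the fix switches to internally (illustration only; the key and output handling are arbitrary):
```
#include <iostream>

#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

// Read the full wide-column entity for a key, rather than only the default
// column that Get() would return.
void PrintAllColumns(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cfh) {
  rocksdb::PinnableWideColumns result;
  rocksdb::Status s =
      db->GetEntity(rocksdb::ReadOptions(), cfh, "my_key", &result);
  if (!s.ok()) {
    return;
  }
  for (const rocksdb::WideColumn& column : result.columns()) {
    std::cout << column.name().ToString() << " = " << column.value().ToString()
              << "\n";
  }
}
```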
Reviewed By: jaykorean
Differential Revision: D49795097
fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
Summary:
After https://github.com/facebook/rocksdb/issues/11905, I am preparing a DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping. **Intended follow-up**
However, there are a number of test changes I want to make prior to that to make it clear that I am not regressing the tests and production behavior at the same time.
* Start mock time in the tests well beyond epoch (time 0) so that we aren't normally reaching into pre-history for current time minus the preserve/preclude duration.
* Majorly clean up BasicSeqnoToTimeMapping to avoid confusing hard-coded bounds on GetProximalTimeBeforeSeqno() results.
* There is an unresolved/unexplained issue marked with FIXME that should be investigated when GetProximalTimeBeforeSeqno() is put into production.
* MultiCFs test was strangely generating 5 L0 files, four of which would be compacted into an L1, and then letting TTL compaction compact 1@L0+1@L1. Changing the starting time of the tests seemed to mess up the TTL compaction. But I suspect the TTL compaction was unintentional, so I've cut it down to just 4 L0 files, which compacts predictably.
* Unrelated: allow ROCKSDB_NO_STACK=1 to skip printing a stack trace on assertion failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11917
Test Plan: no changes to production code
Reviewed By: jowlyzhang
Differential Revision: D49841436
Pulled By: pdillinger
fbshipit-source-id: 753348ace9c548e82bcb77fcc8b2ffb7a6beeb0a
Summary:
Separate the message for value mismatch from the message for an extra value in the DB
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11912
Reviewed By: hx235
Differential Revision: D49792137
Pulled By: ajkr
fbshipit-source-id: 311bc1801843a15367f409ead88ef755acbde468
Summary:
The forwarding of `GetFileSize()` was missed; this PR fixes that and marks `WritableFile::GetFileSize()` as pure virtual so such issues are detected at compile time.
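As background, the kind of forwarding that is easy to miss looks roughly like this (a generic sketch of a delegating wrapper, not the actual RocksDB wrapper class):
```
#include <cstdint>

#include "rocksdb/env.h"

// A delegating WritableFile that forwards calls to a wrapped target. If
// GetFileSize() is not overridden here, callers silently get the base-class
// behavior instead of the target's real size -- the class of bug this PR
// turns into a compile-time error by making the method pure virtual.
class ForwardingWritableFile : public rocksdb::WritableFile {
 public:
  explicit ForwardingWritableFile(rocksdb::WritableFile* target)
      : target_(target) {}

  using rocksdb::WritableFile::Append;
  rocksdb::Status Append(const rocksdb::Slice& data) override {
    return target_->Append(data);
  }
  rocksdb::Status Close() override { return target_->Close(); }
  rocksdb::Status Flush() override { return target_->Flush(); }
  rocksdb::Status Sync() override { return target_->Sync(); }

  // The forwarding that is easy to forget:
  uint64_t GetFileSize() override { return target_->GetFileSize(); }

 private:
  rocksdb::WritableFile* target_;
};
```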
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11726
Reviewed By: ajkr
Differential Revision: D49791240
Pulled By: jowlyzhang
fbshipit-source-id: ef219508d6b15c9a24df9b706a9fdc33cc6a286e
Summary:
Changed `DBOptions::fail_if_options_file_error` default from `false` to
`true`. It is safer to fail an operation by default when it encounters
an error.
Also changed the API doc to list items in the conventional way for listing items in a sentence. The slashes weren't working well as one got dropped, probably because it looked like a typo.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11800
Test Plan: rely on CI
Reviewed By: jowlyzhang
Differential Revision: D49030532
Pulled By: ajkr
fbshipit-source-id: e606062aa25f9063d8c6fb0d03aebca5c2bc56d3
Summary:
RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.
In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm", where the part before "-" is the start time and the part after is the end time, inclusive. It will later be used for resource optimization in subsequent PRs.
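For example (a sketch; the window shown is arbitrary):
```
#include "rocksdb/options.h"

rocksdb::Options MakeOptionsWithOffpeak() {
  rocksdb::Options options;
  // Off-peak window in UTC, formatted as "HH:mm-HH:mm" (end time inclusive).
  options.daily_offpeak_time_utc = "02:30-07:30";
  return options;
}
```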
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893
Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`
Reviewed By: pdillinger
Differential Revision: D49714553
Pulled By: jaykorean
fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
Summary:
This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up**
However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping.
Functional fixes / changes:
* This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc
* That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.)
* Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting the best results for GetProximalSeqnoBeforeTime (see the conceptual sketch after this list). This also makes it easier to make the contract of each API function clear.
* Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples.
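A conceptual sketch of the "proximal seqno before time" query under the contract described above (illustration only; this is not the actual `SeqnoToTimeMapping` code):
```
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct SeqnoTimePair {
  uint64_t seqno;
  uint64_t time;  // seqno is known to have been written at or before `time`
};

// Return the largest recorded seqno whose associated time is <= t, i.e. a
// seqno known to have been written before (or at) time t. Returns 0 if no
// entry qualifies ("unknown, before all").
uint64_t GetProximalSeqnoBeforeTime(const std::vector<SeqnoTimePair>& mapping,
                                    uint64_t t) {
  // `mapping` is sorted by time (and by seqno, since both are monotonic).
  auto it = std::upper_bound(
      mapping.begin(), mapping.end(), t,
      [](uint64_t time, const SeqnoTimePair& p) { return time < p.time; });
  if (it == mapping.begin()) {
    return 0;  // everything recorded is newer than t
  }
  return std::prev(it)->seqno;
}
```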
Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work.
An apparent compaction bug is revealed in PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909.
Cosmetic / code safety things (not exhaustive):
* Fix some confusing names.
* `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name.
* Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing.
* Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions.
* Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind.
* (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.)
* A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`).
* Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905
Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others.
Reviewed By: jowlyzhang
Differential Revision: D49755592
Pulled By: pdillinger
fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51
Summary:
Users may run into an issue when running ldb on a db that's in a different version and has a different set of options: `Failed: Invalid argument: Could not find option: <MISSING_OPTION>`
They can work around this by setting `--ignore_unknown_options`, but the error message does not make it clear why the option is missing. It's also hard for users to find the `ignore_unknown_options` option, especially if they are not familiar with the codebase or the `ldb` tool.
This PR changes the error message to help users find out what's wrong and a possible workaround for the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11907
Test Plan:
Testing by reproducing the issue locally
```
❯./ldb --db=/data/users/jewoongh/db_crash_whitebox_T164195541/ get a
Failed: Invalid argument: Could not find option: : unknown_option_test
This tool was built with version 8.8.0. If your db is in a different version, please try again with option --ignore_unknown_options.
```
Reviewed By: jowlyzhang
Differential Revision: D49762291
Pulled By: jaykorean
fbshipit-source-id: 895570150fde886d5ec524908c4b2664c9230ac9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11906
The patch adds stress test coverage for the wide-column aware `FullMergeV3` API by implementing a new `DBStressWideMergeOperator`. This operator is similar to `PutOperator` / `PutOperatorV2` in the sense that its result is based on the last merge operand; however, the merge result can be either a plain value or a wide-column entity, depending on the value base encoded into the operand and the value of the `use_put_entity_one_in` stress test parameter. Following the same rule for merge results that we do for writes ensures that the queries issued by the validation logic receive the expected results. The new operator is used instead of `PutOperatorV2` whenever `use_put_entity_one_in` is positive. Note that the patch also makes it possible to set `use_put_entity_one_in` and `use_merge` (but not `use_full_merge_v1`) at the same time, giving `use_put_entity_one_in` precedence, so the stress test will use `PutEntity` for writes passing the `use_put_entity_one_in` check described above and `Merge` for any other writes.
Reviewed By: jaykorean
Differential Revision: D49760024
fbshipit-source-id: 3893602c3e7935381b484f4f5026f1983e3a04a9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11904
The tag is not needed; autodeps works fine with this file. It was added in D33962843, but the reason for doing so is no longer valid. We are in the process of migrating most, if not all, users to autodeps, and deprecating the noautodeps tag.
Changed the tag in the template and ran `python3 buckifier/buckify_rocksdb.py` to regenerate.
Reviewed By: jaykorean
Differential Revision: D49711337
fbshipit-source-id: c21892adfbc92e2ad868413746a0938062b6a543
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11896
The patch extends the test coverage of the wide column aware merge logic by adding two new tests that perform general transformations during merge by implementing the `FullMergeV3` interface. The first one uses a merge operator that produces a wide-column entity as its result in all cases (i.e. even if the base value is a plain key-value, or if there is no base value). The second one uses a merge operator that results in a plain key-value in all cases.
Reviewed By: jaykorean
Differential Revision: D49665946
fbshipit-source-id: 419b9e557c064525b659685eb8c09ae446656439
Summary:
In https://github.com/facebook/rocksdb/issues/11812, ```CacheWithSecondaryAdapter::Insert``` calls ```InsertSaved``` on the secondary cache to warm it up with the compressed blocks. This should only be done if it's a stacked cache with both a compressed and an nvm cache. If it's in-memory compressed only, then don't call ```InsertSaved```.
Tests:
Add a new unit test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11889
Reviewed By: akankshamahajan15
Differential Revision: D49615758
Pulled By: anand1976
fbshipit-source-id: 156ff968ad014ac319f8840da7a48193e4cebfa9
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11631 introduced an undesired fallback to RocksDB internal prefetching even when FS prefetching returns a non-OK status other than "Unsupported". We only want to fall back when FS prefetching is not supported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11897
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D49667055
Pulled By: hx235
fbshipit-source-id: fa36e4e5d6dc9507080217035f9d6ff8e4abda28
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11631 introduced a `readahead()` system call for compaction reads under non-direct IO. When `Options::compaction_readahead_size` is 0, the `readahead()` is issued with a small size (i.e., the block size, by default 4KB).
Benchmarks show that such a readahead() call regresses compaction reads compared with the "no readahead()" case (see Test Plan for more).
Therefore we decided not to issue such a `readahead()` when `Options::compaction_readahead_size` is 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11887
Test Plan:
Settings: `compaction_readahead_size = 0, use_direct_reads=false`
Setup:
```
TEST_TMPDIR=../ ./db_bench -benchmarks=filluniquerandom -disable_auto_compactions=true -write_buffer_size=1048576 -compression_type=none -value_size=10240 && tar -cf ../dbbench.tar -C ../dbbench/ .
```
Run:
```
for i in $(seq 3); do rm -rf ../dbbench/ && mkdir -p ../dbbench/ && tar -xf ../dbbench.tar -C ../dbbench/ . && sudo bash -c 'sync && echo 3 > /proc/sys/vm/drop_caches' && TEST_TMPDIR=../ /usr/bin/time ./db_bench_{pre_PR11631|PR11631|PR11631_with_improvementPR11887} -benchmarks=compact -use_existing_db=true -db=../dbbench/ -disable_auto_compactions=true -compression_type=none ; done |& grep elapsed
```
pre-PR11631("no readahead()" case):
PR11631:
PR11631+this improvement:
Reviewed By: ajkr
Differential Revision: D49607266
Pulled By: hx235
fbshipit-source-id: 2efa0dc91bac3c11cc2be057c53d894645f683ef
Summary:
Make the `RecoverFromRetryableBGIOError` function always set `recovery_in_prog_` to false when it returns.
Otherwise, in the code snippet below, when the db closes and the `error_handler_.CancelErrorRecovery()` call has successfully joined the recovery thread, the immediately following while loop will incorrectly think the error recovery is still in progress and loop in `bg_cv_.Wait()`.
1c871a4d86/db/db_impl/db_impl.cc (L542-L545)
This is the issue https://github.com/facebook/rocksdb/issues/11440
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11890
Reviewed By: anand1976
Differential Revision: D49624216
Pulled By: jowlyzhang
fbshipit-source-id: ee10cf6527d95b8dd4705a326eb6208d741fe002
Summary:
Added some util function APIs to facilitate using the U64Ts.
The U64Ts format for encoding a timestamp is not entirely RocksDB internal. When users use the user-defined timestamp feature from the transaction layer, its public APIs, including `SetCommitTimestamp`, `GetCommitTimestamp`, and `SetReadTimestampForValidation`, take and return timestamps as uint64_t. But if users want to use the APIs from the DB layer, including populating `ReadOptions.timestamp` and interpreting `Iterator::timestamp()`, these APIs use and return U64Ts timestamps as an encoded Slice. These util functions are added to facilitate that usage.
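To illustrate the mismatch these helpers bridge, here is a sketch; `EncodeU64TsToSlice` is a hypothetical stand-in for the new util functions, and the fixed-width little-endian layout is only an assumption for illustration:
```
#include <cstdint>
#include <string>

#include "rocksdb/db.h"

// Hypothetical stand-in for the new util APIs: pack a uint64_t timestamp into
// the encoded form the DB-layer APIs expect. The layout here is an assumption
// for illustration; prefer the real helpers.
rocksdb::Slice EncodeU64TsToSlice(uint64_t ts, std::string* buf) {
  buf->assign(reinterpret_cast<const char*>(&ts), sizeof(ts));  // assumes little-endian host
  return rocksdb::Slice(*buf);
}

void ReadAtTimestamp(rocksdb::DB* db, uint64_t commit_ts) {
  // Transaction-layer APIs (SetCommitTimestamp, etc.) take uint64_t directly,
  // but the DB layer wants the encoded Slice form.
  std::string ts_buf;
  rocksdb::Slice ts = EncodeU64TsToSlice(commit_ts, &ts_buf);

  rocksdb::ReadOptions read_options;
  read_options.timestamp = &ts;

  std::string value;
  rocksdb::Status s = db->Get(read_options, "key", &value);
  s.PermitUncheckedError();
}
```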
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11888
Reviewed By: ltamasi
Differential Revision: D49620709
Pulled By: jowlyzhang
fbshipit-source-id: ace8d782ee7c3372cf410abf761320d373e495e1
Summary:
https://github.com/facebook/rocksdb/issues/11872 causes a unit test to start failing with the error message below. The cause is that the additional call to `FlushAllColumnFamilies()` in `DBImpl::ResumeImpl()` can run while DB is closing. More detailed explanation: there are two places where we call `ResumeImpl()`:
1. in `ErrorHandler::RecoverFromBGError`, for manual resume or recovery from errors like OutOfSpace through sst file manager, and
2. in `Errorhandler::RecoverFromRetryableBGIOError`, for error recovery from errors like flush failure due to retryable IOError. This is tracked by `ErrorHandler::recovery_thread_`.
Here is how DB close waits for error recovery: 49da91ec09/db/db_impl/db_impl.cc (L540-L543)
`CancelErrorRecovery()` waits until `recovery_thread_` finishes and `IsRecoveryInProgress()` checks the `recovery_in_prog_` flag. The additional call to `FlushAllColumnFamilies()` in `ResumeImpl()` happens after it clears the bg error and the `recovery_in_prog_` flag: 49da91ec09/db/db_impl/db_impl.cc (L436-L463). So if `ResumeImpl()` is called in `RecoverFromBGError()`, we can have a thread running `FlushAllColumnFamilies()` while the DB is closing and has already concluded that recovery is done.
The fix is to only do the additional call to `FlushAllColumnFamilies()` when doing error recovery through `Errorhandler::RecoverFromRetryableBGIOError` by setting flags in `DBRecoverContext`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11880
Test Plan:
`gtest-parallel --repeat=100 --workers=4 ./error_handler_fs_test --gtest_filter="*AutoRecoverFlushError*"` reproduces the error pretty reliably.
```[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN ] DBErrorHandlingFSTest.AutoRecoverFlushError
error_handler_fs_test: db/column_family.cc:1618: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
Received signal 6 (Aborted)
...
#10 0x00007fac4409efd6 in __GI___assert_fail (assertion=0x7fac452c0afa "last_ref", file=0x7fac452c9fb5 "db/column_family.cc", line=1618, function=0x7fac452cb950 "rocksdb::ColumnFamilySet::~ColumnFamilySet()") at assert.c:101
101 in assert.c
#11 0x00007fac44b5324f in rocksdb::ColumnFamilySet::~ColumnFamilySet (this=0x7b5400000000) at db/column_family.cc:1618
1618 assert(last_ref);
#12 0x00007fac44e0f047 in std::default_delete<rocksdb::ColumnFamilySet>::operator() (this=0x7b5800000940, __ptr=0x7b5400000000) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85 delete __ptr;
#13 std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182 _M_deleter()(__old_p);
#14 std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456 _M_t.reset(std::move(__p));
#15 rocksdb::VersionSet::~VersionSet (this=this@entry=0x7b5800000900) at db/version_set.cc:5081
5081 column_family_set_.reset();
#16 0x00007fac44e0f97a in rocksdb::VersionSet::~VersionSet (this=0x7b5800000900) at db/version_set.cc:5078
5078 VersionSet::~VersionSet() {
#17 0x00007fac44bf0b2f in std::default_delete<rocksdb::VersionSet>::operator() (this=0x7b8c00000068, __ptr=0x7b5800000900) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85 delete __ptr;
#18 std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182 _M_deleter()(__old_p);
#19 std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456 _M_t.reset(std::move(__p));
#20 rocksdb::DBImpl::CloseHelper (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:676
676 versions_.reset();
#21 0x00007fac44bf1346 in rocksdb::DBImpl::CloseImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:720
720 Status DBImpl::CloseImpl() { return CloseHelper(); }
#22 rocksdb::DBImpl::~DBImpl (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:738
738 closing_status_ = CloseImpl();
#23 0x00007fac44bf2bba in rocksdb::DBImpl::~DBImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:722
722 DBImpl::~DBImpl() {
#24 0x00007fac455444d4 in rocksdb::DBTestBase::Close (this=this@entry=0x7b6c00000000) at db/db_test_util.cc:678
678 delete db_;
#25 0x00007fac455455fb in rocksdb::DBTestBase::TryReopen (this=this@entry=0x7b6c00000000, options=...) at db/db_test_util.cc:707
707 Close();
#26 0x00007fac45543459 in rocksdb::DBTestBase::Reopen (this=0x7ffed74b79a0, options=...) at db/db_test_util.cc:670
670 ASSERT_OK(TryReopen(options));
#27 0x00000000004f2522 in rocksdb::DBErrorHandlingFSTest_AutoRecoverFlushError_Test::TestBody (this=this@entry=0x7b6c00000000) at db/error_handler_fs_test.cc:1224
1224 Reopen(options);
```
Reviewed By: jowlyzhang
Differential Revision: D49579701
Pulled By: cbi42
fbshipit-source-id: 3fc8325e6dde7e7faa8bcad95060cb4e26eda638
Summary:
Crash tests are failing with the recent auto_readahead_size change. Disable it in stress tests for now, and re-enable it together with a fix to clear the crash test failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11883
Reviewed By: pdillinger
Differential Revision: D49597854
Pulled By: akankshamahajan15
fbshipit-source-id: 0af8ca7414ee9b92f244ee0fb811579c3c052b41
Summary:
Implement block cache lookup to determine readahead_size during scans. It's enabled only if auto_readahead_size, block_cache, and iterate_upper_bound are all set.
Design -
1. Whenever there is a cache miss and FilePrefetchBuffer is called, a callback is made to determine readahead_size for that prefetching.
2. The callback iterates over the index and does a block cache lookup for each data block handle until the existing readahead_size is reached. Then it removes the cache-hit data blocks from the end to calculate an optimized readahead_size.
3. Since index_iter_ is moved, it stores block handles in a queue and uses that queue to get block handles instead of doing index_iter_->Next().
4. This is for sync scans. Async scan support is in progress.
NOTE:
The issue right now is that after Seek and Next, if Prev is called, there is no way to do the Prev operation. index_iter_ is already pointing to a different block, so it returns "Not supported" in that case with the error message - "auto tuning of readahead size is not supported with Prev op".
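For reference, a sketch of the configuration under which the new lookup applies (illustrative; cache size, keys, and bounds are arbitrary):
```
#include <memory>

#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/table.h"

// A block cache must be configured at Open time for the lookup to have
// anything to consult.
rocksdb::Options MakeOptionsWithBlockCache() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(256 << 20);  // 256 MiB
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}

void ScanWithAutoReadaheadSize(rocksdb::DB* db) {
  rocksdb::Slice upper_bound("key9999");
  rocksdb::ReadOptions read_options;
  read_options.auto_readahead_size = true;        // opt in to the tuning
  read_options.iterate_upper_bound = &upper_bound;

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->Seek("key0000"); it->Valid(); it->Next()) {
    // On cache misses, the prefetch size is trimmed based on which upcoming
    // data blocks are already present in the block cache.
  }
  it->status().PermitUncheckedError();
}
```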
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11860
Test Plan:
- Added new unit test
- crash_tests
- Running scans locally to check for any regression
Reviewed By: anand1976
Differential Revision: D49548118
Pulled By: akankshamahajan15
fbshipit-source-id: f1aee409a71b4ad9e5bf3610f43edf30c6630c78
Summary:
Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity.
With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.
Tests:
New unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873
Reviewed By: akankshamahajan15
Differential Revision: D49562250
Pulled By: anand1976
fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
Summary:
Provide an override implementation of `Iterator::timestamp` API for `BaseDeltaIterator` so that timestamp read from DB can be surfaced by an iterator created from inside of a transaction.
The behavior of the API follows this rule:
1) If the entry is read from within the transaction, an empty `Slice` is returned as the timestamp, regardless of whether `Transaction::SetCommitTimestamp` is called.
2) If the entry is read from the DB, the corresponding `DBIter::timestamp()` API's result is returned.
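A sketch of what these rules mean for callers (illustrative; assumes a column family with user-defined timestamps enabled):
```
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/utilities/transaction.h"

void InspectTimestamps(rocksdb::Transaction* txn,
                       rocksdb::ColumnFamilyHandle* cfh,
                       const rocksdb::Slice& read_ts) {
  rocksdb::ReadOptions read_options;
  read_options.timestamp = &read_ts;

  std::unique_ptr<rocksdb::Iterator> it(txn->GetIterator(read_options, cfh));
  for (it->Seek(""); it->Valid(); it->Next()) {
    rocksdb::Slice ts = it->timestamp();
    if (ts.empty()) {
      // Rule 1: the entry comes from the transaction's own write batch.
    } else {
      // Rule 2: the entry comes from the DB; `ts` is the timestamp read from it.
    }
  }
  it->status().PermitUncheckedError();
}
```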
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11847
Test Plan:
make all check
add some unit test
Reviewed By: ltamasi
Differential Revision: D49377359
Pulled By: jowlyzhang
fbshipit-source-id: 1511ead262ce3515ee6c6e0f829f1b69a10fe994
Summary:
With atomic_flush=true, a flush job with younger memtables waits for older memtables to be installed before installing its own memtables. If the flush for the older memtables failed, auto-recovery starts a resume thread which can become stuck waiting for all background work to finish (including the flush for the younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to the background error but never roll back its memtable: 269478ee46/db/db_impl/db_impl_compaction_flush.cc (L725) This prevents any future flush from picking up the old memtables.
A more detailed repro is in unit test.
This PR fixes this issue by
1. Ensure we rollback memtables if an atomic flush fails due to background error
2. When there is a background error, abort atomic flushes that are waiting for older memtables to be installed
3. Do not schedule non-recovery flushes when there is a background error that stops background work
There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11872
Test Plan: new unit test.
Reviewed By: jowlyzhang
Differential Revision: D49556867
Pulled By: cbi42
fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f
Summary:
Example crash seen in crash test:
```
db_stress: cache/clock_cache.cc:237: bool rocksdb::clock_cache::{anonymous}::BeginSlotInsert(const rocksdb::clock_cache::ClockHandleBasicData&, rocksdb::clock_cache::ClockHandle&, uint64_t, bool*): Assertion `*already_matches == false' failed.
```
I was intentionally ignoring `already_matches` without resetting it to false for the next call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11877
Test Plan:
Reproducer no longer reproduces:
```
while ./cache_bench -cache_type=auto_hyper_clock_cache -threads=32 -populate_cache=0 -histograms=0 -report_problems -insert_percent=87 -lookup_insert_percent=2 -skew=10 -ops_per_thread=100 -cache_size=1000000; do echo hi; done
```
Reviewed By: cbi42
Differential Revision: D49562065
Pulled By: pdillinger
fbshipit-source-id: 941062e6eac7a4b56157925b1cf2a0b15ff9cc9d
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11870
Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want to limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`.
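A sketch of how an application might use the new threshold (the `ReadOptions` field name and status accessor below are assumptions based on this description, not verified names):
```
#include <string>

#include "rocksdb/db.h"

// Illustration only: the field and status accessor names are assumed.
bool KeyHasTooManyMergeOperands(rocksdb::DB* db, const rocksdb::Slice& key) {
  rocksdb::ReadOptions read_options;
  // Hypothetical name for the new per-query threshold on merge operands.
  read_options.merge_operand_count_threshold = 16;

  std::string value;
  rocksdb::Status s = db->Get(read_options, key, &value);
  // The special "OK but threshold exceeded" status tells the application the
  // key had more than 16 operands; it might then compact or rewrite the key
  // to collapse its deltas.
  return s.ok() && s.IsOkMergeOperandThresholdExceeded();
}
```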
Reviewed By: jaykorean
Differential Revision: D49522786
fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c
Summary:
## The Problem
Mark Callaghan found a performance bug in yet-unreleased AutoHCC (which should have been found in my own testing). The observed behavior is very slow insertion performance as the table is growing into a very large structure. The root cause is the precarious combination of linear hashing (indexing into the table while allowing growth) and linear probing (for finding an empty slot to insert into). Naively combined, this is a disaster because in linear hashing, part of the table is twice as dense (as a first probing location) as the rest. Thus, even a modest load factor like 0.6 could cause the dense part of the table to degrade to linear search. The code had a correction for this imbalance, which works in steady-state operation, but failed to account for the concentrating effect of table growth. Specifically, newly-added slots were underpopulated, which allowed old slots to become over-populated and degrade to linear search, even in single-threaded operation. Here's an example:
```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=1 -populate_cache=0 -value_bytes=500 -cache_size=3000000000 -histograms=0 -report_problems -ops_per_thread=20000000 -resident_ratio=0.6
```
AutoHCC: Complete in 774.213 s; Rough parallel ops/sec = 25832
FixedHCC: Complete in 19.630 s; Rough parallel ops/sec = 1018840
LRUCache: Complete in 25.842 s; Rough parallel ops/sec = 773947
## The Fix
One small change is apparently sufficient to fix the problem, but I wanted to re-optimize the whole "finding a good empty slot" algorithm to improve safety margins for good performance and to improve typical case performance.
The small change is to track the newly-added slot from Grow in Insert, when applicable, and use that slot for insertion if (a) the home slot is already occupied, and (b) the newly-added slot is empty. This appears to sufficiently load new slots while avoiding over-population of either old or new slots. See `likely_empty_slot`.
However I've also made the logic much more resilient to parts of the table becoming over-populated. I tested a variant that used double hashing instead of linear probing and found that hurt steady-state average-case performance, presumably due to loss of locality in the chains. And even conventional double hashing might not be ideally robust against density skew in the table (still present because of home location bias), because double hashing might choose a small increment that could take a long time to iterate to the under-populated part of the table.
The compromise that seems to bring the best of each approach is this: do linear probing (+1 at a time) within a small bound (chosen bound of 4 based on performance testing) and then fall back on a double-hashing variant if no slot has been found. The double-hashing variant uses a probing increment that is always close to the golden ratio, relative to the table size, so that any under-populated regions of the table can be found relatively quickly, without introducing any additional skew. And the increment is varied slightly to avoid clustering effects that could happen with a fixed increment (regardless of how big it is).
And that leaves us with one remaining problem: the double hashing increment might not be relatively prime to the table size, so the probing sequence might be a cycle that does not cover the full set of slots. To solve this we can use a technique I developed many years ago (probably also developed by others) that simply adds one (in modular arithmetic) whenever we finish a (potentially incomplete) cycle. This is a simple and reasonably efficient way to iterate over all the slots without repetition, regardless of whether the increment is not relatively prime to the table size, or even zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11871
Test Plan:
existing correctness tests, especially ClockCacheTest.ClockTableFull
Intended follow-up: make ClockTableFull test more complete for AutoHCC
## Performance
Ignoring old AutoHCC performance, as we established above it could be terrible. FixedHCC and LRUCache are unaffected by this change. All tests below include this change.
### Getting up to size, single thread
(same cache_bench command as above, all three run at same time)
AutoHCC: Complete in 26.724 s; Rough parallel ops/sec = 748400
FixedHCC: Complete in 19.987 s; Rough parallel ops/sec = 1000631
LRUCache: Complete in 28.291 s; Rough parallel ops/sec = 706939
Single-threaded faster than LRUCache (often / sometimes) is good. FixedHCC has an obvious advantage because it starts at full size.
### Multiple threads, steady state, high hit rate ~95%
Using `-threads=10 -populate_cache=1 -ops_per_thread=10000000` and still `-resident_ratio=0.6`
AutoHCC: Complete in 48.778 s; Rough parallel ops/sec = 2050119
FixedHCC: Complete in 46.569 s; Rough parallel ops/sec = 2147329
LRUCache: Complete in 50.537 s; Rough parallel ops/sec = 1978735
### Multiple threads, steady state, low hit rate ~50%
Change to `-resident_ratio=0.2`
AutoHCC: Complete in 49.264 s; Rough parallel ops/sec = 2029884
FixedHCC: Complete in 49.750 s; Rough parallel ops/sec = 2010041
LRUCache: Complete in 53.002 s; Rough parallel ops/sec = 1886713
Don't expect AutoHCC to be consistently faster than FixedHCC, but they are at least similar in these benchmarks.
Reviewed By: jowlyzhang
Differential Revision: D49548534
Pulled By: pdillinger
fbshipit-source-id: 263e4f4d71d0e9a7d91db3795b48fad75408822b
Summary:
This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.
The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e. admit on second access). Blocks/items evicted from any tier are simply discarded.
We can easily implement additional admission policies if desired.
Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812
Reviewed By: pdillinger
Differential Revision: D49461842
Pulled By: anand1976
fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only roll back the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished regardless of the flush status: f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)
This PR fixes the issue by rolling back these pending flushes.
There is another issue where, if a new flush for a new memtable starts and finishes after Flush0 finishes, its output may also be deleted (see more in the unit test). It is fixed by checking the bg error status before installing a memtable result, and rolling back if there is an error.
There is a more efficient fix where we just don't release the pending output file number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (it requires associating such pending file numbers with flush jobs/memtables) and is more risky since it changes the normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
Summary:
To fix off-by-one error: Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001.
Fixes https://github.com/facebook/rocksdb/issues/11822
I think it was introduced in a657ee9a9c
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11861
Reviewed By: pdillinger
Differential Revision: D49457273
Pulled By: ajkr
fbshipit-source-id: b527cbae4ecc7874633a11f07027adee62940d74
Summary:
This PR makes disabling the compressed secondary cache by setting capacity to 0 a bit more efficient. Previously, inserts/lookups would go to the backing LRUCache before getting rejected due to 0 capacity. With this change, insert/lookup would return from ```CompressedSecondaryCache``` itself.
Tests:
Existing tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11863
Reviewed By: akankshamahajan15
Differential Revision: D49476248
Pulled By: anand1976
fbshipit-source-id: f0f17a5e3df7d8bfc06709f8f23c1302056ba590
Summary:
When auto_readahead_size is enabled in async_io, during seek, the first buffer will prefetch the data - (current block + readahead till upper_bound). There can be cases where
1. the first buffer prefetched all the data till the upper bound, or
2. the first buffer already has the data from a prev seek call
and the second buffer prefetches further, leading to alignment issues.
This PR fixes that assertion, and the second buffer won't go for prefetching if the first buffer has already prefetched till upper_bound.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11852
Test Plan:
- Added new unit test that failed without this fix.
- crash tests passed locally
Reviewed By: pdillinger
Differential Revision: D49384138
Pulled By: akankshamahajan15
fbshipit-source-id: 54417e909e4d986f1e5a17dbaea059cd4962fd4d
Summary:
**Context:**
As requested, lowest level as well as a map from input file to its table properties among all input files used in table creation (if any) are exposed in `CompactionFilter::Context`.
**Summary:**
This PR contains two commits:
(1) [Refactoring](0012777f0e) to make reasoning about/using what is in `Compaction::table_properties_` easier
- Separate `Compaction::table_properties_` into `Compaction::input_table_properties_` and `Compaction::output_table_properties_`
- Separate the "set input table properties" logic into `Compaction::SetInputTableProperties()` from `Compaction::GetInputTableProperties()`
- Call `Compaction::SetInputTableProperties()` as soon as possible, which is right after `Compaction::SetInputVersion()`. Bundle these two functions into one `Compaction::FinalizeInputInfo()` to minimize missing one or the other
(2) [Expose more info about input files:](6093e7dfba) `CompactionFilter::Context::input_start_level/input_table_properties`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11857
Test Plan:
- Modify existing UT `TEST_F(DBTestCompactionFilter, CompactionFilterContextManual)` to cover the new logic
Reviewed By: ajkr
Differential Revision: D49402540
Pulled By: hx235
fbshipit-source-id: 469fff50fa0e5964ffa5ea8db0743f61438ea392
Summary:
**Summary:**
When the row cache hits and a timestamp is set in read_options, even though the ROW_CACHE entry is hit, the return status is kNotFound.
**Cause of error:**
If a timestamp is provided in ReadOptions, a callback for sequence number checking is registered [here](8fc78a3a9e/db/db_impl/db_impl.cc (L2112)).
Hence the default value set at this [line](694e49cbb1/table/get_context.cc (L611)) prevents get_context from saving the value found in the cache, causing the final status to be kNotFound even though the entry exists in both the cache and the SST file.
**Proposed Solution**
The row cache key contains a sequence number. If the key for the row cache lookup matches the key in the cache, this cache entry should be good to expose to the user, and hence we reuse the sequence number in the cache key rather than passing kMaxSequenceNumber.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11816
Reviewed By: ajkr
Differential Revision: D49419029
Pulled By: jowlyzhang
fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6