rocksdb

Commit Graph

Author	SHA1	Message	Date
Yu Zhang	c73cf7a878	Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760 ) Summary: This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier. A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following: 1) Setting the ratio to be equal to or smaller than 0 disables this collector 2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met. 3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction. For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D58734976 Pulled By: jowlyzhang fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81	2024-06-18 10:51:29 -07:00
Jonah Gao	9f95aa8269	`GetAggregatedIntProperty` accumulates property once per block cache (#12755 ) Summary: Fix issue https://github.com/facebook/rocksdb/issues/12687. A block cache may be shared by multiple column families. Therefore, when getting the aggregated property of the block cache, we need to deduplicate by instances of the block cache, meaning the same instance should only be counted once. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12755 Reviewed By: jowlyzhang Differential Revision: D58508819 Pulled By: ajkr fbshipit-source-id: 3b746841d7eac59f900387ec3b8c19dbcd20aae4	2024-06-18 10:46:55 -07:00
Peter Dillinger	3758e31f3f	Fix rare failure in DBBlockCacheTypeTest.Uncache (#12775 ) Summary: Following up on https://github.com/facebook/rocksdb/issues/12748 after seeing recurrence in https://github.com/facebook/rocksdb/actions/runs/9522985253/job/26253605587?pr=12774 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12775 Test Plan: Was able to reproduce failure and verify fix this time using COERCE_CONTEXT_SWITCH=1 :) Reviewed By: jowlyzhang Differential Revision: D58623461 Pulled By: pdillinger fbshipit-source-id: d93a5e6a4977675eac54bbd42e70ae7b29b950a4	2024-06-14 20:50:36 -07:00
Jay Huh	0ab60b8a8c	MultiCfIterator - Handle case of invalid key from child iter manual prefix iteration (#12773 ) Summary: Instead of completely disallowing `MultiCfIterator` when one or more child iterators will do manual prefix iteration (as suggested in https://github.com/facebook/rocksdb/issues/12770 ), just let `MultiCfIterator` operate as is even when there's a possibility of undefined result from child iterators. If one or more child iterators cause the heap to be empty, just return early and `Valid()` will return false. It is still possible that heap is not empty when one or more child iterators are returning wrong keys. Basically, MultiCfIterator behaves the same as what we described in https://github.com/facebook/rocksdb/wiki/Prefix-Seek#manual-prefix-iterating - "RocksDB will not return error when it is misused and the iterating result will be undefined." Pull Request resolved: https://github.com/facebook/rocksdb/pull/12773 Test Plan: MultiCfIterator added back to the stress test ``` python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2 ``` Reviewed By: cbi42 Differential Revision: D58612055 Pulled By: jaykorean fbshipit-source-id: e0dd942bed98382c59d463412dd8f163e6790b93	2024-06-14 15:59:17 -07:00
Yu Zhang	f5e44f3490	Fix manual flush hanging on waiting for no stall for UDT in memtable … (#12771 ) Summary: This PR fix a possible manual flush hanging scenario because of its expectation that others will clear out excessive memtables was not met. The root cause is the FlushRequest rescheduling logic is using a stricter criteria for what a write stall is about to happen means than `WaitUntilFlushWouldNotStallWrites` does. Currently, the former thinks a write stall is about to happen when the last memtable is half full, and it will instead reschedule queued FlushRequest and not actually proceed with the flush. While the latter thinks if we already start to use the last memtable, we should wait until some other background flush jobs clear out some memtables before proceed this manual flush. If we make them use the same criteria, we can guarantee that at any time when`WaitUntilFlushWouldNotStallWrites` is waiting, it's not because the rescheduling logic is holding it back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12771 Test Plan: Added unit test Reviewed By: ajkr Differential Revision: D58603746 Pulled By: jowlyzhang fbshipit-source-id: 9fa1c87c0175d47a40f584dfb1b497baa576755b	2024-06-14 13:37:37 -07:00
Yu Zhang	13c758f986	Change the behavior of manual flush to not retain UDT (#12737 ) Summary: When user-defined timestamps in Memtable only feature is enabled, all scheduled flushes go through a check to see if it's eligible to be rescheduled to retain user-defined timestamps. However when the user makes a manual flush request, their intention is for all the in memory data to be persisted into SST files as soon as possible. These two sides have some conflict of interest, the user can implement some workaround like https://github.com/facebook/rocksdb/issues/12631 to explicitly mark which one takes precedence. The implementation for this can be nuanced since the user needs to be aware of all the scenarios that can trigger a manual flush and handle the concurrency well etc. In this PR, we updated the default behavior to give manual flush precedence when it's requested. The user-defined timestamps rescheduling mechanism is turned off when a manual flush is requested. Likewise, all error recovery triggered flushes skips the rescheduling mechanism too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12737 Test Plan: Add unit tests Reviewed By: ajkr Differential Revision: D58538246 Pulled By: jowlyzhang fbshipit-source-id: 0b9b3d1af3e8d882f2d6a2406adda19324ba0694	2024-06-13 13:18:10 -07:00
Peter Dillinger	0646ec6e2d	Ensure Close() before LinkFile() for WALs in Checkpoint (#12734 ) Summary: POSIX semantics for LinkFile (hard links) allow linking a file that is still being written two, with both the source and destination showing any subsequent writes to the source. This may not be practical semantics for some FileSystem implementations such as remote storage. They might only link the flushed or sync-ed file contents at time of LinkFile, or might even have undefined behavior if LinkFile is called on a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731 to bring more hygiene to our handling of WAL files in Checkpoint. Specifically, we now Close WAL files as soon as they are either (a) inactive and fully synced, or (b) inactive and obsolete (so maybe never fully synced), rather than letting Close() happen in handling obsolete files (maybe a background thread). This should not be a performance issue as Close() should be trivial cost relative to other IO ops, but just in case: * We don't Close() while holding a mutex, to avoid blocking, and * The old behavior is available with a new kill switch option `background_close_inactive_wals`. Stacked on https://github.com/facebook/rocksdb/issues/12731 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734 Test Plan: Extended existing unit test, especially adding a hygiene check to FaultInjectionTestFS to detect LinkFile() on a file still open for writes. FaultInjectionTestFS already has relevant tracking data, and tests can opt out of the new check, as in a smoke test I have left for the old, deprecated functionality `background_close_inactive_wals=true`. Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK with the crash test. (The only place I can find we use LinkFile in production is Checkpoint.) Reviewed By: cbi42 Differential Revision: D58295284 Pulled By: pdillinger fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6	2024-06-12 11:48:45 -07:00
Peter Dillinger	21eb82ebec	Disable "uncache" behavior in DB shutdown (#12751 ) Summary: Crash test showed a potential use-after-free where a file marked as obsolete and eligible for uncache on destruction is destroyed in the VersionSet destructor, which only happens as part of DB shutdown. At that point, the in-memory column families have already been destroyed, so attempting to uncache could use-after-free on stuff like getting the `user_comparator()` from the `internal_comparator()`. I attempted to make it smarter, but wasn't able to untangle the destruction dependencies in a way that was safe, understandable, and maintainable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12751 Test Plan: Reproduced by adding uncache_aggressiveness to an existing (but otherwise unrelated) test. This makes it a fair regression test. Also added testing to ensure that trivial moves and DB close & reopen are well behaved with uncache_aggressiveness. Specifically, this issue doesn't seem to be because things are uncached inappropriately in those cases. Reviewed By: ltamasi Differential Revision: D58390058 Pulled By: pdillinger fbshipit-source-id: 66ac9cb13bf02638fa80ee5b7218153d8bc7cfd3	2024-06-11 15:57:40 -07:00
Evan Jones	af50823069	c.h: Add set_track_and_verify_wals_in_manifest to C API (#12749 ) Summary: This option is recommended to be set for production use: We recommend to set track_and_verify_wals_in_manifest to true for production https://github.com/facebook/rocksdb/wiki/Track-WAL-in-MANIFEST This adds this setting to the C API, so it can be used by other languages. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12749 Reviewed By: ltamasi Differential Revision: D58382892 Pulled By: ajkr fbshipit-source-id: 885de4539745a3119b6b2a162ab4fca9fa975283	2024-06-10 16:26:52 -07:00
Peter Dillinger	68112b3beb	Attempt fix rare failure in DBBlockCacheTypeTest.Uncache (#12748 ) Summary: I haven't been able to reproduce the failure, seen in https://github.com/facebook/rocksdb/actions/runs/9420830905/job/25953696902?pr=12734 ``` [ RUN ] DBBlockCacheTypeTestInstance/DBBlockCacheTypeTest.Uncache/2 db/db_block_cache_test.cc:1415: Failure Expected equality of these values: cache->GetOccupancyCount() Which is: 37 kBaselineCount + kNumDataBlocks + meta_blocks_per_file Which is: 15 Google Test trace: db/db_block_cache_test.cc:1346: ua=10000 db/db_block_cache_test.cc:1344: partitioned=1 db/db_block_cache_test.cc:1418: Failure ... ``` But it's consistent with a SuperVersion reference sticking around beyond the CompactRange, as I can reproduce the result with a dangling Iterator. Like some other tests have had trouble with periodic stats popping up randomly, I suspect that could be the explanation in this case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12748 Test Plan: Watch for similar future failures Reviewed By: ltamasi Differential Revision: D58366031 Pulled By: pdillinger fbshipit-source-id: b812ca8837b8c8b9cbda1b201d76316d145fa3ec	2024-06-10 13:31:46 -07:00
Evan Jones	32e6825bc6	c.h: Add GetDbIdentity, Options::write_dbid_to_manifest (#12736 ) Summary: The write_dbid_to_manifest option is documented as "We recommend setting this flag to true". However, there is no way to set this flag from the C API. Add the following functions to the C API: * rocksdb_get_db_identity * rocksdb_options_get_write_dbid_to_manifest * rocksdb_options_set_write_dbid_to_manifest Add a test that this option preserves the ID across checkpoints. c.cc: * Remove outdated comments about missing C API functions that exist. * Document that CopyString is intended for binary data and is not NUL terminated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12736 Reviewed By: ltamasi Differential Revision: D58202117 Pulled By: ajkr fbshipit-source-id: 707b110df5c4bd118d65548327428a53a9dc3019	2024-06-07 16:53:43 -07:00
Peter Dillinger	b34cef57b7	Support pro-actively erasing obsolete block cache entries (#12694 ) Summary: Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries. Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive. This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on. The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizeable by CF. Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths. Notable auxiliary code details: * Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce an current failure.) * Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item) Marked EXPERIMENTAL until more thorough validation is complete. Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694 Test Plan: * Unit tests added, including refactoring an existing test to make better use of parameterized tests. * Added to crash test. * Performance, sample command: ``` for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 \| grep -E 'micros/op\|rocksdb.block.cache.data.(hit\|miss)\|rocksdb.number.keys.(read\|written)\|maxresident' \| awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' \| tee -a results-$CT-$UA; done; done; done ``` Averaging 10 runs each case, block cache data block hit rates ``` lru_cache UA=0 -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0 UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0 fixed_hyper_clock_cache UA=0 -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9 UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2 auto_hyper_clock_cache UA=0 -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5 UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8 ``` Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings). Reviewed By: ajkr Differential Revision: D57932442 Pulled By: pdillinger fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c	2024-06-07 08:57:11 -07:00
Yu Zhang	44aceb88d0	Add a OnManualFlushScheduled callback in event listener (#12631 ) Summary: As titled. Also added the newest user-defined timestamp into the `MemTableInfo`. This can be a useful info in the callback. Added some unit tests as examples for how users can use two separate approaches to allow manual flush / manual compactions to go through when the user-defined timestamps in memtable only feature is enabled. One approach relies on selectively increase cutoff timestamp in `OnMemtableSeal` callback when it's initiated by a manual flush. Another approach is to increase cutoff timestamp in `OnManualFlushScheduled` callback. The caveats of the approaches are also documented in the unit test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12631 Reviewed By: ajkr Differential Revision: D58260528 Pulled By: jowlyzhang fbshipit-source-id: bf446d7140affdf124744095e0a179fa6e427532	2024-06-06 17:29:01 -07:00
Hui Xiao	390fc55ba1	Revert PR 12684 and 12556 (#12738 ) Summary: Context/Summary: a better API design is decided lately so we decided to revert these two changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12738 Test Plan: - CI Reviewed By: ajkr Differential Revision: D58162165 Pulled By: hx235 fbshipit-source-id: 9bbe4d2fe9fbe39213f4cf137a2d419e6ffb8e16	2024-06-06 11:46:16 -07:00
Peter Dillinger	98393f0139	Fix Checkpoint hard link of inactive but unsynced WAL (#12731 ) Summary: Background: there is one active WAL file but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the `logs_` list not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint, as although it might still be open for write, we are not appending any more data to it. The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083, it would synced if using track_and_verify_wals_in_manifest=true.) The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically: * Detect which WALs might not be fully synced, according to our DBImpl metadata, and prevent hard linking those (with `trim_to_size=true` from `GetLiveFilesStorageInfo()`. And while we're at it, use our known flushed sizes for those WALs. * To avoid a race between that and GetSortedWalFiles(), track a maximum needed WAL number for the Checkpoint/GetLiveFilesStorageInfo. * Because of the level of consistency provided by those two, we no longer need to consider syncing as part of the FlushWAL in GetLiveFilesStorageInfo. (We determine the max WAL number consistent with the manifest file size, while holding DB mutex. Should make track_and_verify_wals_in_manifest happy.) This makes the premise of test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback no longer hit) so the test is removed, with crash test as backstop for related issues. See https://github.com/facebook/rocksdb/issues/10185 Stacked on https://github.com/facebook/rocksdb/issues/12729 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731 Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency. Reviewed By: cbi42 Differential Revision: D58199629 Pulled By: pdillinger fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382	2024-06-05 17:40:09 -07:00
Peter Dillinger	9f4c597d83	FaultInjectionTestFS read unsynced data by default (#12729 ) Summary: In places (e.g. GetSortedWals()) RocksDB relies on querying the file size or even reading the contents of files currently open for writing, and as in POSIX semantics, expects to see the flushed size and contents regardless of what has been synced. FaultInjectionTestFS historically did not emulate this behavior, only showing synced data from such read operations. (Different from FaultInjectionTestEnv--sigh.) This change makes the "proper" behavior the default behavior, at least for GetFileSize and FSSequentialFile. However, this new functionality is disabled in db_stress because of undiagnosed, unresolved issues. Also removes unused and confusing field `pos_at_last_flush_` This change is needed to support testing a relevant bug fix (in a follow-up diff). Other suggested follow-up: * Fix db_stress not to rely on the old behavior, and fix a related FIXME in db_stress_test_base.cc in LockWAL testing. * Fill in some corner cases in the FileSystem API for reading unsynced data (see new TODO items). * Consider deprecating and removing Flush() API functions from FileSystem APIs. It is not clear to me that there is a supported scenario in which they do anything but confuse API users and developers. If there is a use for them, it doesn't appear to be tested. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12729 Test Plan: applies to all unit tests successfully, just updating the unit test from https://github.com/facebook/rocksdb/issues/12556 due to relying on the errant behavior. Also added a specific unit test Reviewed By: hx235 Differential Revision: D58091835 Pulled By: pdillinger fbshipit-source-id: f47a63b2b000f5875b6293a98577bff663d7fd33	2024-06-04 15:25:23 -07:00
Yu Zhang	8523f0a86a	Use extended file boundary for key range overlap check during file ingestion (#12735 ) Summary: When https://github.com/facebook/rocksdb/issues/12343 added support to bulk load external files while column family enables user-defined timestamps, it's a requirement that the external file doesn't overlap with the DB in key ranges. More specifically, the external file should not contain a user key (without timestamp) that already have some entries in the DB. All the `Overlap` functions like `RangeOverlapWithMemtable`, `RangeOverlapWithCompaction` are using `CompareWithoutTimestamp` to check for overlap already. One thing that is missing here is we need to extend the external file's user key boundary for this check to avoid missing the checks for the boundary user keys. For example, with the current way of checking things where `external_file_info.smallest.user_key()` is used as the left boundary, and `external_file_info.largest.user_key()` is used as the right boundary, a file with this entry: (b, 40) can fit into a DB with these two entries: (b, 30), (c, 20). To avoid this, we extend the user key boundaries used for overlap check, by updating the left boundary with the maximum timestamp and the right boundary with the minimum timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12735 Test Plan: Added unit test Reviewed By: ltamasi Differential Revision: D58152117 Pulled By: jowlyzhang fbshipit-source-id: 9cba61e7357f6d76ad44c258381c35073ebbf347	2024-06-04 13:39:51 -07:00
hgy@ruijie.com.cn	21a16f9e64	add c-api to get default cf handle (#12514 ) Summary: rocksdb_batched_multi_get_cf has performance improvement than normal multi_get, however it needs a cf_handle arg, so add a C-API to get and destroy the default cf_handle, as many user only use the default cf. Fixes https://github.com/facebook/rocksdb/issues/12316 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12514 Reviewed By: hx235 Differential Revision: D55922517 Pulled By: ajkr fbshipit-source-id: c4cc4289f2cfd9efbb8f390a44a9d8d1ed08d9f0	2024-06-03 11:09:35 -07:00
Yu Zhang	fc59d8f9c6	Add public API `WriteWithCallback` to support custom callbacks (#12603 ) Summary: This PR adds a `DB::WriteWithCallback` API that does the same things as `DB::Write` while takes an argument `UserWriteCallback` to execute custom callback functions during the write. We currently support two types of callback functions: `OnWriteEnqueued` and `OnWalWriteFinish`. The former is invoked after the write is enqueued, and the later is invoked after WAL write finishes when applicable. These callback functions are intended for users to use to improve synchronization between concurrent writes, their execution is on the write's critical path so it will impact the write's latency if not used properly. The documentation for the callback interface mentioned this and suggest user to keep these callback functions' implementation minimum. Although transaction interfaces' writes doesn't yet allow user to specify such a user write callback argument, the `DBImpl::Write*` type of APIs do not differentiate between regular DB writes or writes coming from the transaction layer when it comes to supporting this `UserWriteCallback`. These callbacks works for all the write modes including: default write mode, Options.two_write_queues, Options.unordered_write, Options.enable_pipelined_write Pull Request resolved: https://github.com/facebook/rocksdb/pull/12603 Test Plan: Added unit test in ./write_callback_test Reviewed By: anand1976 Differential Revision: D58044638 Pulled By: jowlyzhang fbshipit-source-id: 87a84a0221df8f589ec8fc4d74597e72ce97e4cd	2024-05-31 19:30:19 -07:00
Peter Dillinger	7127119ae9	Refactor SyncWAL and SyncClosedLogs for code sharing (#12707 ) Summary: These functions were very similar and did not make sense for maintaining separately. This is not a pure refactor but I think bringing the behaviors closer together should reduce long term risk of unintentionally divergent behavior. This change is motivated by some forthcoming WAL handling fixes for Checkpoint and Backups. * Sync() is always used on closed WALs, like the old SyncClosedWals. SyncWithoutFlush() is only used on the active (maybe) WAL. Perhaps SyncWithoutFlush() should be used whenever available, but I don't know which is preferred, as the previous state of the code was inconsistent. * Syncing the WAL dir is selective based on need, like old SyncWAL, rather than done always like old SyncClosedLogs. This could be a performance improvement that was never applied to SyncClosedLogs but now is. We might still sync the dir more times than necessary in the case of parallel SyncWAL variants, but on a good FileSystem that's probably not too different performance-wise from us implementing something to have threads wait on each other. Cosmetic changes: * Rename internal function SyncClosedLogs to SyncClosedWals * Merging the sync points into the common implementation between the two entry points isn't pretty, but should be fine. Recommended follow-up: * Clean up more confusing naming like log_dir_synced_ Pull Request resolved: https://github.com/facebook/rocksdb/pull/12707 Test Plan: existing tests Reviewed By: anand1976 Differential Revision: D57870856 Pulled By: pdillinger fbshipit-source-id: 5455fba016d25dd5664fa41b253f18db2ca8919a	2024-05-30 14:53:13 -07:00
Rulin Huang	20777b96cb	Optimizations in notify-one (#12545 ) Summary: We tested on icelake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, thread number =activate core number. With our optimizations, the improvement can reach up to 184% in fillseq case. op/s is as the performance indicator in db_bench, and the following are performance improvements in some cases in db_bench. \| case name \| optimized/original \| \|-------------------:\|--------------------:\| \| fillrandom \| 182% \| \| fillseq \| 184% \| \| fillsync \| 136% \| \| overwrite \| 179% \| \| randomreplacekeys \| 180% \| \| randomtransaction \| 161% \| \| updaterandom \| 163% \| \| xorupdaterandom \| 165% \| With analysis, we find that although the process of writing memtable is processed in parallel, the process of waking up the writers is not processed in parallel, which means that only one writers is responsible for the sequential waking up other writers. The following is our method to optimize this process. Assume that there are currently n threads in total, we parallelize SetState in LaunchParallelMemTableWriters. To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5-1) caller writers, and then those callers and the leader will wake up n/x separately to write to the memtable. This reduces the number for the leader's to SetState n-1 writers to 2*(n^0.5) writers in turn. A reproduction script: ./db_bench --benchmarks="fillrandom" --threads ${number of all activate vcpu} --seed 1708494134896523 --duration 60 ![image](https://github.com/facebook/rocksdb/assets/22110918/c5eca02f-93b3-4434-bba2-5155fc892a97) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12545 Reviewed By: ajkr Differential Revision: D57422827 Pulled By: cbi42 fbshipit-source-id: 94127937c0c61e4241720bd902c82c607b7b2431	2024-05-30 09:10:44 -07:00
Changyu Bi	af3be5255a	Fail DeleteRange() early when row_cache is configured (#12710 ) Summary: https://github.com/facebook/rocksdb/issues/12512 added the sanity check for this incompatible combination. However, it does the check during memtable insertion which can turn the DB into read-only mode. This PR moves the check earlier so that this write failure will not turn the DB into read-only mode and affect other DB operations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12710 Test Plan: * updated unit test `DBRangeDelTest.RowCache` to write to DB after a failed DeleteRange(). The test fails before this PR. Reviewed By: ajkr Differential Revision: D57925188 Pulled By: cbi42 fbshipit-source-id: 8bf001bd3fcf05635411ba28bc4a037321942879	2024-05-29 15:03:15 -07:00
anand76	9cc6168c98	Add LDB command and option for follower instances (#12682 ) Summary: Add the `--leader_path` option to specify the directory path of the leader for a follower RocksDB instance. This PR also adds a `count` command to the repl shell. While not specific to followers, it is useful for testing purposes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12682 Reviewed By: jowlyzhang Differential Revision: D57642296 Pulled By: anand1976 fbshipit-source-id: 53767d496ecadc363ff92cd958b8e15a7bf3b151	2024-05-28 23:21:32 -07:00
HypenZou	8765a0f546	Fix version edit dump in json (#12703 ) Summary: Context/Summary: the flag --json of manifest_dump in ldb tool has no effect The bug may be introduced by pr https://github.com/facebook/rocksdb/pull/8378 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12703 Reviewed By: cbi42 Differential Revision: D57848094 Pulled By: ajkr fbshipit-source-id: 3d1ce65528bf4ce9c53593a7208406ab90e8994b	2024-05-28 16:44:25 -07:00
muthukrishnan24	259f21e695	Add WB, WBWI Create, UpdateTimestamp, Iterator::Refresh in C API (#10529 ) Summary: This PR adds UpdateTimestamp API of WriteBatch and WBWI, create WB, WBWI with all options and Iterator Refresh in C API Pull Request resolved: https://github.com/facebook/rocksdb/pull/10529 Reviewed By: cbi42 Differential Revision: D57826913 Pulled By: ajkr fbshipit-source-id: d2ec840129f61a1d3a5a12e859728be98ebbad2f	2024-05-28 15:36:09 -07:00
Jaepil Jeong	c115eb6162	Fix compile errors in C++23 (#12106 ) Summary: This PR fixes compile errors in C++23. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12106 Reviewed By: cbi42 Differential Revision: D57826279 Pulled By: ajkr fbshipit-source-id: 594abfd8eceaf51eaf3bbabf7696c0bb5e0e9a68	2024-05-28 15:33:57 -07:00
Daniel Vasquez Lopez	7c6c632ea9	Use `std::optional` instead of `std::unique_ptr` to conditionally create a read lock. (#12704 ) Summary: This change replaces the use of `std::unique_ptr` with `std::optional` for conditionally constructing a `ReadLock` object. The read lock object was recently introduced in PR https://github.com/facebook/rocksdb/issues/12624. This change makes the code more concise and clarifies that the lock is not meant to be transferred (as `std::unique_ptr` is movable). It also avoids a heap allocation. There are no functional changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12704 Reviewed By: cbi42 Differential Revision: D57848192 Pulled By: ajkr fbshipit-source-id: da48c77aac33b51ba5dcc238f98fc48ccf234a21	2024-05-28 15:31:45 -07:00
Peter Dillinger	d2ef70872f	Rename, deprecate `LogFile` and `VectorLogPtr` (#12695 ) Summary: These names are confusing with `Logger` etc. so moving to `WalFile` etc. Other small, related name refactorings. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12695 Test Plan: Left most unit tests using old names as an API compatibility test. Non-test code compiles with deprecated names removed. No functional changes. Reviewed By: ajkr Differential Revision: D57747458 Pulled By: pdillinger fbshipit-source-id: 7b77596b9c20d865d43b9dc66c30c8bd2b3b424f	2024-05-28 09:24:49 -07:00
Changyu Bi	0ee7f8bacb	Fix `max_read_amp` value in crash test (#12701 ) Summary: It should be no less than `level0_file_num_compaction_trigger`(which defaults to 4) when set to a positive value. Otherwise DB open will fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12701 Test Plan: crash test not failing DB open due to this option value. Reviewed By: ajkr Differential Revision: D57825062 Pulled By: cbi42 fbshipit-source-id: 22d8e12aeceb5cef815157845995a8448552e2d2	2024-05-26 17:26:55 -07:00
Changyu Bi	fecb10c2fa	Improve universal compaction sorted-run trigger (#12477 ) Summary: Universal compaction currently uses `level0_file_num_compaction_trigger` for two purposes: 1. the trigger for checking if there is any compaction to do, and 2. the limit on the number of sorted runs. RocksDB will do compaction to keep the number of sorted runs no more than the value of this option. This can make the option inflexible. A value that is too small causes higher write amp: more compactions to reduce the number of sorted runs. A value that is too big delays potential compaction work and causes worse read performance. This PR introduce an option `CompactionOptionsUniversal::max_read_amp` for only the second purpose: to specify the hard limit on the number of sorted runs. For backward compatibility, `max_read_amp = -1` by default, which means to fallback to the current behavior. When `max_read_amp > 0`,`level0_file_num_compaction_trigger` will only serve as a trigger to find potential compaction. When `max_read_amp = 0`, RocksDB will auto-tune the limit on the number of sorted runs. The estimation is based on DB size, write_buffer_size and size_ratio, so it is adaptive to the size change of the DB. See more in `UniversalCompactionBuilder::PickCompaction()`. Alternatively, users now can configure `max_read_amp` to a very big value and keep `level0_file_num_compaction_trigger` small. This will allow `size_ratio` and `max_size_amplification_percent` to control the number of sorted runs. This essentially disables compactions with reason kUniversalSortedRunNum. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12477 Test Plan: * new unit test * existing unit test for default behavior * updated crash test with the new option * benchmark: * Create a DB that is roughly 24GB in the last level. When `max_read_amp = 0`, we estimate that the DB needs 9 levels to avoid excessive compactions to reduce the number of sorted runs. * We then run fillrandom to ingest another 24GB data to compare write amp. * case 1: small level0 trigger: `level0_file_num_compaction_trigger=5, max_read_amp=-1` * write-amp: 4.8 * case 2: auto-tune: `level0_file_num_compaction_trigger=5, max_read_amp=0` * write-amp: 3.6 * case 3: auto-tune with minimal trigger: `level0_file_num_compaction_trigger=1, max_read_amp=0` * write-amp: 3.8 * case 4: hard-code a good value for trigger: `level0_file_num_compaction_trigger=9` * write-amp: 2.8 ``` Case 1: Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 1.0 0.0 0.0 0.0 22.6 22.6 0.0 1.0 0.0 163.2 141.94 111.10 108 1.314 0 0 0.0 0.0 L45 8/0 1.81 GB 0.0 39.6 11.1 28.5 39.3 10.8 0.0 3.5 209.0 207.3 194.25 191.29 43 4.517 348M 2498K 0.0 0.0 L46 13/0 3.12 GB 0.0 15.3 9.5 5.8 15.0 9.3 0.0 1.6 203.1 199.3 77.13 75.88 16 4.821 134M 2362K 0.0 0.0 L47 19/0 4.68 GB 0.0 15.4 10.5 4.9 14.7 9.8 0.0 1.4 204.0 194.9 77.38 76.15 8 9.673 135M 5920K 0.0 0.0 L48 38/0 9.42 GB 0.0 19.6 11.7 7.9 17.3 9.4 0.0 1.5 206.5 182.3 97.15 95.02 4 24.287 172M 20M 0.0 0.0 L49 91/0 22.70 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0 Sum 169/0 41.74 GB 0.0 89.9 42.9 47.0 109.0 61.9 0.0 4.8 156.7 189.8 587.85 549.45 179 3.284 791M 31M 0.0 0.0 Case 2: Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 1/0 214.47 MB 1.2 0.0 0.0 0.0 22.6 22.6 0.0 1.0 0.0 164.5 140.81 109.98 108 1.304 0 0 0.0 0.0 L44 0/0 0.00 KB 0.0 1.3 1.3 0.0 1.2 1.2 0.0 1.0 206.1 204.9 6.24 5.98 3 2.081 11M 51K 0.0 0.0 L45 4/0 844.36 MB 0.0 7.1 5.4 1.7 7.0 5.4 0.0 1.3 194.6 192.9 37.41 36.00 13 2.878 62M 489K 0.0 0.0 L46 11/0 2.57 GB 0.0 14.6 9.8 4.8 14.3 9.5 0.0 1.5 193.7 189.8 77.09 73.54 17 4.535 128M 2411K 0.0 0.0 L47 24/0 5.81 GB 0.0 19.8 12.0 7.8 18.8 11.0 0.0 1.6 191.4 181.1 106.19 101.21 9 11.799 174M 9166K 0.0 0.0 L48 38/0 9.42 GB 0.0 19.6 11.8 7.9 17.3 9.4 0.0 1.5 197.3 173.6 101.97 97.23 4 25.491 172M 20M 0.0 0.0 L49 91/0 22.70 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0 Sum 169/0 41.54 GB 0.0 62.4 40.3 22.1 81.3 59.2 0.0 3.6 136.1 177.2 469.71 423.94 154 3.050 549M 32M 0.0 0.0 Case 3: Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 5.0 0.0 0.0 0.0 22.6 22.6 0.0 1.0 0.0 163.8 141.43 111.13 108 1.310 0 0 0.0 0.0 L44 0/0 0.00 KB 0.0 0.8 0.8 0.0 0.8 0.8 0.0 1.0 201.4 200.2 4.26 4.19 2 2.130 7360K 33K 0.0 0.0 L45 4/0 844.38 MB 0.0 6.3 5.0 1.2 6.2 5.0 0.0 1.2 202.0 200.3 31.81 31.50 12 2.651 55M 403K 0.0 0.0 L46 7/0 1.62 GB 0.0 13.3 8.8 4.6 13.1 8.6 0.0 1.5 198.9 195.7 68.72 67.89 17 4.042 117M 1696K 0.0 0.0 L47 24/0 5.81 GB 0.0 21.7 12.9 8.8 20.6 11.8 0.0 1.6 198.5 188.6 112.04 109.97 12 9.336 191M 9352K 0.0 0.0 L48 41/0 10.14 GB 0.0 24.8 13.0 11.8 21.9 10.1 0.0 1.7 198.6 175.6 127.88 125.36 6 21.313 218M 25M 0.0 0.0 L49 91/0 22.70 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0 Sum 167/0 41.10 GB 0.0 67.0 40.5 26.4 85.4 58.9 0.0 3.8 141.1 179.8 486.13 450.04 157 3.096 589M 36M 0.0 0.0 Case 4: Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 0.7 0.0 0.0 0.0 22.6 22.6 0.0 1.0 0.0 158.6 146.02 114.68 108 1.352 0 0 0.0 0.0 L42 0/0 0.00 KB 0.0 1.7 1.7 0.0 1.7 1.7 0.0 1.0 185.4 184.3 9.25 8.96 4 2.314 14M 67K 0.0 0.0 L43 0/0 0.00 KB 0.0 2.5 2.5 0.0 2.5 2.5 0.0 1.0 197.8 195.6 13.01 12.65 4 3.253 22M 202K 0.0 0.0 L44 4/0 844.40 MB 0.0 4.2 4.2 0.0 4.1 4.1 0.0 1.0 188.1 185.1 22.81 21.89 5 4.562 36M 503K 0.0 0.0 L45 13/0 3.12 GB 0.0 7.5 6.5 1.0 7.2 6.2 0.0 1.1 188.7 181.8 40.69 39.32 5 8.138 65M 2282K 0.0 0.0 L46 17/0 4.18 GB 0.0 8.3 7.1 1.2 7.9 6.6 0.0 1.1 192.2 181.8 44.23 43.06 4 11.058 73M 3846K 0.0 0.0 L47 22/0 5.34 GB 0.0 8.9 7.5 1.4 8.2 6.8 0.0 1.1 189.1 174.1 48.12 45.37 3 16.041 78M 6098K 0.0 0.0 L48 27/0 6.58 GB 0.0 9.2 7.6 1.6 8.2 6.6 0.0 1.1 195.2 172.9 48.52 47.11 2 24.262 81M 9217K 0.0 0.0 L49 91/0 22.70 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0 Sum 174/0 42.74 GB 0.0 42.3 37.0 5.3 62.4 57.1 0.0 2.8 116.3 171.3 372.66 333.04 135 2.760 372M 22M 0.0 0.0 setup: ./db_bench --benchmarks=fillseq,compactall,waitforcompaction --num=200000000 --compression_type=none --disable_wal=1 --compaction_style=1 --num_levels=50 --target_file_size_base=268435456 --max_compaction_bytes=6710886400 --level0_file_num_compaction_trigger=10 --write_buffer_size=268435456 --seed 1708494134896523 benchmark: ./db_bench --benchmarks=overwrite,waitforcompaction,stats --num=200000000 --compression_type=none --disable_wal=1 --compaction_style=1 --write_buffer_size=268435456 --level0_file_num_compaction_trigger=5 --target_file_size_base=268435456 --use_existing_db=1 --num_levels=50 --writes=200000000 --universal_max_read_amp=-1 --seed=1716488324800233 ``` Reviewed By: ajkr Differential Revision: D55370922 Pulled By: cbi42 fbshipit-source-id: 9be69979126b840d08e93e7059260e76a878bb2a	2024-05-24 10:10:31 -07:00
Andrew Kryczka	c72ee4531b	Fix recycled WAL detection when wal_compression is enabled (#12643 ) Summary: I think the point of the `if (end_of_buffer_offset_ - buffer_.size() == 0)` was to only set `recycled_` when the first record was read. However, the condition was false when reading the first record when the WAL began with a `kSetCompressionType` record because we had already dropped the `kSetCompressionType` record from `buffer_`. To fix this, I used `first_record_read_` instead. Also, it was pretty confusing to treat the WAL as non-recycled when a recyclable record first appeared in a non-first record. I changed it to return an error if that happens. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12643 Reviewed By: hx235 Differential Revision: D57238099 Pulled By: ajkr fbshipit-source-id: e20a2a0c9cf0c9510a7b6af463650a05d559239e	2024-05-22 15:34:37 -07:00
Hui Xiao	733150f6aa	Flush WAL upon DB close (#12684 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/12556 `avoid_sync_during_shutdown=false` missed an edge case where `manual_wal_flush == true` so WAL sync will still miss unflushed WAL. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12684 Test Plan: modified UT to include this case `manual_wal_flush==true` Reviewed By: cbi42 Differential Revision: D57655861 Pulled By: hx235 fbshipit-source-id: c9f49fe260e8b38b3ea387558432dcd9a3dbec19	2024-05-22 11:08:16 -07:00
Levi Tamasi	014368f62c	Fix the names of function objects added in PR 12681 (#12689 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12689 These should be in `snake_case` (not `camelCase`) per our style guide. Reviewed By: jowlyzhang Differential Revision: D57676418 fbshipit-source-id: 82ad6a87d1540f0b29c2f864ca0128287fe95a9e	2024-05-22 11:06:52 -07:00
Levi Tamasi	62600cb2d4	Fix rebuilding transactions containing PutEntity (#12681 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12681 When rebuilding transactions during recovery, `MemtableInserter::PutCFImpl` currently calls `WriteBatchInternal::Put` regardless of value type, which is incorrect for `PutEntity` entries, as well as `TimedPut`s and the blob indexes used by the old BlobDB implementation. The patch fixes the handling of `PutEntity` and returns `NotSupported` for `TimedPut`s and blob indices. Reviewed By: jaykorean, jowlyzhang Differential Revision: D57636355 fbshipit-source-id: 833de4e4aa0b42ff6638b72c4181f981d12d0f15	2024-05-21 17:22:20 -07:00
Peter Dillinger	d89ab23bec	Disallow memtable flush and sst ingest while WAL is locked (#12652 ) Summary: We recently noticed that some memtable flushed and file ingestions could proceed during LockWAL, in violation of its stated contract. (Note: we aren't 100% sure its actually needed by MySQL, but we want it to be in a clean state nonetheless.) Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I found a place to wait to wait for LockWAL to be cleared before allowing these operations to proceed: WaitForPendingWrites() Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652 Test Plan: Added to unit tests. Extended how db_stress validates LockWAL and re-enabled combination of ingestion and LockWAL in crash test, in follow-up to https://github.com/facebook/rocksdb/issues/12642 Ran blackbox_crash_test for a long while with relevant features amplified. Suggested follow-up: fix FaultInjectionTestFS to report file sizes consistent with what the user has requested to be flushed. Reviewed By: jowlyzhang Differential Revision: D57622142 Pulled By: pdillinger fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce	2024-05-21 10:17:34 -07:00
Hui Xiao	d7b938882e	Sync WAL during db Close() (#12556 ) Summary: Context/Summary: Below crash test found out we don't sync WAL upon DB close, which can lead to unsynced data loss. This PR syncs it. ``` ./db_stress --threads=1 --disable_auto_compactions=1 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=1 --bloom_bits=29.895303579352174 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=0 --compaction_readahead_size=0 --compaction_style=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=6 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=3 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=2 --report_bg_io_stats=1 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=0 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=100 Verification failed for column family 0 key 000000000000B9D1000000000000012B000000000000017D (4756691): value_from_db: , value_from_expected: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, msg: Iterator verification: Value not found: NotFound: Verification failed :( ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12556 Test Plan: - New UT - Same stress test command failed before this fix but pass after - CI Reviewed By: ajkr Differential Revision: D56267964 Pulled By: hx235 fbshipit-source-id: af1b7e8769c129f64ba1c7f1ff17102f1239b929	2024-05-20 17:33:43 -07:00
Levi Tamasi	ef1d4955ba	Fix the output of `ldb dump_wal` for PutEntity records (#12677 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12677 The patch contains two fixes related to printing `PutEntity` records with `ldb dump_wal`: 1) It adds the key to the printout (it was missing earlier). 2) It restores the formatting flags of the output stream after dumping the wide-column structure so that any `hex` flag that might have been set does not affect subsequent printing of e.g. sequence numbers. Reviewed By: jaykorean, jowlyzhang Differential Revision: D57591295 fbshipit-source-id: af4e3e219f0082ad39bbdfd26f8c5a57ebb898be	2024-05-20 17:04:14 -07:00
anand76	0ed93552f4	Implement obsolete file deletion (GC) in follower (#12657 ) Summary: This PR implements deletion of obsolete files in a follower RocksDB instance. The follower tails the leader's MANIFEST and creates links to newly added SST files. These links need to be deleted once those files become obsolete in order to reclaim space. There are three cases to be considered - 1. New files added and links created, but the Version could not be installed due to some missing files. Those links need to be preserved so a subsequent catch up attempt can succeed. We insert the next file number in the `VersionSet` to `pending_outputs_` to prevent their deletion. 2. Files deleted from the previous successfully installed `Version`. These are deleted as usual in `PurgeObsoleteFiles`. 3. New files added by a `VersionEdit` and deleted by a subsequent `VersionEdit`, both processed in the same catchup attempt. Links will be created for the new files when verifying a candidate `Version`. Those need to be deleted explicitly as they're never added to `VersionStorageInfo`, and thus not deleted by `PurgeObsoleteFiles`. Test plan - New unit tests in `db_follower_test`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12657 Reviewed By: jowlyzhang Differential Revision: D57462697 Pulled By: anand1976 fbshipit-source-id: 898f15570638dd4930f839ffd31c560f9cb73916	2024-05-17 19:13:33 -07:00
Changyu Bi	ffd7930312	Add more debug print to `DBTestWithParam.ThreadStatusSingleCompaction` (#12661 ) Summary: This test is flaky and a recent failure prints the following: ``` [ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0 thread id: 1842811, thread status: thread id: 1842803, thread status: db/db_test.cc:4697: Failure Expected equality of these values: op_count Which is: 0 expected_count Which is: 1 [ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (307 ms) ``` Empty thread status implies that operation_type of the threads are all OP_UNKNOWN. From `3ed46e0668/monitoring/thread_status_updater.cc (L197)`, this can be due to thread_data->operation_type being OP_UNKNOWN or that thread_data->cf_key it not in `cf_info_map_`, potentially due to how cf_key_ is accessed with relaxed memory order. This PR adds some debug print to print the cf_name to check this. This PR also prints num_running_compaction and lsm state to check if a compaction is indeed running, and removes some not needed options and ensures that exactly 4 L0 files are created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12661 Test Plan: - Cannot repro the failure locally: `gtest-parallel --repeat=10000 --workers=200 ./db_test --gtest_filter="ThreadStatusSingleCompaction"` - New failure message will look like: ``` [ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0 op_count: 1, expected_count 2 thread id: 6104100864, thread status: , cf_name thread id: 6103527424, thread status: Compaction, cf_name default running compaction: 1 lsm state: 4 db/db_test.cc:4885: Failure Value of: match Actual: false Expected: true [ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (115 ms) ``` Reviewed By: hx235 Differential Revision: D57422755 Pulled By: cbi42 fbshipit-source-id: 635663f26052b20e485dfa06a7c0f1f318ac1099	2024-05-16 17:23:56 -07:00
Peter Dillinger	131c8ccfcd	Add EntryType for TimedPut (#12669 ) Summary: Represent internal kTypeValuePreferredSeqno in the public API as kEntryTimedPut (because it is created by TimedPut, until the entry can be safely converted to a regular value entry in compaction) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12669 Test Plan: for follow-up work actually using it. But putting this in place in the public API gives us more flexibility in rolling out that follow-up work (e.g. as a user extension or patch if needed). Reviewed By: jowlyzhang Differential Revision: D57459637 Pulled By: pdillinger fbshipit-source-id: 160ccf7c4e524ee479558846b2a207d51b8b3d9c	2024-05-16 15:18:12 -07:00
Andrew Kryczka	4eaf628120	Add `Iterator` property "rocksdb.iterator.is-value-pinned" (#12659 ) Summary: `ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests. See also: https://github.com/facebook/rocksdb/issues/12658 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659 Reviewed By: cbi42 Differential Revision: D57391200 Pulled By: ajkr fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2	2024-05-15 19:11:52 -07:00
Peter Dillinger	3ed46e0668	Handle early exit in DBErrorHandlingFSTests (#12655 ) Summary: To avoid use-after-free on custom env on ASSERT_WHATEVER failure. This is motivated by a rare crash seen in DBErrorHandlingFSTest.WALWriteError (VersionSet::GetObsoleteFiles in a SstFileManagerImpl::ClearError thread) and wanting to rule out this being related to that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12655 Test Plan: manually seeing ASSERT_WHATEVER failures, especially under ASAN Reviewed By: cbi42 Differential Revision: D57358202 Pulled By: pdillinger fbshipit-source-id: 4da2a0d73a54380b257e5cc1ab6c666e26b83973	2024-05-14 16:44:32 -07:00
Andrii Lysenko	b9fc13db69	Add padding before timestamp size record if it doesn't fit into a WAL block. (#12614 ) Summary: If timestamp size record doesn't fit into a block, without padding `Writer::EmitPhysicalRecord` fails on assert (either `assert(block_offset_ + kHeaderSize + n <= kBlockSize);` or `assert(block_offset_ + kRecyclableHeaderSize + n <= kBlockSize)`, depending on whether recycling log files is enabled) in debug build. In release, current block grows beyond 32K, `block_offset_` gets reset on next `AddRecord` and all the subsequent blocks are no longer aligned by block size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12614 Reviewed By: ltamasi Differential Revision: D57302140 Pulled By: jowlyzhang fbshipit-source-id: cacb5cefb7586885e52a8137ae23a95e1aefca2d	2024-05-14 15:54:02 -07:00
Yu Zhang	1567d50a27	Temporarily disable file ingestion if LockWAL is tested (#12642 ) Summary: As titled. A proper fix should probably be failing file ingestion if the DB is in a lock wal state as it promises to "Freezes the logical state of the DB". Pull Request resolved: https://github.com/facebook/rocksdb/pull/12642 Reviewed By: pdillinger Differential Revision: D57235869 Pulled By: jowlyzhang fbshipit-source-id: c70031463842220f865621eb6f53424df27d29e9	2024-05-14 09:27:48 -07:00
Yu Zhang	c110091d36	Support read timestamp in ldb (#12641 ) Summary: As titled. Also updated sst_dump to print out user-defined timestamp separately and in human readable format. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12641 Test Plan: manually tested: Example success run: ./ldb --db=$TEST_DB multi_get_entity 0x00000000000000950000000000000120 --key_hex --column_family=7 --read_timestamp=1115613683797942 --value_hex 0x00000000000000950000000000000120 ==> :0x0E0000000A0B080906070405020300011E1F1C1D1A1B181916171415121310112E2F2C2D2A2B282926272425222320213E3F3C3D3A3B383936373435323330314E4F4C4D4A4B484946474445424340415E5F5C5D5A5B58595657545552535051 Example failed run: Failed: GetEntity failed: Invalid argument: column family enables user-defined timestamp while --read_timestamp is not a valid uint64 value. sst_dump print out: '000000000000015D000000000000012B000000000000013B\|timestamp:1113554493256671' seq:2330405, type:1 => 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E Reviewed By: ltamasi Differential Revision: D57297006 Pulled By: jowlyzhang fbshipit-source-id: 8486d91468e4f6c0d42dca3c9629f1c45a92bf5a	2024-05-13 15:43:12 -07:00
Hui Xiao	20213d01a3	Fix crash in CompactFiles() of conflict range under `preclude_last_level_data_seconds > 0` (#12628 ) Summary: Context/Summary: Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below ``` Assertion `output_level == 0 \|\| !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed. ``` This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628 Test Plan: New UT crashed before the fix and return correct status after the fix. Reviewed By: cbi42 Differential Revision: D57123536 Pulled By: hx235 fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430	2024-05-13 13:12:06 -07:00
Peter Dillinger	b75438f986	Allow disableWAL+recycle with WritePreparedTxnDB internals (#12639 ) Summary: Follow-up from https://github.com/facebook/rocksdb/issues/12403 The crash test was periodically failing with the "disableWAL option is not supported if recycle_log_file_num > 0" failure, despite not setting the disableWAL from the user side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12639 Test Plan: db_stress reproducer now passes. Added WAL recycling to txn DB unit tests, which is generally more difficult for correctness. Many tests now cover this change and pass. Reviewed By: anand1976 Differential Revision: D57227617 Pulled By: pdillinger fbshipit-source-id: db9abefeb505bce624b45bc64009694d2a5baed9	2024-05-10 17:56:40 -07:00
Levi Tamasi	97e70906fa	Improve the sanity checks in (Multi)GetEntity and friends (#12630 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12630 The patch cleans up, improves, and brings into sync (to the extent possible without API signature changes) the sanity checks around the `GetEntity` / `MultiGetEntity` family of APIs, including the read-your-own-writes (`WriteBatchWithIndex`) and transaction layers. The checks are centralized in two main sets of entry points, namely in `DB(Impl)` and the "main" `GetEntityFromBatchAndDB` / `MultiGetEntityFromBatchAndDB` overloads in `WriteBatchWithIndex`. This eliminates the need to duplicate the checks in the transaction classes. Reviewed By: jaykorean Differential Revision: D57125741 fbshipit-source-id: 4dd059ef644a9b173fbba767538943397e4cc6cd	2024-05-09 12:25:19 -07:00
Wei Liu	1a3357648f	Error log update to db_impl_compaction_flush.cc (#12608 ) Summary: Make the error looks better. Some inconsistency here and [here](`e46ab9d4f0/db/db_impl/db_impl_compaction_flush.cc (L2701-L2702)`) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12608 Reviewed By: ajkr Differential Revision: D57134933 Pulled By: cbi42 fbshipit-source-id: 2f19f077f388d196652a4e3afd2526f18bf75b2d	2024-05-09 11:36:24 -07:00
Yu Zhang	9dc171e3bb	Fix issue that cause false alarm corruption report (#12626 ) Summary: The state of `saved_seq_for_penul_check_` is not correctly maintained with the current flow. It's supposed to store the original sequence number for a `kTypeValuePreferredSeqno` entry for use in the `DecideOutputLevel` function. However, it's not always properly cleared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12626 Test Plan: Added unit test that would fail before the fix ./tiered_compaction_test --gtest_filter="InterleavedTimedPutAndPut" Reviewed By: pdillinger Differential Revision: D57123469 Pulled By: jowlyzhang fbshipit-source-id: 8d73214b3b6dc152daf19b6bd6ee9063581dc277	2024-05-08 15:51:38 -07:00

1 2 3 4 5 ...

5656 Commits