rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-26 16:30:56 +00:00

Author	SHA1	Message	Date
Levi Tamasi	0d1e0722ef	Fix in-place updates for value types other than kTypeValue (#10254 ) Summary: The patch fixes a couple of issues related to in-place updates: 1) the value type was not passed from `MemTableInserter::PutCFImpl` to `MemTable::Update` and 2) `MemTable::UpdateCallback` was called for any value type (with the callee's logic assuming `kTypeValue`) even though the callback mechanism is only safe for plain values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10254 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D37463644 Pulled By: ltamasi fbshipit-source-id: 33802477dac0691681f416ae84c4d9742c6fe41a	2022-06-27 16:37:09 -07:00
Yanqin Jin	d3de59255a	Enable compaction filter for db_stress with user-defined timestamp (#10259 ) Summary: Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter. This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of the presence of timestamps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259 Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts Reviewed By: ajkr Differential Revision: D37459692 Pulled By: riversand963 fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef	2022-06-27 11:53:09 -07:00
Levi Tamasi	c73d2a9d18	Add API for writing wide-column entities (#10242 ) Summary: The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds a new API called `PutEntity` that can be used to write a wide-column entity to the database. The new API is added to both `DB` and `WriteBatch`. Note that currently there is no way to retrieve these entities; more precisely, all read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they encounter a wide-column entity that is required to answer a query. Read-side support (as well as other missing functionality like `Merge`, compaction filter, and timestamp support) will be added in later PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D37369748 Pulled By: ltamasi fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d	2022-06-25 15:30:47 -07:00
Andrew Kryczka	f322f273b0	Temporarily disable mempurge in crash test (#10252 ) Summary: Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252 Reviewed By: riversand963 Differential Revision: D37432948 Pulled By: ajkr fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a	2022-06-24 17:11:27 -07:00
Bo Wang	8e63d90ff8	Pass rate_limiter_priority through filter block reader functions to FS (#10251 ) Summary: With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251 Test Plan: Current unit tests should pass. Reviewed By: pdillinger Differential Revision: D37427667 Pulled By: gitbw95 fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433	2022-06-24 16:13:44 -07:00
zczhu	410ca2efd2	Fix the flaky cursor persist test (#10250 ) Summary: The 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` may occasionally fail due to the inconsistent LSM state. The issue is fixed by adding `Flush()` and `WaitForFlushMemTable()` to produce a more predictable and stable LSM state. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10250 Test Plan: 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` Reviewed By: jay-zhuang, riversand963 Differential Revision: D37426091 Pulled By: littlepig2013 fbshipit-source-id: 56fbaab0384c380c1f279a16dc8732b139c9f611	2022-06-24 14:02:33 -07:00
sdong	246d469750	Reduce overhead of SortFileByOverlappingRatio() (#10161 ) Summary: Currently SortFileByOverlappingRatio() is O(n log n). This is usually OK, but when there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take a non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compaction is only a trivial move and this operation becomes the bottleneck and limits the total throughput. This commit makes SortFileByOverlappingRatio() only find the top 50 files based on score. 50 files are usually enough for the parallel compactions needed for the level, and in case they are not enough, we would fall back to random, which should be acceptable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161 Test Plan: Run a fillseq that generates a lot of files, and observe that throughput improved (although the stall is not yet eliminated). The command run: TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 The throughput improved by 11%. Reviewed By: jay-zhuang Differential Revision: D37129469 fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d	2022-06-24 14:01:11 -07:00
Gang Liao	052666aed5	BlobDB in crash test hitting assertion (#10249 ) Summary: This task is to fix assertion failures during the crash test runs. The cache entry size might not match value size because value size can include the on-disk (possibly compressed) size. Therefore, we removed the assertions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10249 Reviewed By: ltamasi Differential Revision: D37407576 Pulled By: gangliao fbshipit-source-id: 577559f267c5b2437bcd0631cd0efabb6dde3b69	2022-06-23 22:02:16 -07:00
Yanqin Jin	725df120e9	Fix race condition between file purge and backup/checkpoint (#10187 ) Summary: Resolves https://github.com/facebook/rocksdb/issues/10129 I extracted this fix from https://github.com/facebook/rocksdb/issues/7516 since it's also already a bug in main branch, and we want to separate it from the main part of the PR. There can be a race condition between two threads. Thread 1 executes `DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`. ``` Time thread 1 thread 2 \| mutex_.lock \| read disable_delete_obsolete_files_ \| ... \| wait on log_sync_cv and release mutex_ \| mutex_.lock \| ++disable_delete_obsolete_files_ \| mutex_.unlock \| mutex_.lock \| while (pending_purge_obsolete_files > 0) { bg_cv.wait;} \| wake up with mutex_ locked \| compute WALs tracked by MANIFEST \| mutex_.unlock \| wake up with mutex_ locked \| ++pending_purge_obsolete_files_ \| mutex_.unlock \| \| delete obsolete WAL \| WAL missing but tracked in MANIFEST. V ``` The fix proposed eliminates the possibility of the above by increasing `pending_purge_obsolete_files_` before `FindObsoleteFiles()` can possibly release the mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10187 Test Plan: make check Reviewed By: ltamasi Differential Revision: D37214235 Pulled By: riversand963 fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184	2022-06-23 18:32:25 -07:00
Mark Callaghan	6061905790	Wrapper for benchmark.sh to run a sequence of db_bench tests (#10215 ) Summary: This provides two things: 1) Runs a sequence of db_bench tests. This sequence was chosen to provide good coverage with less variance. 2) Makes it easier to do A/B testing for multiple binaries. This combines the report.tsv files into summary.tsv to make it easier to compare results across multiple binaries. Example output for 2) is: ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id 1115171 446.7 9GB 8.9 1.0 454.7 26 26 0 0 0.9 0.5 2 7 51 5547 20 0.0 0 0.1 0.1 0.2 fillseq.wal_disabled.v400 2022-04-12T08:53:51 6.0 1045726 418.9 8GB 0.0GB 8.4 1.0 432.4 27 26 0 0 1.0 0.5 2 6 102 5618 20 0.0 0 0.1 0.0 0.1 fillseq.wal_disabled.v400 2022-04-12T12:25:36 6.28 ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id 2969192 1189.3 16GB 0.0 0.0 0 0 0 0 10.8 9.3 25 33 49 13551 1781 0.0 0 48.2 6.8 16.8 readrandom.t32 2022-04-12T08:54:28 6.0 2692922 1078.6 16GB 0.0GB 0.0 0.0 0 0 0 0 11.9 10.2 30 38 56 49735 1781 0.0 0 47.8 6.7 16.8 readrandom.t32 2022-04-12T12:26:15 6.28 ... 
ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id 180227 72.2 38GB 1126.4 8.7 643.2 3286 3218 0 0 177.6 50.2 2687 4083 6148 854083 1793 68.4 7804 17.0 5.9 0.5 overwrite.t32.s0 2022-04-12T11:55:21 6.0 236512 94.7 31GB 0.0GB 1502.9 8.9 862.2 5242 5125 0 0 135.3 59.9 2537 3268 5404 18545 1785 49.7 5112 25.5 8.0 9.4 overwrite.t32.s0 2022-04-12T15:27:25 6.28 Example output with formatting preserved is here: https://gist.github.com/mdcallag/4432e5bbaf91915c916d46bd6ce3c313 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10215 Test Plan: run it Reviewed By: jay-zhuang Differential Revision: D37299892 Pulled By: mdcallag fbshipit-source-id: e6e0ed638fd7e8deeb869d700593fdc3eba899c8	2022-06-23 18:07:14 -07:00
Yueh-Hsuan Chiang	2a3792edfc	Add suggest_compact_range() and suggest_compact_range_cf() to C API. (#10175 ) Summary: Add suggest_compact_range() and suggest_compact_range_cf() to C API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175 Test Plan: As verifying the result requires SyncPoint, which is not available in the c_test.c, the test is currently done by invoking the functions and making sure it does not crash. Reviewed By: jay-zhuang Differential Revision: D37305191 Pulled By: ajkr fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db	2022-06-23 16:25:25 -07:00
zczhu	17a1d65e3a	Cut output files at compaction cursors (#10227 ) Summary: The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level’s cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two – one entirely before the cursor and one entirely after the cursor. Note that, in rare cases, we DO NOT need to cut the file if it is a trivial move since the file will not be contaminated by older files. In such case, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227 Test Plan: Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test` Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351) Reviewed By: jay-zhuang Differential Revision: D37388088 Pulled By: littlepig2013 fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d	2022-06-23 14:25:42 -07:00
Gang Liao	ba1f62ddfb	Read from blob cache first when MultiGetBlob() (#10225 ) Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225 Test Plan: Add test cases for MultiGetBlob In this task, we added the new API MultiGetBlob() for BlobSource. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Reviewed By: ltamasi Differential Revision: D37358364 Pulled By: gangliao fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6	2022-06-23 13:52:00 -07:00
Guido Tagliavini Ponce	b52620ab0e	Fix key size in cache_bench (#10234 ) Summary: cache_bench wasn't generating 16B keys, which are necessary for FastLRUCache. Also: - Added asserts in cache_bench, which assumes that inserts never fail. When they fail (for example, if we used keys of the wrong size), the memory allocated to the values is leaked, and eventually the program crashes. - Moved kCacheKeySize to the right spot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10234 Test Plan: ``make -j24 check``. Also, run cache_bench with FastLRUCache and check that memory usage doesn't blow up: ``./cache_bench -cache_type=fast_lru_cache -num_shard_bits=6 -skewed=true \ -lookup_insert_percent=100 -lookup_percent=0 -insert_percent=0 -erase_percent=0 \ -populate_cache=true -cache_size=1073741824 -ops_per_thread=10000000 \ -value_bytes=8192 -resident_ratio=1 -threads=16`` Reviewed By: pdillinger Differential Revision: D37382949 Pulled By: guidotag fbshipit-source-id: b697a942ebb215de5d341f98dc8566763436ba9b	2022-06-23 11:26:50 -07:00
Peter Dillinger	f81ea75df7	Don't count no prefix as Bloom hit (#10244 ) Summary: When a key is "out of domain" for the prefix_extractor (no prefix assigned) then the Bloom filter is not queried. PerfContext was counting this as a Bloom "hit" while Statistics doesn't count this as a prefix Bloom checked. I think it's more accurate to call it neither hit nor miss, so this changes the PerfContext counting to be more like Statistics. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10244 Test Plan: tests updated and expanded (Get and MultiGet). Iterator test coverage of the change will come in the next PR. Reviewed By: bjlemaire Differential Revision: D37371297 Pulled By: pdillinger fbshipit-source-id: fed132fba6a92b2314ab898d449fce2d1586c157	2022-06-23 11:00:27 -07:00
Baptiste Lemaire	5879053fd0	Dynamically changeable `MemPurge` option (#10011 ) Summary: Summary Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled. Motivation RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible. Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement. Content of this PR This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes. Benchmarking I will add numbers to prove that there is no performance impact within the next 12 hours. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011 Reviewed By: pdillinger Differential Revision: D36462357 Pulled By: bjlemaire fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802	2022-06-23 09:42:18 -07:00
Gang Liao	2352e2dfda	Add the blob cache to the stress tests and the benchmarking tool (#10202 ) Summary: In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`. As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202 Reviewed By: ltamasi Differential Revision: D37325739 Pulled By: gangliao fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8	2022-06-22 16:04:03 -07:00
Bo Wang	c073ed7601	Fix typo in comments and code (#10233 ) Summary: Fix typo in comments and code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233 Test Plan: Existing unit tests should pass. Reviewed By: jay-zhuang, anand1976 Differential Revision: D37356702 Pulled By: gitbw95 fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51	2022-06-22 15:45:21 -07:00
Yueh-Hsuan Chiang	e103b87296	Add get_column_family_metadata() and related functions to C API (#10207 ) Summary: * Add metadata related structs and functions in C API, including - `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()` that return `rocksdb_column_family_metadata_t`. - `rocksdb_column_family_metadata_t` and its get functions & destroy function. - `rocksdb_level_metadata_t` and its get functions & destroy function. - `rocksdb_file_metadata_t` and its get functions & destroy function. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207 Test Plan: Extend the existing c_test.c to include additional checks for column_family_metadata inside CheckCompaction. Reviewed By: riversand963 Differential Revision: D37305209 Pulled By: ajkr fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e	2022-06-22 15:00:28 -07:00
Alan Paxton	a16e2ff82a	Adapt benchmark result script to new fields. (#10120 ) Summary: Recently merged CI benchmark scripts were failing. There has clearly been a major revision of the fields of benchmark output. The upload script expects and sanity-checks the existence of some fields (changes date to conform to OpenSearch format)..., so the script needs to change. Also add a bit more exception checking to make it more obvious when this happens again. We have deleted the existing report.tsv from the benchmark machine. An existing report.tsv is appended to by default, so that if the fields change, later rows no longer match the header. This makes for an upload that dies half way through the report file, when the format no longer matches the header. Re-instate the config.yml for running the benchmarks, so we can once again test it in situ. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10120 Reviewed By: pdillinger Differential Revision: D37314908 Pulled By: jay-zhuang fbshipit-source-id: 34f5243fee694b75c6838eb55d3398e4273254b2	2022-06-22 09:26:13 -07:00
Yanqin Jin	36fefd7e22	Continue to deflake BackupEngineTest.Concurrency (#10228 ) Summary: Even after https://github.com/facebook/rocksdb/issues/10069, `BackupEngineTest.Concurrency` is still flaky with decreased probability of failure. Repro steps as follows ```bash make backup_engine_test gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency ``` The first two commits of this PR demonstrate how the test is flaky. https://github.com/facebook/rocksdb/issues/10069 handles the case in which `Rename()` file returns `IOError` with subcode `PathNotFound`, and `CreateLoggerFromOptions()` allows the operation to succeed, as expected by the test. However, `BackupEngineTest` uses `RemapFileSystem` on top of `ChrootFileSystem` which can return `NotFound` instead of `IOError`. This behavior is different from `Env::Default()` which returns PathNotFound if the src of `rename()` does not exist. We should make the behaviors of the test Env/FS match a real Env/FS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10228 Test Plan: ```bash make check gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency ``` Reviewed By: pdillinger Differential Revision: D37337241 Pulled By: riversand963 fbshipit-source-id: 07a53115e424467b55a731866e571f0ad4c6635d	2022-06-22 08:50:05 -07:00
Yanqin Jin	9586dcf1ce	Expose the initial logger creation error (#10223 ) Summary: https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`, `DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot directly expose the error to the caller without some additional work. This is a first version proposal which: - Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation - Checks the error during `DB::Open()` and returns it to the caller if non-ok. This is not ideal. We can alternatively move the logger creation logic out of `SanitizeOptions()`. Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything in case other callers of `SanitizeOptions()` assume that a logger should be created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223 Test Plan: make check Reviewed By: pdillinger Differential Revision: D37321717 Pulled By: riversand963 fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b	2022-06-22 08:26:38 -07:00
Yanqin Jin	42c631b339	Update API comment about Options::best_efforts_recovery (#10180 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10180 Reviewed By: pdillinger Differential Revision: D37182037 Pulled By: riversand963 fbshipit-source-id: a8dc865b86e2249beb7a543c317e94a14781e910	2022-06-21 23:34:39 -07:00
Peter Dillinger	84210c9489	Add data block hash index to crash test, fix MultiGet issue (#10220 ) Summary: There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data block hash index, which was not caught because data block hash index was never added to stress tests. This change fixes both issues. Fixes https://github.com/facebook/rocksdb/issues/10186 I intend to pick this into the 7.4.0 release candidate Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220 Test Plan: Failure quickly reproduces in crash test with kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing the failure with a unit test I believe would be too tricky and fragile to be worthwhile. Reviewed By: anand1976 Differential Revision: D37315647 Pulled By: pdillinger fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba	2022-06-21 16:23:58 -07:00
Yanqin Jin	d654888b8f	Refactor wal filter processing during recovery (#10214 ) Summary: So that DBImpl::RecoverLogFiles does not have to deal with implementation details of WalFilter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10214 Test Plan: make check Reviewed By: ajkr Differential Revision: D37299122 Pulled By: riversand963 fbshipit-source-id: acf1a80f1ef75da393d375f55968b2f3ac189816	2022-06-21 14:51:56 -07:00
Bo Wang	f7605ec655	Update LZ4 library for platform009 (#10224 ) Summary: Update LZ4 library for platform009. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10224 Test Plan: Current unit tests should pass. Reviewed By: anand1976 Differential Revision: D37321801 Pulled By: gitbw95 fbshipit-source-id: 8a3d3019d9f7478ac737176f2d2f443c0159829e	2022-06-21 13:22:58 -07:00
zczhu	30141461f9	Add basic kRoundRobin compaction policy (#10107 ) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) whose smallest InternalKey is larger than or equal to the cursor. After a file is chosen, we create a new `Fsize` vector in which the selected file is placed at the first position in `temp`; the next cursor is then updated as the smallest InternalKey in the successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically in `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in the `SaveTo` function. - Cursors are persisted pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee	2022-06-21 11:56:53 -07:00
Yanqin Jin	b012d23557	Destroy initial db dir for a test in DBWALTest (#10221 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10221 Reviewed By: hx235 Differential Revision: D37316280 Pulled By: riversand963 fbshipit-source-id: 062781acec2f36beebc62003bcc8ec280488d572	2022-06-21 11:27:10 -07:00
Guido Tagliavini Ponce	3afed7408c	Replace per-shard chained hash tables with open-addressing scheme (#10194 ) Summary: In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles. Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37296770 Pulled By: guidotag fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad	2022-06-21 08:45:04 -07:00
Gang Liao	deff48bcef	Add blob source to retrieve blobs in RocksDB (#10198 ) Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetches blobs from multi-tier cache and storage with minimal cost. Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel! This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198 Reviewed By: ltamasi Differential Revision: D37294735 Pulled By: gangliao fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e	2022-06-20 20:58:11 -07:00
sdong	4207872fc3	Reduce a duplicate consistency check when applying a new version (#10169 ) Summary: One consistency check in SaveTo() duplicates the one within Apply(). Remove one of them in release mode to reduce time spent in the DB mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10169 Test Plan: Run existing tests and see nothing breaks. Reviewed By: ltamasi Differential Revision: D37157821 fbshipit-source-id: 73b89443a20b43362ff66d10b9212022034a8234	2022-06-20 19:15:59 -07:00
Levi Tamasi	8f59c41cc7	Add new value type for wide-column entities (#10211 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10211 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D37294067 Pulled By: ltamasi fbshipit-source-id: 3b26f1964746ba4e3654579cb07cd975a29c7319	2022-06-20 18:04:08 -07:00
Peter Dillinger	501543573a	Fix bad include (#10213 ) Summary: include "include/rocksdb/blah.h" is messing up some internal builds vs. include "rocksdb/blah." This fixes the bad case and adds a check for future instances. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10213 Test Plan: back-port to 7.4 release candidate and watch internal build Reviewed By: hx235 Differential Revision: D37296202 Pulled By: pdillinger fbshipit-source-id: d7cc6b2c57d858dff0444f19320d83c8b4f9b185	2022-06-20 17:42:01 -07:00
Peter Dillinger	ccb4f047ae	Add 7.4 to format compatibility test (#10209 ) Summary: Forgotten in https://github.com/facebook/rocksdb/issues/10204 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10209 Test Plan: local run with SHORT_TEST=1 Reviewed By: hx235 Differential Revision: D37284028 Pulled By: pdillinger fbshipit-source-id: 631c1969906d002acc930662dcd5eefc0c758429	2022-06-20 13:13:37 -07:00
Peter Dillinger	6358e1b967	Start release 7.5 development (#10204 ) Summary: Update HISTORY.md and version.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/10204 Test Plan: version bump only Reviewed By: ajkr Differential Revision: D37271866 Pulled By: pdillinger fbshipit-source-id: 0ccaa2af36648a5b6017c172a7826a244e1aec93	2022-06-20 07:12:39 -07:00
Peter Dillinger	fac7a23685	Update HISTORY for 7.4.0 release freeze (#10196 ) Summary: Planned for Sunday 6/19 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10196 Test Plan: no code Reviewed By: akankshamahajan15 Differential Revision: D37244857 Pulled By: pdillinger fbshipit-source-id: afbf4aa201983b3c01c16b5f55c68f2325d17421	2022-06-19 16:31:16 -07:00
Changyu Bi	0e0a19832e	Fix a bug in WriteBatchInternal::Append when write batch KV protection is turned on (#10201 ) Summary: This bug was discovered after write batch checksum verification before WAL is added (https://github.com/facebook/rocksdb/issues/10114) and stress test with write batch checksum protection is turned on (https://github.com/facebook/rocksdb/issues/10037). In this [line](`d5d8920f2c/db/write_batch.cc (L2887)`), the number of checksums may not be consistent with `batch->Count()`. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10201 Test Plan: ``` ./db_stress --batch_protection_bytes_per_key=8 --destroy_db_initially=1 --max_key=100000 --use_txn=1 ``` Reviewed By: ajkr Differential Revision: D37260799 Pulled By: cbi42 fbshipit-source-id: ff8dce7dcce295d689333bc9d892d17a843bf0ea	2022-06-18 15:12:17 -07:00
Andrew Kryczka	d5d8920f2c	Fix race condition with WAL tracking and `FlushWAL(true /* sync */)` (#10185 ) Summary: `FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced. The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`. An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB. ``` Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185 Test Plan: - repro unit test - the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list" Reviewed By: riversand963 Differential Revision: D37200993 Pulled By: ajkr fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f	2022-06-17 16:45:28 -07:00
Hui Xiao	a5d773e077	Add rate-limiting support to batched MultiGet() (#10159 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not. However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159 Test Plan: - Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet` - Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds. 
- Stress test: - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work Reviewed By: ajkr, anand1976 Differential Revision: D37135172 Pulled By: hx235 fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd	2022-06-17 16:40:47 -07:00
Gang Liao	c965c9ef65	Read blob from blob cache if exists when GetBlob() (#10178 ) Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`. `MultiGetBlob()` will be included in the next PR. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178 Reviewed By: ltamasi Differential Revision: D37250507 Pulled By: gangliao fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b	2022-06-17 15:22:59 -07:00
Peter Dillinger	1aac814578	Use optimized folly DistributedMutex in LRUCache when available (#10179 ) Summary: folly DistributedMutex is faster than standard mutexes though imposes some static obligations on usage. See https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h for details. Here we use this alternative for our Cache implementations (especially LRUCache) for better locking performance, when RocksDB is compiled with folly. Also added information about which distributed mutex implementation is being used to cache_bench output and to DB LOG. Intended follow-up: * Use DMutex in more places, perhaps improving API to support non-scoped locking * Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently) Credit: Thanks Siying for reminding me about this line of work that was previously left unfinished. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179 Test Plan: for correctness, existing tests. CircleCI config updated. Also Meta-internal buck build updated. For performance, ran simultaneous before & after cache_bench. 
Out of three comparison runs, the middle improvement to ops/sec was +21%: Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode compiler) ``` Complete in 20.201 s; Rough parallel ops/sec = 1584062 Thread ops/sec = 107176 Operation latency (ns): Count: 32000000 Average: 9257.9421 StdDev: 122412.04 Min: 134 Median: 3623.0493 Max: 56918500 Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63 ``` New: (add USE_FOLLY=1) ``` Complete in 16.674 s; Rough parallel ops/sec = 1919135 (+21%) Thread ops/sec = 135487 Operation latency (ns): Count: 32000000 Average: 7304.9294 StdDev: 108530.28 Min: 132 Median: 3777.6012 Max: 91030902 Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83 ``` Reviewed By: anand1976 Differential Revision: D37182983 Pulled By: pdillinger fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1	2022-06-17 13:08:45 -07:00
Peter Dillinger	f87adcfb3f	Fix overflow in ribbon_bench after #10184 (#10195 ) Summary: Ribbon micro-bench needs updating after re-numbering `BloomLikeFilterPolicy::GetAllFixedImpls()` entries. (CircleCI nightly failure.) Also fixed memory leaks while using ASAN to validate my fix. (I assume the leaks weren't intentional for some performance characteristic.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10195 Test Plan: run with ASAN Reviewed By: jay-zhuang Differential Revision: D37244459 Pulled By: pdillinger fbshipit-source-id: 5a363e10de3c4c9c88099c937e3dc3b4cf24fd30	2022-06-17 12:53:57 -07:00
Andrew Kryczka	5d6005c780	Add WriteOptions::protection_bytes_per_key (#10037 ) Summary: Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`. Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user. There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037 Test Plan: - Manual - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24` - Enabled in MyShadow for 1+ week - Automated - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions` - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection Reviewed By: cbi42 Differential Revision: D36614569 Pulled By: ajkr fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb	2022-06-16 23:10:07 -07:00
Peter Dillinger	f62c1e1e56	Fix a false negative merge conflict (#10192 ) Summary: .. between https://github.com/facebook/rocksdb/issues/10184 and https://github.com/facebook/rocksdb/issues/10122 not detected by source control, leading to non-compiling code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10192 Test Plan: updated test Reviewed By: hx235 Differential Revision: D37231921 Pulled By: pdillinger fbshipit-source-id: fa21488716f4c006b111b8c4127d71c757c935c3	2022-06-16 21:14:10 -07:00
Changyu Bi	8cf86258b8	Update HISTORY.md for #10114 (#10189 ) Summary: Update HISTORY.md for https://github.com/facebook/rocksdb/issues/10114: write batch checksum verification before writing to WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10189 Reviewed By: ajkr Differential Revision: D37226366 Pulled By: cbi42 fbshipit-source-id: cd2f076961abc35f35783e0f2cc3beda68cdb446	2022-06-16 19:59:26 -07:00
Peter Dillinger	fff302d989	More testing w/prefix extractor, small refactor (#10122 ) Summary: There was an interesting code path not covered by testing that is difficult to replicate in a unit test, which is now covered using a sync point. Specifically, the case of table_prefix_extractor == null and !need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which can happen if table reader is open before extractor is registered with global object registry, but is later registered and re-set with SetOptions. (We don't have sufficient testing control over object registry to set that up repeatedly.) Also, this function has been renamed to `PrefixRangeMayMatch` for clarity vs. other functions that are not the same. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122 Test Plan: unit tests expanded Reviewed By: siying Differential Revision: D36944834 Pulled By: pdillinger fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb	2022-06-16 16:41:25 -07:00
Peter Dillinger	126c223714	Remove deprecated block-based filter (#10184 ) Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. 
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80	2022-06-16 15:51:33 -07:00
anand76	a6691d0f65	Update stats to help users estimate MultiGet async IO impact (#10182 ) Summary: Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements - 1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file 2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182 Reviewed By: akankshamahajan15 Differential Revision: D37213296 Pulled By: anand1976 fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb	2022-06-16 12:12:43 -07:00
Yanqin Jin	4d31d3c2ed	Abort in dbg mode after logging (#10183 ) Summary: In CompactionIterator code, there are multiple places where the process will abort in dbg mode before logging the error message describing the cause. This PR changes only the logging behavior for compaction iterator so that error message is written to LOG before the process aborts in debug mode. Also updated the triggering condition for an assertion for single delete with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D37190218 Pulled By: riversand963 fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009	2022-06-15 22:00:24 -07:00
Akanksha Mahajan	8353ae8b27	Add few optimizations in async_io for short scans (#10140 ) Summary: This PR adds few optimizations for async_io for shorter scans. 1. If async_io is enabled, seek would create FilePrefetchBuffer object to fetch the data asynchronously. However `FilePrefetchbuffer::num_file_reads_` wasn't taken into consideration if it calls Next after Seek and would go for Prefetching. This PR fixes that and Next will go for prefetching only if `FilePrefetchbuffer::num_file_reads_` is greater than 2 along with if blocks are sequential. This scenario is only for implicit auto readahead. 2. For seek, when it calls TryReadFromCacheAsync to poll it makes async call as well because TryReadFromCacheAsync flow wasn't changed. So I updated to return after poll instead of further prefetching any data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140 Test Plan: 1. Added a unit test 2. Ran crash_test with async_io = 1 to make sure nothing crashes. Reviewed By: anand1976 Differential Revision: D37042242 Pulled By: akankshamahajan15 fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda	2022-06-15 20:17:35 -07:00

11356 commits