rocksdb

Commit Graph

Author	SHA1	Message	Date
Akanksha Mahajan	ed86cd5e78	Remove deprecated overloads of DB::CompactRange (#9444 ) Summary: In RocksDB few overloads of DB::CompactRange() are marked as DEPRECATED_FUNC, and we are removing it in the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9444 Test Plan: CircleCI Reviewed By: ajkr Differential Revision: D33788520 Pulled By: akankshamahajan15 fbshipit-source-id: 716e0d5f227f791605d4d91626c0cbf5b4571630	2022-01-27 23:12:30 -08:00
Jay Zhuang	22321e1027	Remove unused API base_background_compactions (#9462 ) Summary: The API is deprecated long time ago. Clean up the codebase by removing it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462 Test Plan: CI, fake release: D33835220 Reviewed By: riversand963 Differential Revision: D33835103 Pulled By: jay-zhuang fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184	2022-01-27 21:05:18 -08:00
Yanqin Jin	dd203ed604	Disallow a combination of options (#9348 ) Summary: Disallow `immutable_db_opts.use_direct_io_for_flush_and_compaction == true` and `mutable_db_opts.writable_file_max_buffer_size == 0`, since it causes `WritableFileWriter::Append()` to loop forever and does not make much sense in direct IO. This combination of options itself does not make much sense: asking RocksDB to do direct IO but not allowing RocksDB to allocate a buffer. We should detect this false combination and warn user early, no matter whether the application is running on a platform that supports direct IO or not. In the case of platform not supporting direct IO, it's ok if the user learns about this and then finds that direct IO is not supported. One tricky thing: the constructor of `WritableFileWriter` is being used in our unit tests, and it's impossible to return status code from constructor. Since we do not throw, I put an assertion for now. Fortunately, the constructor is not exposed to external applications. Closing https://github.com/facebook/rocksdb/issues/7109 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9348 Test Plan: make check Reviewed By: ajkr Differential Revision: D33371924 Pulled By: riversand963 fbshipit-source-id: 2a3701ab541cee23bffda8a36cdf37b2d235edfa	2022-01-27 19:30:24 -08:00
Peter Dillinger	78aee6fedc	Remove obsolete backupable_db.h, utility_db.h (#9438 ) Summary: This also removes the obsolete names BackupableDBOptions and UtilityDB. API users must now use BackupEngineOptions and DBWithTTL::Open. In C API, `rocksdb_backupable_db_` is replaced `rocksdb_backup_engine_`. Similar renaming in Java API. In reference to https://github.com/facebook/rocksdb/issues/9389 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438 Test Plan: CI Reviewed By: mrambacher Differential Revision: D33780269 Pulled By: pdillinger fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac	2022-01-27 15:45:30 -08:00
Peter Dillinger	ea89c77f27	Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453 ) Summary: MemTable::MultiGet was not considering range tombstones before querying Bloom filter. This means range tombstones would be skipped for keys (or prefixes) with no other entries in the memtable. This could cause old values for a key (in SST files) to still show up until the range tombstone covering it has been flushed. This is fixed by essentially disabling the memtable Bloom filter when there are any range tombstones. (This could be better optimized in the future, but good enough for now.) Did some other cleanup/optimization in the same code to (more than) offset the cost of checking on range tombstones in more cases. There is now notable improvement when memtable_whole_key_filtering and prefix_extractor are used together (unusual), and this makes MultiGet closer to the Get implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453 Test Plan: new unit test added. Added memtable Bloom to crash test. Performance testing -------------------- Build WAL-only DB (recovers to memtable): ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000 ``` Query test command, to maximize sensitivity to the changed code: ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS ``` (Note -num here is 10x larger for mostly memtable misses) Before & after run simultaneously, average over 10 iterations per data point, ops/sec. MWKF=0 PXS=0 (Bloom disabled) Before: 5724844 After: 6722066 MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful) Before: 9981319 After: 10237990 MWKF=0 PXS=8 (prefixes unique; Bloom useful) Before: 12081715 After: 12117603 MWKF=1 PXS=0 (whole key Bloom useful) Before: 11944354 After: 12096085 MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version) Before: 9444299 After: 11826029 MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version) Before: 11784465 After: 11778591 Only in this last case is the 'before' slightly faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster. Reviewed By: ajkr Differential Revision: D33805025 Pulled By: pdillinger fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf	2022-01-27 14:55:04 -08:00
Hui Xiao	1e0e883ca5	Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452 ) Summary: Context/Summary: AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code. - Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation) - Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452 Test Plan: Rely on my eyeball and CI Reviewed By: ajkr Differential Revision: D33804938 Pulled By: hx235 fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945	2022-01-27 13:01:09 -08:00
Baptiste Lemaire	92822655fd	Remove deprecated table_cache_remove_scan_count_limit option. (#9450 ) Summary: In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9450 Reviewed By: ajkr Differential Revision: D33802466 Pulled By: bjlemaire fbshipit-source-id: 97570985f1400525304053476450f7ef504c0cd5	2022-01-27 09:33:31 -08:00
Peter Dillinger	449029f865	Remove deprecated ObjectLibrary::Register() (and Regex public API) (#9439 ) Summary: Regexes are considered potentially problematic for use in registering RocksDB extensions, so we are removing ObjectLibrary::Register() and the Regex public API it depended on (now unused). In reference to https://github.com/facebook/rocksdb/issues/9389 Why? * The power of Regexes can make it hard to reason about which extension will match what. (The replacement API isn't perfect, but we are at least "holding the line" on patterns we have seen in practice.) * It is easy to make regexes that don't quite mean what you think they mean, such as forgetting that the `.` in `foo.bar` can match any character or that matching is nondeterministic, as in `a🅱️42` matching `.:[0-9]+`. Some regexes and implementations can have disastrously bad performance. This might not be much practical concern for ObjectLibray here, but we don't want to encourage potentially dangerous further use in production code. (Testing code is fine. See TestRegex.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9439 Test Plan: CI Reviewed By: mrambacher Differential Revision: D33792342 Pulled By: pdillinger fbshipit-source-id: 4f64dcb04764e639162c8977a5fa196f67754cec	2022-01-26 16:22:44 -08:00
anand76	beb86addeb	Fix race condition in SstFileManagerImpl error recovery code (#9435 ) Summary: There is a race in SstFileManagerImpl between the ClearError() function and CancelErrorRecovery(). The race can cause ClearError() to deref the file system pointer after it has been freed. This is likely to occur during process shutdown, when the order of destruction of the DB/Env/FileSystem and SstFileManagerImpl is not deterministic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9435 Test Plan: Reproduce the crash in a TSAN build by introducing sleeps in the code, and verify with the fix. Reviewed By: siying Differential Revision: D33774696 Pulled By: anand1976 fbshipit-source-id: 643d3da31b8d2ee6d9b6db5d33327e0053ce3b83	2022-01-25 23:22:58 -08:00
Akanksha Mahajan	8822562d75	Remove deprecated function DB::AddFile (#9433 ) Summary: RocksDB has marked DB::AddFile() as "DEPRECATED_FUNC" for a long time, and it will be removed in the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9433 Test Plan: make check -j64; CircleCI Reviewed By: riversand963 Differential Revision: D33763987 Pulled By: akankshamahajan15 fbshipit-source-id: a3407324479bb43689e1213e4e29d53095e7579a	2022-01-25 23:22:58 -08:00
Jay Zhuang	022b400cba	Make `bottommost_temperature` dynamically changeable (#9402 ) Summary: Make `AdvancedColumnFamilyOptions.bottommost_temperature` dynamically changeable with `SetOptions` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9402 Test Plan: added unittest Reviewed By: siying Differential Revision: D33674487 Pulled By: jay-zhuang fbshipit-source-id: 8943768156aa6197c63850a64238a8092527d517	2022-01-25 15:23:04 -08:00
Yanqin Jin	fa52376117	Move RADOS support to separate repo (#9206 ) Summary: This PR moves RADOS support from RocksDB repo to a separate repo. The new (temporary?) repo in this PR serves as an example before we finalize the decision on where and who to host RADOS support. At this point, people can start from the example repo and fork. The goal is to include this commit in RocksDB 7.0 release. Reference: https://github.com/ajkr/dedupfs by ajkr Pull Request resolved: https://github.com/facebook/rocksdb/pull/9206 Test Plan: Follow instructions in https://github.com/riversand963/rocksdb-rados-env/blob/main/README.md and build test binary `env_librados_test` and run it. Also, make check Reviewed By: ajkr Differential Revision: D33751690 Pulled By: riversand963 fbshipit-source-id: 30466c62afa9e4619847a48567ed158e62835e35	2022-01-24 22:50:07 -08:00
Yanqin Jin	50135c1bf3	Move HDFS support to separate repo (#9170 ) Summary: This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point, people can start from the example repo and fork. Java/JNI is not included yet, and needs to be done later if necessary. The goal is to include this commit in RocksDB 7.0 release. Reference: https://github.com/ajkr/dedupfs by ajkr Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170 Test Plan: Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress. make check Reviewed By: ajkr Differential Revision: D33751662 Pulled By: riversand963 fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f	2022-01-24 20:23:54 -08:00
anand76	e8f116deab	Update version to 6.29.0 (#9418 ) Summary: Update version for 6.29 release Pull Request resolved: https://github.com/facebook/rocksdb/pull/9418 Reviewed By: riversand963 Differential Revision: D33721048 Pulled By: anand1976 fbshipit-source-id: e73602ee1c829c2e47ce6e181bca4db7cb663979	2022-01-21 18:23:07 -08:00
Peter Dillinger	e7ac7363b4	Add to HISTORY and minor loose ends from #9294 , #9254 (#9386 ) Summary: Loose ends relate to mmap on 32-bit systems. (Testing is more complicated when the feature was completely disabled on 32-bit.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9386 Test Plan: CI Reviewed By: ajkr Differential Revision: D33590715 Pulled By: pdillinger fbshipit-source-id: f2637036a538a552200adee65b6765fce8cae27b	2022-01-21 13:04:19 -08:00
Peter Dillinger	fc9d4071f0	Fast path for detecting unchanged prefix_extractor (#9407 ) Summary: Fixes a major performance regression in 6.26, where extra CPU is spent in SliceTransform::AsString when reads involve a prefix_extractor (Get, MultiGet, Seek). Common case performance is now better than 6.25. This change creates a "fast path" for verifying that the current prefix extractor is unchanged and compatible with what was used to generate a table file. This fast path detects the common case by pointer comparison on the current prefix_extractor and a "known good" prefix extractor (if applicable) that is saved at the time the table reader is opened. The "known good" prefix extractor is saved as another shared_ptr copy (in an existing field, however) to ensure the pointer is not recycled. When the prefix_extractor has changed to a different instance but same compatible configuration (rare, odd), performance is still a regression compared to 6.25, but this is likely acceptable because of the oddity of such a case. The performance of incompatible prefix_extractor is essentially unchanged. Also fixed a minor case (ForwardIterator) where a prefix_extractor could be used via a raw pointer after being freed as a shared_ptr, if replaced via SetOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407 Test Plan: ## Performance Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load) v6.26: 4833 vs. 6698 (<- major regression!) v6.27: 4737 vs. 6397 (still) New: 6704 vs. 6461 (better than baseline in common case) Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible) Changed prefix size (no usable filter) in new: 787 vs. 5927 Changed prefix size (no usable filter) in new & baseline: 773 vs. 784 Reviewed By: mrambacher Differential Revision: D33677812 Pulled By: pdillinger fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76	2022-01-21 11:37:46 -08:00
Andrew Kryczka	875bfd75a0	Add API warning for `Iterator::Refresh()` with range tombstones (#9398 ) Summary: Need this until we properly return an error or fix the combination. Reported in https://github.com/facebook/rocksdb/issues/9255. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9398 Reviewed By: riversand963 Differential Revision: D33641396 Pulled By: ajkr fbshipit-source-id: 9fe804108f7b93912f5b9c7252ac49acedc4f805	2022-01-19 10:13:27 -08:00
Yanqin Jin	1a8e9f0e07	Use fcntl(F_FULLFSYNC) on OS X (#9356 ) Summary: Closing https://github.com/facebook/rocksdb/issues/5954 fsync/fdatasync on Linux: ``` (fsync/fdatasync) includes writing through or flushing a disk cache if present. ``` However, on OS X and iOS: ``` (fsync) will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence. ``` Solution is to use `fcntl(F_FULLFSYNC)` on OS X so that we get the same persistence guarantee. According to OSX man page, ``` The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. ``` This suggests that it will be no faster than `fsync` on Linux, since Linux, according to its man page, ``` writing through or flushing a disk cache if present ``` It means Linux may not flush all data from disk cache. This is similar to bug reports/fixes in: - golang: https://github.com/golang/go/issues/26650 - leveldb: `296de8d5b8`. Not sure if we should fallback to fsync since we break persistence contract. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9356 Reviewed By: jay-zhuang Differential Revision: D33417416 Pulled By: riversand963 fbshipit-source-id: 475548ff9c5eaccde325e0f6842694271cbc8cb7	2022-01-18 20:23:11 -08:00
Peter Dillinger	5576ded762	Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363 ) Summary: In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out" of extra checks that can impact peak write performance, which currently only includes force_consistency_checks. I considered including some other options but did not see a db_bench performance difference. Also clarify in comment for force_consistency_checks that it can "slow down saturated writing." Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363 Test Plan: basic coverage in unit tests Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see force_consistency_checks=true -> 725360 ops/s force_consistency_checks=false -> 783072 ops/s Reviewed By: mrambacher Differential Revision: D33636559 Pulled By: pdillinger fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84	2022-01-18 17:31:03 -08:00
zhuchong0329	5f2b661f54	FlushMemTable return ok but memtable does not synchronize flush (#8173 ) Summary: Fix https://github.com/facebook/rocksdb/issues/8046 : FlushMemTable return ok but memtable does not synchronize flush. The way to fix it is to expose RecoveryError. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8173 Reviewed By: ajkr Differential Revision: D31674552 Pulled By: jay-zhuang fbshipit-source-id: 9d16b69ba12a196bb429332ec8224754de97773d	2022-01-12 13:21:49 -08:00
mrambacher	1973fcba11	Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362 ) Summary: In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions. The PatternEntry methods were left in place but renamed to AddFactory. The goal is to allow for the deprecation of the original regex Registry method in an upcoming release. Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362 Reviewed By: pdillinger Differential Revision: D33432562 Pulled By: mrambacher fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02	2022-01-11 06:33:48 -08:00
Yanqin Jin	b2e53ab2d8	Add checking for `DB::DestroyColumnFamilyHandle()` (#9347 ) Summary: Closing https://github.com/facebook/rocksdb/issues/5006 Calling `DB::DestroyColumnFamilyHandle(column_family)` with `column_family` being the return value of `DB::DefaultColumnFamily()` will return `Status::InvalidArgument()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9347 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D33369675 Pulled By: riversand963 fbshipit-source-id: a8266a4daddf2b7a773c2dc7f3eb9a4adfb6b6dd	2022-01-05 20:26:53 -08:00
mrambacher	fe31dc53ca	Make the Env class Customizable (#9293 ) Summary: Allows the Env to have options (Configurable) and loads like other Customizable classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293 Reviewed By: pdillinger, zhichao-cao Differential Revision: D33181591 Pulled By: mrambacher fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7	2022-01-04 16:45:49 -08:00
Yanqin Jin	677d2b4a8f	Fix a bug in C-binding causing iterator to return incorrect result (#9343 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/9339 When writing SST file, the name, computed as `prefix_extractor->GetId()` will be written to the properties block. When the SST is opened again in the future, `CreateFromString()` will take the name as argument and try to create a prefix extractor object. Without this fix, the C API will pass a `Wrapper` pointer to the underlying DB's `prefix_extractor`. `Wrapper::GetId()`, in this case, will be missing the prefix length component, causing a prefix extractor of length 0 to be silently created and used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9343 Test Plan: ``` make c_test ./c_test ``` Reviewed By: mrambacher Differential Revision: D33355549 Pulled By: riversand963 fbshipit-source-id: c92c3acd8be262c3bff8794b4229e42b9ee31203	2021-12-30 12:48:07 -08:00
sdong	a931bacf5d	Improve SimulatedHybridFileSystem (#9301 ) Summary: Several improvements to SimulatedHybridFileSystem: (1) Allow a mode where all I/Os to all files simulate HDD. This can be enabled in db_bench using -simulate_hdd (2) Latency calculation is slightly more accurate (3) Allow to simulate more than one HDD spindles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9301 Test Plan: Run db_bench and observe the results are reasonable. Reviewed By: jay-zhuang Differential Revision: D33141662 fbshipit-source-id: b736e58c4ba910d06899cc9ccec79b628275f4fa	2021-12-29 11:14:42 -08:00
Andrew Kryczka	aa2b3bf675	Added `TraceOptions::preserve_write_order` (#9334 ) Summary: This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB. By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334 Test Plan: - See if below unsynced data loss crash test can run for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering. ``` DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400 ``` Reviewed By: zhichao-cao Differential Revision: D33301990 Pulled By: ajkr fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3	2021-12-28 15:04:26 -08:00
Andrew Kryczka	2ee20a669d	Extend trace filtering to more operation types (#9335 ) Summary: - Extended trace filtering to cover `MultiGet()`, `Seek()`, and `SeekForPrev()`. Now all user ops that can be traced support filtering. - Enabled the new filter masks in `db_stress` since it only cares to trace writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9335 Test Plan: - trace-heavy `db_stress` command reduced 30% elapsed time (79.21 -> 55.47 seconds) Benchmark command: ``` $ /usr/bin/time ./db_stress -ops_per_thread=100000 -sync_fault_injection=1 --db=/dev/shm/rocksdb_stress_db/ --expected_values_dir=/dev/shm/rocksdb_stress_expected/ --clear_column_family_one_in=0 ``` - replay-heavy `db_stress` command reduced 12.4% elapsed time (23.69 -> 20.75 seconds) Setup command: ``` $ ./db_stress -ops_per_thread=100000000 -sync_fault_injection=1 -db=/dev/shm/rocksdb_stress_db/ -expected_values_dir=/dev/shm/rocksdb_stress_expected --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress ``` Benchmark command: ``` $ /usr/bin/time ./db_stress -ops_per_thread=1 -reopen=0 -expected_values_dir=/dev/shm/rocksdb_stress_expected/ -db=/dev/shm/rocksdb_stress_db/ --clear_column_family_one_in=0 --destroy_db_initially=0 ``` Reviewed By: zhichao-cao Differential Revision: D33304580 Pulled By: ajkr fbshipit-source-id: 0df10f87c1fc506e9484b6b42cea2ef96c7ecd65	2021-12-28 11:46:30 -08:00
slk	2e5f764294	Make IncreaseFullHistoryTsLow to a public API (#9221 ) Summary: As (https://github.com/facebook/rocksdb/issues/9210) discussed, the full_history_ts_low is a member of CompactRangeOptions currently, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request. However, users may want to advance the fllHistoryTsLow without an immediate compact. This merge make IncreaseFullHistoryTsLow to a public API so users can advance each CF's fullHistoryTsLow seperately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221 Reviewed By: akankshamahajan15 Differential Revision: D33201106 Pulled By: riversand963 fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee	2021-12-23 11:03:51 -08:00
Akanksha Mahajan	7bfad07194	Update to version 6.28 (#9312 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9312 Reviewed By: ajkr Differential Revision: D33196324 Pulled By: akankshamahajan15 fbshipit-source-id: 471da75eaedc54d3151672adc28643bc1d6fdf23	2021-12-17 16:20:39 -08:00
Peter Dillinger	0050a73a4f	New stable, fixed-length cache keys (#9126 ) Summary: This change standardizes on a new 16-byte cache key format for block cache (incl compressed and secondary) and persistent cache (but not table cache and row cache). The goal is a really fast cache key with practically ideal stability and uniqueness properties without external dependencies (e.g. from FileSystem). A fixed key size of 16 bytes should enable future optimizations to the concurrent hash table for block cache, which is a heavy CPU user / bottleneck, but there appears to be measurable performance improvement even with no changes to LRUCache. This change replaces a lot of disjointed and ugly code handling cache keys with calls to a simple, clean new internal API (cache_key.h). (Preserving the old cache key logic under an option would be very ugly and likely negate the performance gain of the new approach. Complete replacement carries some inherent risk, but I think that's acceptable with sufficient analysis and testing.) The scheme for encoding new cache keys is complicated but explained in cache_key.cc. Also: EndianSwapValue is moved to math.h to be next to other bit operations. (Explains some new include "math.h".) ReverseBits operation added and unit tests added to hash_test for both. Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126 Test Plan: ### Basic correctness Several tests needed updates to work with the new functionality, mostly because we are no longer relying on filesystem for stable cache keys so table builders & readers need more context info to agree on cache keys. This functionality is so core, a huge number of existing tests exercise the cache key functionality. ### Performance Create db with `TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters` And test performance with `TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4` using DEBUG_LEVEL=0 and simultaneous before & after runs. Before ops/sec, avg over 100 runs: 121924 After ops/sec, avg over 100 runs: 125385 (+2.8%) ### Collision probability I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity over many months, by making some pessimistic simplifying assumptions: * Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys) * All of every file is cached for its entire lifetime We use a simple table with skewed address assignment and replacement on address collision to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output with `./cache_bench -stress_cache_key -sck_keep_bits=40`: ``` Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached) ``` These come from default settings of 2.5M files per day of 32 MB each, and `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of the 128-bit cache key. With file size of 2\\25 contiguous keys (pessimistic), our simulation is about 2\\(128-40-25) or about 9 billion billion times more prone to collision than reality. More default assumptions, relatively pessimistic: * 100 DBs in same process (doesn't matter much) * Re-open DB in same process (new session ID related to old session ID) on average every 100 files generated * Restart process (all new session IDs unrelated to old) 24 times per day After enough data, we get a result at the end: ``` (keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected) ``` If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data: ``` (keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected) (keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected) ``` The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases: ``` 197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected) ``` I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data. Reviewed By: zhichao-cao Differential Revision: D33171746 Pulled By: pdillinger fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f	2021-12-16 17:15:13 -08:00
Akanksha Mahajan	96d0773a11	Update prepopulate_block_cache logic to support block-based filter (#9300 ) Summary: Update prepopulate_block_cache logic to support block-based filter during insertion in block cache Pull Request resolved: https://github.com/facebook/rocksdb/pull/9300 Test Plan: CircleCI tests, make crash_test -j64 Reviewed By: pdillinger Differential Revision: D33132018 Pulled By: akankshamahajan15 fbshipit-source-id: 241deabab8645bda704728e572d6de6354df18b2	2021-12-15 13:20:27 -08:00
Yanqin Jin	08721293ea	Fix a bug causing duplicate trailing entries in WritableFile (buffered IO) (#9236 ) Summary: `db_stress` is a user of `FaultInjectionTestFS`. After injecting a write error, `db_stress` probabilistically determins data drop (https://github.com/facebook/rocksdb/blob/6.27.fb/db_stress_tool/db_stress_test_base.cc#L2615:L2619). In some of our recent runs of `db_stress`, we found duplicate trailing entries corresponding to file trivial move in the MANIFEST, causing the recovery to fail, because the file move operation is not idempotent: you cannot delete a file from a given level twice. Investigation suggests that data buffering in both `WritableFileWriter` and `FaultInjectionTestFS` may be the root cause. WritableFileWriter buffers data to write in a memory buffer, `WritableFileWriter::buf_`. After each `WriteBuffered()`/`WriteBufferedWithChecksum()` succeeds, the `buf_` is cleared. If the underlying file `WritableFileWriter::writable_file_` is opened in buffered IO mode, then `FaultInjectionTestFS` buffers data written for each file until next file sync. After an injected error, user of `FaultInjectionFS` can choose to drop some or none of previously buffered data. If `db_stress` does not drop any unsynced data, then such data will still exist in the `FaultInjectionTestFS`'s buffer. Existing implementation of `WritableileWriter::WriteBuffered()` does not clear `buf_` if there is an error. This may lead to the data being buffered two copies: one in `WritableFileWriter`, and another in `FaultInjectionTestFS`. We also know that the `WritableFileWriter` of MANIFEST file will close upon an error. During `Close()`, it will flush the content in `buf_`. If no write error is injected to `FaultInjectionTestFS` this time, then we end up with two copies of the data appended to the file. To fix, we clear the `WritableFileWriter::buf_` upon failure as well. We focus this PR on files opened in non-direct mode. This PR includes a unit test to reproduce a case when write error injection to `WritableFile` can cause duplicate trailing entries. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9236 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D33033984 Pulled By: riversand963 fbshipit-source-id: ebfa5a0db8cbf1ed73100528b34fcba543c5db31	2021-12-13 09:00:36 -08:00
Levi Tamasi	297d913275	Update HISTORY.md for PR 9273 (#9282 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9282 Reviewed By: akankshamahajan15 Differential Revision: D33027844 Pulled By: ltamasi fbshipit-source-id: 7540d36010414311bc39610fff92a6498be1570c	2021-12-10 14:50:02 -08:00
Yanqin Jin	bd513fd075	Add commit marker with timestamp (#9266 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266 This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing this tag to WAL, thus it is unavailable to users. This is an ongoing effort to add user-defined timestamp support to write-committed transactions. This diff also indicates all column families that may potentially participate in the same transaction must either disable timestamp or have the same timestamp format, since `CommitWithTimestamp` tag is followed by a single byte-array denoting the commit timestamp of the transaction. We will enforce this checking in a future diff. We keep this diff small. Reviewed By: ltamasi Differential Revision: D31721350 fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6	2021-12-10 11:05:35 -08:00
anand76	ecf2bec613	Add a listener callback for end of auto error recovery (#9244 ) Summary: Previously, the OnErrorRecoveryCompleted callback was called when RocksDB was able to successfully recover from a retryable error. However, if the recovery failed and was eventually stopped, there was no indication of the status. To fix that, a new OnErrorRecoveryEnd callback is introduced that deprecates the OnErrorRecoveryCompleted callback. The new callback is called with the original error and the new error status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9244 Test Plan: Add a new unit test in error_handler_fs_test Reviewed By: zhichao-cao Differential Revision: D32922303 Pulled By: anand1976 fbshipit-source-id: f04e77a9cb92c5ea6385590682d3fcf559971b99	2021-12-08 14:30:57 -08:00
Akanksha Mahajan	9e4d56f2c9	Fix segmentation fault in table_options.prepopulate_block_cache when used with partition_filters (#9263 ) Summary: When table_options.prepopulate_block_cache is set to BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and table_options.partition_filters is also set true, then there is segmentation failure when top level filter is fetched because its entered with wrong type in cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263 Test Plan: Updated unit tests; Ran db_stress: make crash_test -j32 Reviewed By: pdillinger Differential Revision: D32936566 Pulled By: akankshamahajan15 fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426	2021-12-08 12:44:38 -08:00
Hui Xiao	bf2f504188	Add Java API change HISTORY section for #9212 (#9243 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/issues/9212 removed a Java public API without noting it in HISTORY. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9243 Test Plan: Existing tests. Reviewed By: ajkr Differential Revision: D32841050 Pulled By: hx235 fbshipit-source-id: 3b771ffef3ba718f8d70201747ee0e5cbf6de52f	2021-12-03 12:51:38 -08:00
lgqss	77c7085594	MemTableList::TrimHistory now use allocated bytes (#9020 ) Summary: Fix a bug when both max_write_buffer_size_to_maintain and max_write_buffer_number_to_maintain are 0. The bug was introduced in 6.5.0 and https://github.com/facebook/rocksdb/issues/5022. Fix https://github.com/facebook/rocksdb/issues/8371 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9020 Reviewed By: pdillinger Differential Revision: D32767084 Pulled By: ajkr fbshipit-source-id: c401ee6e2557230e892d0fe8abb4966cbd18e85f	2021-12-02 11:45:39 -08:00
Hui Xiao	9daf07305c	Replace TableProperties::properties_offsets map with external_sst_file_global_seqno_offset (#9212 ) Summary: Context: Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offeset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties. Note: - See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets` - See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence. Summary: - Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212 Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested. Reviewed By: ajkr Differential Revision: D32665941 Pulled By: hx235 fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795	2021-12-02 08:30:36 -08:00
Akanksha Mahajan	44ac714808	Update History.md for the bug fix in RocksDB implicit prefetching in #9234 (#9237 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9237 Reviewed By: riversand963 Differential Revision: D32769906 Pulled By: akankshamahajan15 fbshipit-source-id: ef9185f57b7f7cb16daf412ae08104a3e2724191	2021-12-01 12:26:28 -08:00
mrambacher	7cd5835a28	Make RateLimiter Customizable (#9141 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9141 Reviewed By: zhichao-cao Differential Revision: D32432190 Pulled By: mrambacher fbshipit-source-id: 7930ed88a02412128cd407b5063522484e45c6ce	2021-12-01 06:57:02 -08:00
Yanqin Jin	924616526a	Update WriteBatch::AssignTimestamp() and Add (#9205 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9205 Update WriteBatch::AssignTimestamp() APIs so that they take an additional argument, i.e. a function object called `checker` indicating the user-specified logic of performing checks on timestamp sizes. WriteBatch is a building block used by multiple other RocksDB components, each of which may track timestamp information in different data structures. For example, transaction can either write to `WriteBatchWithIndex` which is a `WriteBatch` with index, or write directly to raw `WriteBatch` if `Transaction::DisableIndexing()` is called. `WriteBatchWithIndex` keeps mapping from column family id to comparator, and transaction needs to keep similar information for the `WriteBatch` if user calls `Transaction::DisableIndexing()` (dynamically) so that we will know the size of each timestamp later. The bookkeeping info maintained by `WriteBatchWithIndex` and `Transaction` should not overlap. When we later call `WriteBatch::AssignTimestamp()`, we need to use these data structures to guarantee that we do not accidentally assign timestamps for keys from column families that disable timestamp. Reviewed By: ltamasi Differential Revision: D31735186 fbshipit-source-id: 8b1709ed880ac72f995aa9e012e5873b290840a7	2021-11-30 22:33:00 -08:00
leipeng	c712b68f5b	Fix num files in single compaction for universal compaction (#9168 ) Summary: https://github.com/facebook/rocksdb/issues/9026 fixed histogram NUM_FILES_IN_SINGLE_COMPACTION for level compaction, but missed fix for universal compaction. This PR fixed NUM_FILES_IN_SINGLE_COMPACTION for universal compaction. Quote from https://github.com/facebook/rocksdb/issues/9026: > currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input. Thanks for ajkr pointed this missed fix! Pull Request resolved: https://github.com/facebook/rocksdb/pull/9168 Reviewed By: akankshamahajan15 Differential Revision: D32434494 Pulled By: ajkr fbshipit-source-id: 93ea092af4afbd8dce67898ffb350cf26b065ed2	2021-11-30 15:11:21 -08:00
Peter Dillinger	e8b5d05e93	HISTORY for #9208 (#9227 ) Summary: Update HISTORY for bug fix. This is going into 6.27 initial release. (Technically 6.27.1) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9227 Test Plan: n/a Reviewed By: ajkr Differential Revision: D32727912 Pulled By: pdillinger fbshipit-source-id: 75e7a81749a188a590d44ef47e261eaaa8667152	2021-11-30 15:01:59 -08:00
Yanqin Jin	8101643611	Update HISTORY and version.h for 6.27 release (#9192 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9192 Reviewed By: ltamasi Differential Revision: D32578141 Pulled By: riversand963 fbshipit-source-id: 16216451c87e383ca8fd309acf15106e46172aaa	2021-11-19 22:11:56 -08:00
Levi Tamasi	3a9f557451	Update HISTORY for PR 9187 (#9191 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9191 Reviewed By: riversand963 Differential Revision: D32577939 Pulled By: ltamasi fbshipit-source-id: 3c52067a0c3e9219c1aafdb711718dfcce5dedf5	2021-11-19 20:07:11 -08:00
Yanqin Jin	43ac7a2774	Fix an assertion failure when ManifestTailer switches to new Manifest in multi-cf mode (#9143 ) Summary: Original unit test fail to test the case of multi-cf mode switching to new manifest. The assertion failure will trigger when the primary instance reopens and secondary continues to tail the newly-created MANIFEST. Fix the assertion failure and update existing unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9143 Test Plan: make check Reviewed By: ltamasi Differential Revision: D32574233 Pulled By: riversand963 fbshipit-source-id: 857ddbe994019091276458abebcf8e2b65340468	2021-11-19 19:53:40 -08:00
Jay Zhuang	6cde8d2190	Deprecating `iter_start_seqnum` and `preserve_deletes` (#9091 ) Summary: `ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are deprecated, please try using user defined timestamp feature instead. The feature is used to support differential snapshots, but not well maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which returns an internal key from the iterator. The user defined timestamp feature is a more flexible feature to support similar usecase, please switch to that if you have such usecase. The deprecated feature will be removed in a future release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091 Test Plan: check LOG Fix https://github.com/facebook/rocksdb/issues/9090 Reviewed By: ajkr Differential Revision: D32071750 Pulled By: jay-zhuang fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104	2021-11-19 16:55:45 -08:00
Yanqin Jin	1e8322c0f5	Fix a bug in FlushJob picking more memtables beyond synced WALs (#9142 ) Summary: After RocksDB 6.19 and before this PR, RocksDB FlushJob may pick more memtables to flush beyond synced WALs. This can be problematic if there are multiple column families, since it can prematurely advance the flushed column family's log_number. Should subsequent attempts fail to sync the latest WALs and the database goes through a recovery, it may detect corrupted WAL number below the flushed column family's log number and complain about column family inconsistency. To fix, we record the maximum memtable ID of the column family being flushed. Then we call SyncClosedLogs() so that all closed WALs at the time when memtable ID is recorded will be synced. I also disabled a unit test temporarily due to reasons described in https://github.com/facebook/rocksdb/issues/9151 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9142 Test Plan: make check Reviewed By: ajkr Differential Revision: D32299956 Pulled By: riversand963 fbshipit-source-id: 0da75888177d91905cf8c9d00605b73afb5970a7	2021-11-19 09:56:00 -08:00
Andrew Kryczka	8cf4294e25	Adhere to per-DB concurrency limit when bottom-pri compactions exist (#9179 ) Summary: - Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect. - Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179 Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency. Reviewed By: riversand963 Differential Revision: D32488048 Pulled By: ajkr fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310	2021-11-18 17:31:50 -08:00

1 2 3 4 5 ...

1147 Commits