Summary:
Logically strip the user-defined timestamp when L0 files are created during flush, if `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` is false. Logically stripping the timestamp here means replacing the original user-defined timestamp with a minimum timestamp, which for now is hard-coded to be all zero bytes.
While working on this, I caught a missing piece at the `BlockBuilder` level for this feature. The current quick path `std::min(buffer_size, last_key_size)` needs a bit of tweaking to work here. When user-defined timestamps are stripped during block building, on writing the first entry or right after a reset, `buffer` is empty and `buffer_size` is zero as usual. However, in follow-up writes, depending on the size of the stripped user-defined timestamp and the size of the value, the contents of `buffer` can sometimes be smaller than `last_key_size`, causing `std::min(buffer_size, last_key_size)` to truncate `last_key`. Previous tests didn't catch the bug because in those tests the size of the stripped user-defined timestamp was smaller than the length of the value. To avoid a conditional operation, this PR changed the original trivial `std::min` operation into an arithmetic operation. Since this is a change in a hot, performance-critical path, I ran the following benchmark to check that no observable regression is introduced.
```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000```
Compiled with DEBUG_LEVEL=0
Test vs. control runs were done simultaneously for better accuracy; units = ops/sec
PR vs base:
Round 1: 350652 vs 349055
Round 2: 365733 vs 364308
Round 3: 355681 vs 354475
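For illustration, here is a minimal sketch of the quick-path issue and the branch-free alternative described above. This is not the exact `BlockBuilder` code; variable and function names are approximations.
```
// Sketch only. `buffer_size` is the number of bytes already in the block
// buffer; it is 0 on the first entry or right after Reset().
#include <cstddef>

inline size_t LastKeyLenForQuickPath(size_t buffer_size, size_t last_key_size) {
  // Old quick path: std::min(buffer_size, last_key_size). It relied on the
  // invariant buffer_size == 0 || buffer_size >= last_key_size, which breaks
  // once stripped timestamps make the buffered entry shorter than the
  // unstripped last_key, truncating last_key.
  //
  // Branch-free alternative: use the full last_key_size whenever the buffer
  // is non-empty, and 0 otherwise.
  return last_key_size * static_cast<size_t>(buffer_size > 0);
}
```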
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11557
Test Plan:
New timestamp specific test added or existing tests augmented, both are parameterized with `UserDefinedTimestampTestMode`:
`UserDefinedTimestampTestMode::kNormal` -> UDT feature enabled, write / read with min timestamp
`UserDefinedTimestampTestMode::kStripUserDefinedTimestamps` -> UDT feature enabled, write / read with min timestamp, set Options.persist_user_defined_timestamps to false.
```
make all check
./db_wal_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
./repair_test --gtest_filter="*WithTimestamp*"
./block_based_table_reader_test
```
Reviewed By: pdillinger
Differential Revision: D47027664
Pulled By: jowlyzhang
fbshipit-source-id: e729193b6334dfc63aaa736d684d907a022571f5
Summary:
Expose the remaining fields of PlainTableOptions as arguments to `rocksdb_options_set_plain_table_factory` in the C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11442
Reviewed By: ajkr
Differential Revision: D46786962
Pulled By: hx235
fbshipit-source-id: 8862083dde332bfecc5ff02f9375776ad35c11f5
Summary:
Add a `skip_tmpdir_check` argument to the crash script. If `tmp_dir` is on remote storage and exists, `isdir` will be false (the check runs against local storage), leading to an exit. By passing `skip_tmpdir_check` to `crashtest.py`, the directory check can be skipped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11539
Test Plan: Ran locally
Reviewed By: anand1976
Differential Revision: D46740456
Pulled By: akankshamahajan15
fbshipit-source-id: 8726882ef53d2c84b604c7515e84eda6d1bf797c
Summary:
I made some changes to add OpenBSD support.
Second time doing something like this, so I apologize in advance if I'm doing something wrong (had some minor hiccups with how github worked).
Fixes https://github.com/facebook/rocksdb/issues/11220
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11255
Reviewed By: akankshamahajan15
Differential Revision: D46361706
Pulled By: ajkr
fbshipit-source-id: 90922fa30197fe6d6f3c0e3ecca2dbb92c337277
Summary:
There are some comments on subclasses in the EncryptedEnv module that duplicate those on their parent classes. It would be nice to remove the duplication so the comments stay consistent if the parent classes' comments are updated someday.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11549
Reviewed By: akankshamahajan15
Differential Revision: D47007061
Pulled By: ajkr
fbshipit-source-id: 8bfdaf9f2418a24ca951c30bb88e90ac861d9016
Summary:
Infer detected an [Unnecessary Copy Intermediate](https://fbinfer.com/docs/next/all-issue-types#unnecessary copy intermediate) issue: the variable `my_secondary_handles` is copied unnecessarily into an intermediate on line 268. To avoid the copy, move it by calling `std::move`, or alternatively change the callee's parameter type to `const &`.
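As a hedged illustration of the suggested fix (hypothetical callee and types; not the actual code around line 268):
```
#include <utility>
#include <vector>

void Consume(std::vector<int> handles);  // callee takes its argument by value

void Example(std::vector<int> my_secondary_handles) {
  // Before: passing the local directly copies it into an intermediate.
  // After: move the local into the callee to avoid the copy (alternatively,
  // change the callee's parameter type to const std::vector<int>&).
  Consume(std::move(my_secondary_handles));
}
```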
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11566
Reviewed By: akankshamahajan15
Differential Revision: D47057361
Pulled By: ajkr
fbshipit-source-id: bc5d7a71638aecbf976f1a163128b489c9e87fd8
Summary:
When `num_file_reads_for_auto_readahead = 1`, during a seek it would prefetch extra data in the second buffer along with the seek data, which would lead to an increase in read data and discarded bytes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11560
Test Plan: Added unit test
Reviewed By: anand1976
Differential Revision: D47008102
Pulled By: akankshamahajan15
fbshipit-source-id: 566c6131cb5f968d5efb81fd0ab233ff7e534ab0
Summary:
https://github.com/facebook/rocksdb/issues/11378 added a new overloaded `CreateColumnFamilyWithImport` API and updated the virtual function in `StackableDB` and `DBImplReadOnly` to the newly overloaded one. This caused an internal error when a derived class tries to override the original `CreateColumnFamilyWithImport` function. This PR adds the original `CreateColumnFamilyWithImport` function back to `StackableDB` and `DBImplReadOnly`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11556
Test Plan: check if this fixes an internal build
Reviewed By: akankshamahajan15
Differential Revision: D46980506
Pulled By: cbi42
fbshipit-source-id: 975a6c5748bf9481499a62ee5997ca59e542e3bc
Summary:
1. Public API change: Replace the `use_async_io` API in file_system with a `SupportedOps` API, which the underlying FileSystem uses to indicate to upper layers whether it supports the different operations introduced in `enum FSSupportedOps`. Right now the operations are `async_io` and whether the FS will provide its own buffer during reads. The API is changed so it can be extended to various FileSystem operations in one API rather than creating a separate API for each operation.
2. Provide support for the underlying FS to pass its own buffer during reads (async and sync) instead of using the RocksDB-provided `scratch` (buffer) in `FSReadRequest`. Currently only MultiRead supports it; later it will be extended to other reads as well (point lookup, scan, etc.). More details on how to enable this are in file_system.h
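A rough sketch of what advertising supported operations could look like for a custom FileSystem; the enum values and the `SupportedOps` signature below are my reading of file_system.h, so consult the header for the authoritative API.
```
#include "rocksdb/file_system.h"

class MyFileSystem : public ROCKSDB_NAMESPACE::FileSystemWrapper {
 public:
  explicit MyFileSystem(const std::shared_ptr<ROCKSDB_NAMESPACE::FileSystem>& t)
      : FileSystemWrapper(t) {}
  const char* Name() const override { return "MyFileSystem"; }

  void SupportedOps(int64_t& supported_ops) override {
    supported_ops = 0;
    // Advertise async reads and the ability to hand RocksDB our own buffer
    // during MultiRead instead of filling the provided `scratch`.
    supported_ops |= (1 << ROCKSDB_NAMESPACE::FSSupportedOps::kAsyncIO);
    supported_ops |= (1 << ROCKSDB_NAMESPACE::FSSupportedOps::kFSBuffer);
  }
};
```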
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324
Test Plan: Tested locally
Reviewed By: anand1976
Differential Revision: D44465322
Pulled By: akankshamahajan15
fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec
Summary:
... instead of race-condition-laden FaultInjectionTestEnv. See https://app.circleci.com/pipelines/github/facebook/rocksdb/27912/workflows/4c63e5a8-597e-439d-8c7e-82308056af02/jobs/609648 and similar PR https://github.com/facebook/rocksdb/issues/11271
Had to fix the semantics of FaultInjectionTestFS Close() operations to allow a non-OK Close() to fulfill the obligation to close before destruction. To me, this is the obvious choice of Close contract, because what is the caller supposed to do if Close() fails and they still have an obligation to successfully close before object destruction? Call Close() in an infinite loop? Leak the object? I have added API comments to the Env and Filesystem Close() functions to clarify the contracts.
Note that `DB::Close()` has one exception to this kind of Close contract, but it is clearly described in API comments and it is really only for catching programming mistakes, not for dealing with exogenous errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11499
Test Plan: watch CI
Reviewed By: jowlyzhang
Differential Revision: D46375708
Pulled By: pdillinger
fbshipit-source-id: 03d4d8251e5df50a82ecd139f7e83f613015fe40
Summary:
Start recording the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for an SST file when it is created, and use the recorded flag when creating a table reader for the SST file. This flag's default value is true; it is only explicitly recorded if it's false.
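A hedged usage sketch for the flag involved (not part of this PR's diff; assumes the built-in comparator with a 64-bit timestamp suffix):
```
#include <string>

#include "rocksdb/comparator.h"
#include "rocksdb/db.h"

// Open a DB whose keys carry user-defined timestamps in memory and in the
// WAL, but whose SST files do not persist them; the false value of the flag
// is what gets recorded in the Manifest and table properties.
ROCKSDB_NAMESPACE::Status OpenWithStrippedTimestamps(
    const std::string& path, ROCKSDB_NAMESPACE::DB** db) {
  ROCKSDB_NAMESPACE::Options options;
  options.create_if_missing = true;
  options.comparator = ROCKSDB_NAMESPACE::BytewiseComparatorWithU64Ts();
  options.persist_user_defined_timestamps = false;
  return ROCKSDB_NAMESPACE::DB::Open(options, path, db);
}
```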
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515
Test Plan:
```
make all check
./version_edit_test
```
Reviewed By: ltamasi
Differential Revision: D46920386
Pulled By: jowlyzhang
fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26
Summary:
... so that a non-cryptographic whole file checksum would be highly resistant
to manipulation by a user able to manipulate key-value data (e.g. a user whose data is
stored in RocksDB) and able to predict SST metadata such as DB session id and file
number based on read access to logs or DB files. The adversary would also need to predict
the salt in order to influence the checksum result toward collision with another file's
checksum.
This change is just internal code to support such a future feature. I think this should be a
passive feature, not option-controlled, because you probably won't think about needing it
until you discover you do need it, and it should be low cost, in space (16 bytes per SST
file) and CPU.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11331
Test Plan: Unit tests added to verify at least pseudorandom behavior. (Actually caught a bug in first draft!) The new "stress" style tests run in ~3ms each on my system.
Reviewed By: ajkr
Differential Revision: D46129415
Pulled By: pdillinger
fbshipit-source-id: 7972dc74487e062b29b1fd9c227425e922c98796
Summary:
The factory `NewCompactOnDeletionCollectorFactory` exposes the parameter `delete_ratio`.
The C API `rocksdb_options_add_compact_on_deletion_collector_factory` does not allow a user to pass a delete ratio down to the underlying C++ class.
The C++ class has a default value for the delete ratio, which is why this still passed compilation and the tests.
closes https://github.com/facebook/rocksdb/issues/11541
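For reference, a sketch of plugging in the underlying C++ factory with an explicit delete ratio; argument names and order are as I recall them from utilities/table_properties_collectors.h, so double-check the header.
```
#include "rocksdb/options.h"
#include "rocksdb/utilities/table_properties_collectors.h"

void AddDeletionCollector(ROCKSDB_NAMESPACE::ColumnFamilyOptions* cf_opts) {
  // Trigger compaction when >= 50 deletes are seen in a 128-entry window,
  // or when the overall deletion ratio of a file exceeds 0.5.
  cf_opts->table_properties_collector_factories.push_back(
      ROCKSDB_NAMESPACE::NewCompactOnDeletionCollectorFactory(
          /*sliding_window_size=*/128, /*deletion_trigger=*/50,
          /*deletion_ratio=*/0.5));
}
```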
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11542
Reviewed By: ajkr
Differential Revision: D46770908
Pulled By: cbi42
fbshipit-source-id: 7b5162fe459896052e392e2d85a8f6c01db3b464
Summary:
Calling `Flush` (even with `wait==true`) does not guarantee that obsolete WAL files are physically deleted before the call returns. The patch attempts to fix the resulting flakiness by using `SyncPoint`s to make sure `PurgeObsoleteFiles` finishes before checking for WAL deletions.
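A hedged sketch of the `SyncPoint` technique (the point names below are hypothetical; the test wires up whatever points exist around `PurgeObsoleteFiles`):
```
#include "test_util/sync_point.h"

void ArmWalPurgeDependency() {
  auto* sp = ROCKSDB_NAMESPACE::SyncPoint::GetInstance();
  // The first (predecessor) point must be reached before the second
  // (successor) point is allowed to proceed, so the WAL-deletion check in
  // the test cannot run until PurgeObsoleteFiles has finished.
  sp->LoadDependency({
      {"DBImpl::PurgeObsoleteFiles:End", "DBWALTest::CheckWalsDeleted"},
  });
  sp->EnableProcessing();
  // The test body would then hit TEST_SYNC_POINT("DBWALTest::CheckWalsDeleted")
  // right before asserting that the obsolete WAL files are gone.
}
```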
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11537
Test Plan:
```
gtest-parallel --repeat=1000 ./db_wal_test --gtest_filter="*SkipDeletedWALs*"
```
Reviewed By: pdillinger
Differential Revision: D46736050
Pulled By: ltamasi
fbshipit-source-id: 47a931b7a3a03ef681fbf4adb5a0b223d452703e
Summary:
`StressTest::optimistic_txn_db_` is currently not initialized by the constructor, which
can lead to assertion failures down the line in `StressTest::Open`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11547
Reviewed By: cbi42
Differential Revision: D46845658
Pulled By: ltamasi
fbshipit-source-id: 578b0f24fc00e3e97f24221fcdd003cc529439c2
Summary:
Context:
OptimisticTransactionDB has not been covered by db_stress (including crash test) like TransactionDB.
1. Adding the following gflag options to test OptimisticTransactionDB
- `use_optimistic_txn`: When true, open OptimisticTransactionDB to test
- `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default.
- `share_occ_lock_buckets`: Use shared occ locks
- `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock.
2. Opening OptimisticTransactionDB and NewTxn/Commit added per `use_optimistic_txn` flag in `db_stress_test_base.cc`
3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk
Please note that the existing flag `use_txn` is being used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. pessimistic transaction db). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true`, we throw an error with the message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_.
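A hedged sketch of the flag-resolution logic described above (not the exact db_stress code):
```
#include <stdexcept>

enum class TxnDbKind { kNone, kPessimistic, kOptimistic };

TxnDbKind ResolveTxnDbKind(bool use_txn, bool use_optimistic_txn) {
  if (!use_txn && use_optimistic_txn) {
    // db_stress reports an error for this combination.
    throw std::invalid_argument(
        "You cannot set use_optimistic_txn true while use_txn is false. "
        "Please set use_txn true if you want to use OptimisticTransactionDB");
  }
  if (use_txn && use_optimistic_txn) {
    return TxnDbKind::kOptimistic;   // OptimisticTransactionDB
  }
  if (use_txn) {
    return TxnDbKind::kPessimistic;  // TransactionDB
  }
  return TxnDbKind::kNone;           // plain DB
}
```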
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513
Test Plan:
**Crash Test**
Serial Validation
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0"
make crash_test -j
```
Parallel Validation (no share bucket)
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0"
make crash_test -j
```
Parallel Validation (share bucket)
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500"
make crash_test -j
```
**Stress Test**
```
./db_stress -use_optimistic_txn -threads=32
```
Reviewed By: pdillinger
Differential Revision: D46547387
Pulled By: jaykorean
fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960
Summary:
**Context/Summary:**
When a db is upgrading to adopt [pr11406](https://github.com/facebook/rocksdb/pull/11406/), it's possible for RocksDB to infer a small tail size to prefetch for pre-upgrade files. Such a small tail size would have caused one file read per index or filter partition if partitioned index or filter is used. This PR provides a UT to show this would not happen.
Misc: refactor the related UTs a bit to make this new UT more readable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11522
Test Plan:
- New UT
If the upgrade logic is wrong, e.g.,
```
--- a/table/block_based/partitioned_index_reader.cc
+++ b/table/block_based/partitioned_index_reader.cc
@@ -166,7 +166,8 @@ Status PartitionIndexReader::CacheDependencies(
uint64_t prefetch_len = last_off - prefetch_off;
std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
- tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
+ (false && tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off)) {
```
, then the UT will fail like below
```
[ RUN ] PrefetchTailTest/PrefetchTailTest.UpgradeToTailSizeInManifest/0
file/prefetch_test.cc:461: Failure
Expected: (db_open_file_read.count) < (num_index_partition), actual: 38 vs 33
Received signal 11 (Segmentation fault)
```
Reviewed By: pdillinger
Differential Revision: D46546707
Pulled By: hx235
fbshipit-source-id: 9897b0a975e9055963edac5451fd1cd9d6c45d0e
Summary:
Fix the error handling in `GetHostName` for errors other than EFAULT and EINVAL. The current handling can cause a stack overflow when a non-null-terminated C-style string is left in `name`, e.g. on ENAMETOOLONG, when the `name` buffer is not big enough and the host name is truncated.
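A hedged sketch of the fix (not the exact PosixEnv code): for errors other than EFAULT/EINVAL, the buffer may hold a truncated, non-null-terminated host name, so it is terminated explicitly before being used as a C-style string.
```
#include <unistd.h>

#include <cerrno>
#include <cstdint>
#include <cstring>

#include "rocksdb/status.h"

ROCKSDB_NAMESPACE::Status GetHostNameSketch(char* name, uint64_t len) {
  int ret = gethostname(name, static_cast<size_t>(len));
  if (ret < 0) {
    if (errno == EFAULT || errno == EINVAL) {
      return ROCKSDB_NAMESPACE::Status::InvalidArgument(strerror(errno));
    }
    // e.g. ENAMETOOLONG: the name was truncated and may not be
    // null-terminated; terminate it so later reads stay in bounds.
    name[len - 1] = '\0';
  }
  return ROCKSDB_NAMESPACE::Status::OK();
}
```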
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11544
Test Plan:
```
COMPILE_WITH_ASAN=1 make all check
```
Reviewed By: pdillinger
Differential Revision: D46775799
Pulled By: jowlyzhang
fbshipit-source-id: e0fc9400c50fe38bc1fd888b4fea5fe8706165bf
Summary:
This ticker combined with `rocksdb.files.marked.trash` can help give a better picture of how DeleteScheduler is keeping up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11540
Test Plan:
```
./delete_scheduler_test
```
Reviewed By: ajkr
Differential Revision: D46746401
Pulled By: jowlyzhang
fbshipit-source-id: f3daa622aa3ddefe7d673e0cc257a47699d506df
Summary:
After https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic in RocksDB and requires no manual compaction from the user. We are making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape); 2. compaction is more adaptive to write traffic; 3. automatic draining of unneeded levels. The wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size.
The PR mostly contains fixes for unit tests, as they assumed `level_compaction_dynamic_level_bytes=false`. The most notable changes are commits f742be330c and b1928e42b3, which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps reduce the changes needed for unit tests. I think this default option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desirable in unit tests as it makes it easier to create a desired LSM shape.
Comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and change made in https://github.com/facebook/rocksdb/issues/10057.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11525
Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change.
Reviewed By: ajkr
Differential Revision: D46654256
Pulled By: cbi42
fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
Summary:
The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317).
Flink uses rocksdb as the state backend, all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances.
In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one.
This PR is mainly used to quickly merge multiple DBs into one.
We hope to extend the functionality of `CreateColumnFamilyWithImport` to support creating a ColumnFamily by importing multiple ColumnFamilies with no overlapping keys.
The import logic is almost the same as `CreateColumnFamilyWithImport`, but it checks whether there is key overlap between the CFs when importing. The import fails if there are key overlaps.
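A hedged usage sketch of the extended API (the overload's exact signature is approximated from this description; see db.h for the authoritative declaration):
```
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/metadata.h"

// Import several exported column families with non-overlapping key ranges
// into a single new column family; the call fails if any key ranges overlap.
ROCKSDB_NAMESPACE::Status ImportMany(
    ROCKSDB_NAMESPACE::DB* db,
    const std::vector<const ROCKSDB_NAMESPACE::ExportImportFilesMetaData*>&
        metadatas,
    ROCKSDB_NAMESPACE::ColumnFamilyHandle** handle) {
  ROCKSDB_NAMESPACE::ColumnFamilyOptions cf_opts;
  ROCKSDB_NAMESPACE::ImportColumnFamilyOptions import_opts;
  import_opts.move_files = false;  // copy the SST files instead of moving them
  return db->CreateColumnFamilyWithImport(cf_opts, "merged_cf", import_opts,
                                          metadatas, handle);
}
```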
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378
Reviewed By: ajkr
Differential Revision: D46413709
Pulled By: cbi42
fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb
Summary:
Use another static object to join threads instead.
This change is motivated by a case in which some code using NewLRUCache() -> ShardedCacheBase -> SemiStructuredUniqueIdGen -> GenerateRawUniqueId() -> Env::Default() was happening
during static destruction.
I didn't see anything else in PosixEnv or base classes that would cause a problem by not
destroying. (WinEnv is already not destroyed; see env_default.cc)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11538
Test Plan:
test added, which would previously fail with UBSAN:
```
$ ./env_test --gtest_filter=*Destruct*
Note: Google Test filter = *Destruct*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EnvTestMisc
[ RUN ] EnvTestMisc.StaticDestruction
[ OK ] EnvTestMisc.StaticDestruction (0 ms)
[----------] 1 test from EnvTestMisc (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[ PASSED ] 1 test.
env/env_test.cc:3561:23: runtime error: member call on address 0x7f7b96671ca8 which does not point to an object of type 'rocksdb::Env'
0x7f7b96671ca8: note: object is of type 'N7rocksdb12ConfigurableE'
00 00 00 00 90 a7 f7 95 7b 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^~~~~~~~~~~~~~~~~~~~~~~
vptr for 'N7rocksdb12ConfigurableE'
UndefinedBehaviorSanitizer: undefined-behavior env/env_test.cc:3561:23 in
$
```
Reviewed By: jowlyzhang
Differential Revision: D46737389
Pulled By: pdillinger
fbshipit-source-id: 0f80a443bf799ffc5641e898cf3a75f7d10a987b
Summary:
When a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels. This can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue such that files in the last level are not included in any compaction.
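A hedged usage sketch of the feature in question: the DB must be opened with `allow_ingest_behind=true`, and files ingested with `ingest_behind=true` land in the reserved last level, which compactions now leave alone.
```
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

ROCKSDB_NAMESPACE::Status IngestBehindExample(
    ROCKSDB_NAMESPACE::DB* db, const std::vector<std::string>& sst_files) {
  ROCKSDB_NAMESPACE::IngestExternalFileOptions ingest_opts;
  ingest_opts.ingest_behind = true;  // place files at the reserved last level
  return db->IngestExternalFile(sst_files, ingest_opts);
}
```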
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489
Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that last level is not included in manual and auto-compaction.
Reviewed By: ajkr
Differential Revision: D46455711
Pulled By: cbi42
fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c
Summary:
See "unreleased_history/new_features/obsolete_sst_files_size.md" for description
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11533
Test Plan: updated unit test
Reviewed By: jowlyzhang
Differential Revision: D46703152
Pulled By: ajkr
fbshipit-source-id: ea5e31cd6293eccc154130c13e66b5271f57c102
Summary:
The first CI step "Check buck targets and code format..." is failing with the following error message:
```
Download action repository 'wei/wget@v1' (SHA:c15e476d1463f4936cb54f882170d9d631f1aba5)
Error: An action could not be found at the URI 'c15e476d14'
```
Not sure why the action is lost, but it seems we can use wget directly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11532
Test Plan: watch CI job "Check buck targets and code format" passes
Reviewed By: ltamasi
Differential Revision: D46700626
Pulled By: cbi42
fbshipit-source-id: 53c09f27965444b533b3fe3755aec922571bba2c
Summary:
Add new tickers: `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11509
Reviewed By: ltamasi
Differential Revision: D46542809
Pulled By: jowlyzhang
fbshipit-source-id: a2a6d8354af46a060de81d40ef6f5336a80bd32e
Summary:
The `rocksdb.db.get.micros` histogram and the `get_cpu_nanos` perf counter are never updated if the DB is opened in ReadOnly mode. An earlier PR (https://github.com/facebook/rocksdb/issues/4260) for some reason only added the TODO line, not the accounting itself, so this one is intended to fix that by adding two lines to match [DBImplSecondary](4dafa5b220/db/db_impl/db_impl_secondary.cc (L366-L367)).
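For context, a hedged sketch of how a user would observe the histogram once it is populated (assumes a read-only DB opened with a `Statistics` object attached):
```
#include <cstdio>
#include <memory>

#include "rocksdb/statistics.h"

void PrintGetLatency(
    const std::shared_ptr<ROCKSDB_NAMESPACE::Statistics>& stats) {
  ROCKSDB_NAMESPACE::HistogramData data;
  // DB_GET backs the "rocksdb.db.get.micros" histogram.
  stats->histogramData(ROCKSDB_NAMESPACE::DB_GET, &data);
  printf("get p50=%.2fus p99=%.2fus\n", data.median, data.percentile99);
}
```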
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11521
Reviewed By: cbi42
Differential Revision: D46577330
Pulled By: jowlyzhang
fbshipit-source-id: be147923e763af32bbc18fd6bdf3aff8ebf08aee
Summary:
Fix a use-after-move issue in block.cc and add some unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11505
Test Plan:
```
make all check
./block_test
```
Reviewed By: ltamasi
Differential Revision: D46506188
Pulled By: jowlyzhang
fbshipit-source-id: 316ed8ddd221c00b2bce2cf9fd47eea686cd74a5
Summary:
Support was added in 8.1.0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11517
Test Plan: comments only
Reviewed By: anand1976
Differential Revision: D46489929
Pulled By: pdillinger
fbshipit-source-id: 4fd30078389065c9ec225bf55b6773f1641f0646
Summary:
**Context:**
[PR11406](https://github.com/facebook/rocksdb/pull/11406/) caused more frequent reads during db open when reading files with no `tail_size` in the manifest, as part of the upgrade to 11406. This is because that PR introduced
- a [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers the tail size to be 0 (which usually happens when the stats do not have much historical data to infer from early on)
- more reads (up to the number of partitioned filters/indexes) when such a small prefetch tail buffer does not contain all the partitioned filters/indexes needed in CacheDependencies(), since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once will be [skipped](url) when such a small prefetch tail buffer is passed in
**Summary:**
- Revert the fallback prefetch buffer size change to preserve existing behavior fully during upgrading in `BlockBasedTable::PrefetchTail()`
- Use the passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the offset of the first partitioned filter/index, that is, only when it is at least as good as the existing prefetching behavior
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11516
Test Plan:
- db bench
Create db with small files prior to PR 11406
```
./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd
```
Read the db to see if post-PR has lower read qps (i.e., rocksdb.file.read.db.open.micros count) during db open.
```
./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
```
Pre-PR:
```
rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
```
Post-PR:
```
rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
```
_Note: To control the starting offset of the prefetch tail buffer more easily, I manually overrode the following to eliminate the effect of alignment_
```
class PosixRandomAccessFile : public FSRandomAccessFile {
virtual size_t GetRequiredBufferAlignment() const override {
- return logical_sector_size_;
+ return 1;
}
```
- CI
Reviewed By: pdillinger
Differential Revision: D46472566
Pulled By: hx235
fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
Summary:
Fix the test added in https://github.com/facebook/rocksdb/issues/11459 that is failing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11512
Test Plan: `./db_range_del_test --gtest_filter="*NonBottommostCompactionDropRangetombstone"`
Reviewed By: pdillinger
Differential Revision: D46451450
Pulled By: cbi42
fbshipit-source-id: bcad20b8fd21c4f71924cec6cb045ee4b2038b90
Summary:
Switch from std::unordered_map to RocksDB's UnorderedMap in all the places used for logging user-defined timestamp sizes in the WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11507
Test Plan:
```
make all check
```
Reviewed By: ltamasi
Differential Revision: D46448975
Pulled By: jowlyzhang
fbshipit-source-id: bdb4d56a723b697a33daaf0f856a61d49a367a99
Summary:
Similar to point tombstones, we can drop a range tombstone during compaction when we know its range does not exist in any higher level. This PR adds this optimization. Some existing tests in db_range_del_test are fixed to work under this optimization.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11459
Test Plan:
* Add unit test `DBRangeDelTest, NonBottommostCompactionDropRangetombstone`.
* Ran crash test that issues range deletion for a few hours: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --delrangepercent=10 --writepercent=31 --readpercent=40`
Reviewed By: ajkr
Differential Revision: D46007904
Pulled By: cbi42
fbshipit-source-id: 3f37205b6778b7d55ed106369ca41b0632a6d0fd
Summary:
Follow a couple best practices:
- Allowed Google Benchmark to decide the number of iterations (see the sketch after this list). Previously we hardcoded a value, which circumvented the benchmark's heuristic for iterating until the result is stable.
- Made each iteration do similar work. Previously, an iteration could do different work depending on whether the key was found in the first, second, third, or no L0 file.
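A hedged sketch of the first point (not the actual db_basic_bench code): let Google Benchmark choose the iteration count via the state loop, and keep the per-iteration work uniform.
```
#include <benchmark/benchmark.h>

static void BM_LookupSameKey(benchmark::State& state) {
  // ... set up the data once, outside the timed loop ...
  for (auto _ : state) {  // the framework decides how many iterations to run
    int found = 1;  // stand-in for looking up the same key every iteration
    benchmark::DoNotOptimize(found);
  }
}
BENCHMARK(BM_LookupSameKey);
BENCHMARK_MAIN();
```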
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11498
Test Plan: none as I am unable to prove it is better
Reviewed By: hx235
Differential Revision: D46339050
Pulled By: ajkr
fbshipit-source-id: fcfc6da4111c5b3ae86d79d908afc5f61f96675b
Summary:
Currently, for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the max level whose key range overlaps with the manual compaction key range. However, during manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition more unlikely (it can still happen, see more in the comment here: 796f58f42a/db/db_impl/db_impl_compaction_flush.cc (L1180C29-L1184)).
This PR also changes the behavior of CompactRange() when `bottommost_level_compaction=kIfHaveCompactionFilter` (the default option). The old behavior is that, if a compaction filter is provided, CompactRange() always does an intra-level compaction at the final output level for all files in the manual compaction key range. The only exception is when `first_overlapped_level = 0` and `max_overlapped_level = 0`. It's awkward to maintain the same behavior after this PR since we do not compute max_overlapped_level anymore. So the new behavior is similar to kForceOptimized: always do an intra-level compaction at the bottommost level, but not including new files generated during this manual compaction.
Several unit tests are updated to work with this new manual compaction behavior.
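A hedged usage sketch related to the new behavior (standard public API; the option value shown is just one possibility):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// CompactRange() now compacts the requested range down to the bottommost
// level; bottommost_level_compaction still controls what happens to files
// already sitting at that level.
ROCKSDB_NAMESPACE::Status CompactWholeRange(
    ROCKSDB_NAMESPACE::DB* db, const ROCKSDB_NAMESPACE::Slice& begin,
    const ROCKSDB_NAMESPACE::Slice& end) {
  ROCKSDB_NAMESPACE::CompactRangeOptions cro;
  cro.bottommost_level_compaction =
      ROCKSDB_NAMESPACE::BottommostLevelCompaction::kForceOptimized;
  return db->CompactRange(cro, &begin, &end);
}
```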
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11468
Test Plan: Add new unit tests `DBCompactionTest.ManualCompactionCompactAllKeysInRange*`
Reviewed By: ajkr
Differential Revision: D46079619
Pulled By: cbi42
fbshipit-source-id: 19d844ba4ec8dc1a0b8af5d2f36ff15820c6e76f
Summary:
Add support to strip timestamp in block based table builder and pad timestamp in block based table reader.
On the write path, use the per column family option `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` to indicate whether user-defined timestamps should be stripped for all block based tables created for the column family.
On the read path, added a per-table `TableReadOption.user_defined_timestamps_persisted` to flag whether the user keys in the table contain user-defined timestamps.
This patch is mostly about passing the related flags down to the block building/parsing level, with the exception of handling the `first_internal_key` in `IndexValue`, which is handled at the `IndexBuilder` level. The value part of range deletion entries should get similar handling; I haven't decided where this piece of logic best fits, so I will do it in a follow-up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11495
Test Plan:
Existing test `BlockBasedTableReaderTest` is parameterized to run with:
1) different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp
2) all four index types, when index type is `kTwoLevelIndexSearch`, also enables partitioned filters
3) parallel vs non-parallel compression
4) enable/disable compression dictionary.
Also added tests for API `BlockBasedTableReader::NewIterator`.
`PartitionedFilterBlockTest` is parameterized to run with different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp.
```
make all check
./block_based_table_reader_test
./partitioned_filter_block_test
```
Reviewed By: ltamasi
Differential Revision: D46344577
Pulled By: jowlyzhang
fbshipit-source-id: 93ac8542b19319d1298712b8bed908c8831ba675
Summary:
I got the following errors when running `unreleased_history/release.sh` on my Mac. This is because macOS does not have the GNU versions of awk and find by default. This PR updates the script to work on macOS.
```
awk: calling undefined function strftime
input record number 43, file
source line number 4
find: -regextype: unknown primary or operator
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11494
Test Plan: manually run `DRY_RUN=1 unreleased_history/release.sh | less` on macOS and CentOS8 machines.
Reviewed By: ajkr
Differential Revision: D46328442
Pulled By: cbi42
fbshipit-source-id: a7570cd3480fcd25ac1438beb0d59fe655f9a71a