rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-27 11:43:49 +00:00

Author	SHA1	Message	Date
Changyu Bi	b0e190604b	Update VersionSet last seqno after LogAndApply (#10051 ) Summary: This PR fixes the issue of unstable snapshot during external SST file ingestion. Credit ajkr for the following walk through: consider these relevant steps for of IngestExternalFile(): (1) increase seqno while holding mutex -- `677d2b4a8f/db/db_impl/db_impl.cc (L4768)` (2) LogAndApply() -- `677d2b4a8f/db/db_impl/db_impl.cc (L4797-L4798)` (a) write to MANIFEST with mutex released `a96a4a2f7b/db/version_set.cc (L4407)` (b) apply to in-memory state with mutex held A snapshot taken during (2a) will be unstable. In particular, queries against that snapshot will not include data from the ingested file before (2b), and will include data from the ingested file after (2b). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10051 Test Plan: Added a new unit test: `ExternalSSTFileBasicTest.WriteAfterReopenStableSnapshotWhileLoggingToManifest`. ``` make external_sst_file_basic_test ./external_sst_file_basic_test ``` Reviewed By: ajkr Differential Revision: D36654033 Pulled By: cbi42 fbshipit-source-id: bf720cca313e0cf211585960f3aff04853a31b96	2022-05-25 10:05:17 -07:00
Yu Zhang	d4081bf0be	Add timestamp support to CompactedDBImpl (#10030 ) Summary: This PR is the second and last part for adding user defined timestamp support to read only DB. Specifically, the change in this PR includes: - `options.timestamp` respected by `CompactedDBImpl::Get` and `CompactedDBImpl::MultiGet` to return results visible up till that timestamp. - `CompactedDBImpl::Get(...,std::string* timestsamp)` and `CompactedDBImpl::MultiGet(std::vector<std::string>* timestamps)` return the timestamp(s) associated with the key(s). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10030 Test Plan: ``` $COMPILE_WITH_ASAN=1 make -j24 all $./db_readonly_with_timestamp_test --gtest_filter="DBReadOnlyTestWithTimestamp.CompactedDB" $./db_basic_test --gtest_filter="DBBasicTest.CompactedDB" $make all check ``` Reviewed By: riversand963 Differential Revision: D36613926 Pulled By: jowlyzhang fbshipit-source-id: 5b7ed7fef822708c12e2caf7a8d2deb6a696f0f0	2022-05-24 12:14:10 -07:00
sdong	bea5831bff	Move three info logging within DB Mutex to use log buffer (#10029 ) Summary: info logging with DB Mutex could potentially invoke I/O and cause performance issues. Move three of the cases to use log buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10029 Test Plan: Run existing tests. Reviewed By: jay-zhuang Differential Revision: D36561694 fbshipit-source-id: cabb93fea299001a6b4c2802fcba3fde27fa062c	2022-05-23 10:09:37 -07:00
Yu Zhang	16bdb1f999	Add timestamp support to DBImplReadOnly (#10004 ) Summary: This PR adds timestamp support to a read only DB instance opened as `DBImplReadOnly`. A follow up PR will add the same support to `CompactedDBImpl`. With this, read only database has these timestamp related APIs: `ReadOptions.timestamp` : read should return the latest data visible to this specified timestamp `Iterator::timestamp()` : returns the timestamp associated with the key, value `DB:Get(..., std::string* timestamp)` : returns the timestamp associated with the key, value in `timestamp` Test plan (on devserver): ``` $COMPILE_WITH_ASAN=1 make -j24 all $./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.ReadOnlyDB* ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10004 Reviewed By: riversand963 Differential Revision: D36434422 Pulled By: jowlyzhang fbshipit-source-id: 5d949e65b1ffb845758000e2b310fdd4aae71cfb	2022-05-19 18:39:41 -07:00
anand76	57997ddaaf	Multi file concurrency in MultiGet using coroutines and async IO (#9968 ) Summary: This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code. A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest. TODO: 1. Figure out how to build it in CircleCI (requires some dependencies to be installed) 2. Do some stress testing with coroutines enabled No regression in synchronous MultiGet between this branch and main - ``` ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics ``` Branch - ```multireadrandom : 4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)``` Main - ```multireadrandom : 3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)``` More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file. 1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) - No coroutines - ```multireadrandom : 831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations; 0.6 MB/s (72136 of 72136 found)``` Using coroutines - ```multireadrandom : 318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations; 1.6 MB/s (188248 of 188248 found)``` 2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file - No coroutines - ```multireadrandom : 4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations; 125.7 MB/s (14539384 of 14539384 found)``` Using coroutines - ```multireadrandom : 4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations; 109.4 MB/s (12656176 of 12656176 found)``` 3. Single thread CPU bound workload with ~2 key overlap/file - No coroutines - ```multireadrandom : 3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations; 139.6 MB/s (16140024 of 16140024 found)``` Using coroutines - ```multireadrandom : 4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations; 125.1 MB/s (14472296 of 14472296 found)``` 4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file - No coroutines - ```multireadrandom : 4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ``` Using coroutines - ```multireadrandom : 4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968 Reviewed By: akankshamahajan15 Differential Revision: D36348563 Pulled By: anand1976 fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696	2022-05-19 15:36:27 -07:00
Jay Zhuang	c6d326d3d7	Track SST unique id in MANIFEST and verify (#9990 ) Summary: Start tracking SST unique id in MANIFEST, which is used to verify with SST properties to make sure the SST file is not overwritten or misplaced. A DB option `try_verify_sst_unique_id` is introduced to enable/disable the verification, if enabled, it opens all SST files during DB-open to read the unique_id from table properties (default is false), so it's recommended to use it with `max_open_files = -1` to pre-open the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990 Test Plan: unittests, format-compatible test, mini-crash Reviewed By: anand1976 Differential Revision: D36381863 Pulled By: jay-zhuang fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f	2022-05-19 11:04:21 -07:00
Jay Zhuang	b84e3363f5	Add table_properties_collector_factories override (#9995 ) Summary: Add table_properties_collector_factories override on the remote side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9995 Test Plan: unittest added Reviewed By: ajkr Differential Revision: D36392623 Pulled By: jay-zhuang fbshipit-source-id: 3ba031294d90247ca063d7de7b43178d38e3f66a	2022-05-17 20:57:51 -07:00
Yanqin Jin	b58a1a035b	Revert "Bugfix/fix manual flush blocking bug (#9893 )" (#9992 ) Summary: This reverts commit `6d2577e567`. A proposal for resolving our current internal test failures. A fix is being planned. More context can be found: https://github.com/facebook/rocksdb/pull/9893#issuecomment-1126230634 TSAN error: https://github.com/facebook/rocksdb/pull/9893#issuecomment-1126233132 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9992 Reviewed By: siying Differential Revision: D36379154 Pulled By: riversand963 fbshipit-source-id: b240261e766eff099513799cf5631832093f4cd2	2022-05-13 12:31:30 -07:00
Akanksha Mahajan	6442a62e46	Update WAL corruption test so that it fails without fix (#9942 ) Summary: In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't flush the data from WAL to L0 for all column families if possible. As a result, not all column families can increase their log_numbers, and min_log_number_to_keep won't change. For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change. If we persist a new MANIFEST with advanced log_numbers for some column families, then during a second crash after persisting the MANIFEST, RocksDB will see some column families' log_numbers larger than the corrupted WAL, and the "column family inconsistency" error will be hit, causing recovery to fail. This PR update unit tests to emulate the errors and tests are failing without a fix. Error: ``` [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/0 db/corruption_test.cc:1190: Failure DB::Open(options, dbname_, cf_descs, &handles, &db_) Corruption: SST file is ahead of WALs in CF test_cf [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/0, where GetParam() = (true, false) (91 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/1 db/corruption_test.cc:1190: Failure DB::Open(options, dbname_, cf_descs, &handles, &db_) Corruption: SST file is ahead of WALs in CF test_cf [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/1, where GetParam() = (false, false) (92 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/2 db/corruption_test.cc:1190: Failure DB::Open(options, dbname_, cf_descs, &handles, &db_) Corruption: SST file is ahead of WALs in CF test_cf [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/2, where GetParam() = (true, true) (95 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/3 db/corruption_test.cc:1190: Failure DB::Open(options, dbname_, cf_descs, &handles, &db_) Corruption: SST file is ahead of WALs in CF test_cf [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/3, where GetParam() = (false, true) (92 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/0 db/corruption_test.cc:1354: Failure TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db) Corruption: SST file is ahead of WALs in CF default [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/0, where GetParam() = (true, false) (94 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/1 db/corruption_test.cc:1354: Failure TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db) Corruption: SST file is ahead of WALs in CF default [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/1, where GetParam() = (false, false) (97 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/2 db/corruption_test.cc:1354: Failure TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db) Corruption: SST file is ahead of WALs in CF default [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/2, where GetParam() = (true, true) (94 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/3 db/corruption_test.cc:1354: Failure TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db) Corruption: SST file is ahead of WALs in CF default [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/3, where GetParam() = (false, true) (91 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/0 db/corruption_test.cc:1483: Failure DB::Open(options, dbname_, cf_descs, &handles, &db_) Corruption: SST file is ahead of WALs in CF default [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/0, where GetParam() = (true, false) (93 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/1 db/corruption_test.cc:1483: Failure DB::Open(options, dbname_, cf_descs, &handles, &db_) Corruption: SST file is ahead of WALs in CF default [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/1, where GetParam() = (false, false) (94 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/2 db/corruption_test.cc:1483: Failure DB::Open(options, dbname_, cf_descs, &handles, &db_) Corruption: SST file is ahead of WALs in CF default [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/2, where GetParam() = (true, true) (90 ms) [ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/3 db/corruption_test.cc:1483: Failure DB::Open(options, dbname_, cf_descs, &handles, &db_) Corruption: SST file is ahead of WALs in CF default [ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/3, where GetParam() = (false, true) (93 ms) [----------] 12 tests from CorruptionTest/CrashDuringRecoveryWithCorruptionTest (1116 ms total) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9942 Test Plan: Not needed Reviewed By: riversand963 Differential Revision: D36324112 Pulled By: akankshamahajan15 fbshipit-source-id: cab2075ac4ebe48f5ef93a6ea162558aa4fc334d	2022-05-11 16:12:55 -07:00
sdong	736a7b5433	Remove own ToString() (#9955 ) Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471	2022-05-06 13:03:58 -07:00
Otto Kekäläinen	b7aaa98762	Fix various spelling errors still found in code (#9653 ) Summary: dont -> don't refered -> referred This is a re-run of PR#7785 and `acc9679` since these typos keep coming back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9653 Reviewed By: jay-zhuang Differential Revision: D34879593 fbshipit-source-id: d7631fb779ea0129beae92abfb838038e60790f8	2022-05-05 19:45:32 -07:00
sdong	49628c9a83	Use std::numeric_limits<> (#9954 ) Summary: Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954 Test Plan: See CI Runs. Reviewed By: riversand963 Differential Revision: D36173954 fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0	2022-05-05 13:08:21 -07:00
Changyu Bi	37f490834d	Specify largest_seqno in VerifyChecksum (#9919 ) Summary: `VerifyChecksum()` does not specify `largest_seqno` when creating a `TableReader`. As a result, the `TableReader` uses the `TableReaderOptions` default value (0) for `largest_seqno`. This causes the following error when the file has a nonzero global seqno in its properties: ``` Corruption: An external sst file with version 2 have global seqno property with value , while largest seqno in the file is 0 ``` This PR fixes this by specifying `largest_seqno` in `VerifyChecksumInternal` with `largest_seqno` from the file metadata. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9919 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D36028824 Pulled By: cbi42 fbshipit-source-id: 428d028a79386f46ef97bb6b6051dc76c83e1f2b	2022-05-02 10:22:08 -07:00
Anvesh Komuravelli	aafb377bb5	Update protection info on recovered logs data (#9875 ) Summary: Update protection info on recovered logs data Pull Request resolved: https://github.com/facebook/rocksdb/pull/9875 Test Plan: - Benchmark setup: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000` - Benchmark command: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true` - Results before this PR ``` OpenDb: 2350.14 milliseconds OpenDb: 2296.94 milliseconds OpenDb: 2184.29 milliseconds OpenDb: 2167.59 milliseconds OpenDb: 2231.24 milliseconds OpenDb: 2109.57 milliseconds OpenDb: 2197.71 milliseconds OpenDb: 2120.8 milliseconds OpenDb: 2148.12 milliseconds OpenDb: 2207.95 milliseconds ``` - Results after this PR ``` OpenDb: 2424.52 milliseconds OpenDb: 2359.84 milliseconds OpenDb: 2317.68 milliseconds OpenDb: 2339.4 milliseconds OpenDb: 2325.36 milliseconds OpenDb: 2321.06 milliseconds OpenDb: 2353.98 milliseconds OpenDb: 2344.64 milliseconds OpenDb: 2384.09 milliseconds OpenDb: 2428.58 milliseconds ``` Mean regressed 7.2% (2201.4 -> 2359.9) Reviewed By: ajkr Differential Revision: D36012787 Pulled By: akomurav fbshipit-source-id: d2aba09f29c6beb2fd0fe8e1e359be910b4ef02a	2022-04-28 14:42:00 -07:00
Andrew Kryczka	c5d367f472	Revert open logic changes in #9634 (#9906 ) Summary: Left HISTORY.md and unit tests. Added a new unit test to repro the corruption scenario that this PR fixes, and HISTORY.md line for that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9906 Reviewed By: riversand963 Differential Revision: D35940093 Pulled By: ajkr fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b	2022-04-26 14:46:53 -07:00
RoeyMaor	6d2577e567	Bugfix/fix manual flush blocking bug (#9893 ) Summary: Fix https://github.com/facebook/rocksdb/issues/9892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9893 Reviewed By: jay-zhuang Differential Revision: D35880959 Pulled By: ajkr fbshipit-source-id: dad1139ad0983cfbd5c5cd6fa6b71022f889735a	2022-04-25 18:52:33 -07:00
Yanqin Jin	fe63899d1a	Add checks to GetUpdatesSince (#9459 ) Summary: Make `DB::GetUpdatesSince` return early if told to scan WALs generated by transactions with write-prepared or write-unprepared policies (`seq_per_batch` is true), as indicated by API comment. Also add checks to `TransactionLogIterator` to clarify some conditions. No API change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9459 Test Plan: make check Closing https://github.com/facebook/rocksdb/issues/1565 Reviewed By: akankshamahajan15 Differential Revision: D33821243 Pulled By: riversand963 fbshipit-source-id: c8b155d020ce0980e2d3b3b1da40b96e65b48d79	2022-04-14 17:12:16 -07:00
Yanqin Jin	0bd4dcde6b	CompactionIterator sees consistent view of which keys are committed (#9830 ) Summary: This PR does not affect the functionality of `DB` and write-committed transactions. `CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed. As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if it is committed. In fact, the implementation of `KeyCommitted()` is as follows: ``` inline bool KeyCommitted(SequenceNumber seq) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr \|\| snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot; } ``` With that being said, we focus on write-prepared/write-unprepared transactions. A few notes: - A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database. - `CompactionIterator` outputs a key as long as the key is uncommitted. Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`. Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone. To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot to determine whether a key is committed or not with minor change to `KeyCommitted()`. ``` inline bool KeyCommitted(SequenceNumber sequence) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr \|\| snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) == SnapshotCheckerResult::kInSnapshot; } ``` As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble for `CompactionIterator`s assertions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35561162 Pulled By: riversand963 fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f	2022-04-14 11:11:04 -07:00
Jay Zhuang	dc1c90c4e3	Support canceling running RemoteCompaction on remote side (#9725 ) Summary: Add the ability to cancel remote compaction on the remote side by setting `OpenAndCompactOptions.canceled` to true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9725 Test Plan: added unittest Reviewed By: ajkr Differential Revision: D35018800 Pulled By: jay-zhuang fbshipit-source-id: be3652f9645e0347df429e42a5614d5a9b3a1ec4	2022-04-13 13:28:09 -07:00
Peter Dillinger	efd035164b	Meta-internal folly integration with F14FastMap (#9546 ) Summary: Especially after updating to C++17, I don't see a compelling case for requiring any folly components in RocksDB. I was able to purge the existing hard dependencies, and it can be quite difficult to strip out non-trivial components from folly for use in RocksDB. (The prospect of doing that on F14 has changed my mind on the best approach here.) But this change creates an optional integration where we can plug in components from folly at compile time, starting here with F14FastMap to replace std::unordered_map when possible (probably no public APIs for example). I have replaced the biggest CPU users of std::unordered_map with compile-time pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set. USE_FOLLY is always set in the Meta-internal buck build, and a simulation of that is in the Makefile for public CI testing. A full folly build is not needed, but checking out the full folly repo is much simpler for getting the dependency, and anything else we might want to optionally integrate in the future. Some picky details: * I don't think the distributed mutex stuff is actually used, so it was easy to remove. * I implemented an alternative to `folly::constexpr_log2` (which is much easier in C++17 than C++11) so that I could pull out the hard dependencies on `ConstexprMath.h` * I had to add noexcept move constructors/operators to some types to make F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a macro to make that easier in some common cases. * Updated Meta-internal buck build to use folly F14Map (always) No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a production integration for open source users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546 Test Plan: CircleCI tests updated so that a couple of them use folly. Most internal unit & stress/crash tests updated to use Meta-internal latest folly. (Note: they should probably use buck but they currently use Makefile.) Example performance improvement: when filter partitions are pinned in cache, they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build a test that exercises that heavily. Build DB with ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters ``` and test with (simultaneous runs with & without folly, ~20 times each to see convergence) ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache ``` Average ops/s no folly: 26229.2 Average ops/s with folly: 26853.3 (+2.4%) Reviewed By: ajkr Differential Revision: D34181736 Pulled By: pdillinger fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94	2022-04-13 07:34:01 -07:00
Jay Zhuang	f934a0af46	Add event listener support on remote compactor side (#9821 ) Summary: So the user is able to set event listener on the compactor side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9821 Test Plan: unittest added Reviewed By: ajkr Differential Revision: D35485388 Pulled By: jay-zhuang fbshipit-source-id: 669d8a3aaee012b75b940470306756c03ffa09b2	2022-04-12 17:25:36 -07:00
Akanksha Mahajan	ae82d91492	Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_duing_recovery set true (#9634 ) Summary: 1) In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't flush the data from WAL to L0 for all column families if possible. As a result, not all column families can increase their log_numbers, and min_log_number_to_keep won't change. 2) For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change. If we persist a new MANIFEST with advanced log_numbers for some column families, then during a second crash after persisting the MANIFEST, RocksDB will see some column families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail. As a solution, 1. the corrupted WALs whose numbers are larger than the corrupted wal and smaller than the new WAL will be moved to archive folder. 2. Currently, RocksDB DB::Open() may creates and writes to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful Pull Request resolved: https://github.com/facebook/rocksdb/pull/9634 Test Plan: 1. Added new unit tests 2. make crast_test -j Reviewed By: riversand963 Differential Revision: D34463666 Pulled By: akankshamahajan15 fbshipit-source-id: e233d3af0ed4e2028ca0cf051e5a334a0fdc9d19	2022-04-11 15:39:31 -07:00
Yanqin Jin	0ad9ee30ce	Remove dead code (#9825 ) Summary: Options `preserve_deletes` and `iter_start_seqnum` have been removed since 7.0. This PR removes dead code related to these two removed options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9825 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D35517950 Pulled By: riversand963 fbshipit-source-id: 86282ce5ec4087acb94a06a42a1b6d55b1715482	2022-04-11 10:26:55 -07:00
Andrew Kryczka	04623e7cd4	Fix GetMergeOperands() heap-use-after-free on flushed memtable (#9805 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/9066. Prior to the fix in this PR, this PR's unit test reported the following error under ASAN: ``` ==2175705==ERROR: AddressSanitizer: heap-use-after-free on address 0x61f0000012a5 at pc 0x7f0fc36e76ce bp 0x7ffc103e9ca0 sp 0x7ffc103e9450 READ of size 5 at 0x61f0000012a5 thread T0 #0 0x7f0fc36e76cd in __interceptor_memcpy /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:790 https://github.com/facebook/rocksdb/issues/1 0x7f0fc35a207e in std::char_traits<char>::copy(char, char const, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/char_traits.h:365 https://github.com/facebook/rocksdb/issues/2 0x7f0fc35a207e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy(char, char const, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/basic_string.h:351 https://github.com/facebook/rocksdb/issues/3 0x7f0fc35a207e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const, unsigned long) /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/include/bits/basic_string.tcc:440 https://github.com/facebook/rocksdb/issues/4 0x8679ca in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(char const, unsigned long) /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.3.0/bits/basic_string.h:1422 https://github.com/facebook/rocksdb/issues/5 0x8679ca in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171 https://github.com/facebook/rocksdb/issues/6 0x8679ca in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1930 https://github.com/facebook/rocksdb/issues/7 0x547324 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle, rocksdb::Slice const&, rocksdb::PinnableSlice, rocksdb::GetMergeOperandsOptions, int) db/db_impl/db_impl.h:203 https://github.com/facebook/rocksdb/issues/8 0x547324 in rocksdb::DBMergeOperandTest_FlushedMergeOperandReadAfterFreeBug_Test::TestBody() db/db_merge_operand_test.cc:117 https://github.com/facebook/rocksdb/issues/9 0x7241da in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899 https://github.com/facebook/rocksdb/issues/10 0x7241da in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935 https://github.com/facebook/rocksdb/issues/11 0x701a47 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973 https://github.com/facebook/rocksdb/issues/12 0x702040 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3965 https://github.com/facebook/rocksdb/issues/13 0x702040 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149 https://github.com/facebook/rocksdb/issues/14 0x7025f7 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124 https://github.com/facebook/rocksdb/issues/15 0x7025f7 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267 https://github.com/facebook/rocksdb/issues/16 0x704217 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253 https://github.com/facebook/rocksdb/issues/17 0x704217 in testing::internal::UnitTestImpl::RunAllTests() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633 https://github.com/facebook/rocksdb/issues/18 0x72505a in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl, bool (testing::internal::UnitTestImpl::)(), char const) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899 https://github.com/facebook/rocksdb/issues/19 0x72505a in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl, bool (testing::internal::UnitTestImpl::)(), char const) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935 https://github.com/facebook/rocksdb/issues/20 0x704aa1 in testing::UnitTest::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6242 https://github.com/facebook/rocksdb/issues/21 0x4c4aff in RUN_ALL_TESTS() third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22110 https://github.com/facebook/rocksdb/issues/22 0x4c4aff in main db/db_merge_operand_test.cc:404 https://github.com/facebook/rocksdb/issues/23 0x7f0fc3108dc4 in __libc_start_main ../csu/libc-start.c:308 https://github.com/facebook/rocksdb/issues/24 0x5445fd in _start (/data/users/andrewkr/rocksdb/db_merge_operand_test+0x5445fd) 0x61f0000012a5 is located 1061 bytes inside of 3264-byte region [0x61f000000e80,0x61f000001b40) freed by thread T0 here: #0 0x7f0fc375b6af in operator delete(void, unsigned long) /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/libsanitizer/asan/asan_new_delete.cc:177 https://github.com/facebook/rocksdb/issues/1 0x743be8 in rocksdb::SuperVersion::~SuperVersion() db/column_family.cc:432 https://github.com/facebook/rocksdb/issues/2 0x8052aa in rocksdb::DBImpl::CleanupSuperVersion(rocksdb::SuperVersion) db/db_impl/db_impl.cc:3534 https://github.com/facebook/rocksdb/issues/3 0x8676c2 in rocksdb::DBImpl::ReturnAndCleanupSuperVersion(rocksdb::ColumnFamilyData, rocksdb::SuperVersion) db/db_impl/db_impl.cc:3544 https://github.com/facebook/rocksdb/issues/4 0x8676c2 in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1911 https://github.com/facebook/rocksdb/issues/5 0x547324 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle, rocksdb::Slice const&, rocksdb::PinnableSlice, rocksdb::GetMergeOperandsOptions, int) db/db_impl/db_impl.h:203 https://github.com/facebook/rocksdb/issues/6 0x547324 in rocksdb::DBMergeOperandTest_FlushedMergeOperandReadAfterFreeBug_Test::TestBody() db/db_merge_operand_test.cc:117 https://github.com/facebook/rocksdb/issues/7 0x7241da in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899 https://github.com/facebook/rocksdb/issues/8 0x7241da in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935 https://github.com/facebook/rocksdb/issues/9 0x701a47 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973 https://github.com/facebook/rocksdb/issues/10 0x702040 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3965 https://github.com/facebook/rocksdb/issues/11 0x702040 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149 https://github.com/facebook/rocksdb/issues/12 0x7025f7 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124 https://github.com/facebook/rocksdb/issues/13 0x7025f7 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267 https://github.com/facebook/rocksdb/issues/14 0x704217 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253 https://github.com/facebook/rocksdb/issues/15 0x704217 in testing::internal::UnitTestImpl::RunAllTests() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633 https://github.com/facebook/rocksdb/issues/16 0x72505a in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl, bool (testing::internal::UnitTestImpl::)(), char const) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899 https://github.com/facebook/rocksdb/issues/17 0x72505a in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl, bool (testing::internal::UnitTestImpl::)(), char const) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935 https://github.com/facebook/rocksdb/issues/18 0x704aa1 in testing::UnitTest::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6242 https://github.com/facebook/rocksdb/issues/19 0x4c4aff in RUN_ALL_TESTS() third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22110 https://github.com/facebook/rocksdb/issues/20 0x4c4aff in main db/db_merge_operand_test.cc:404 https://github.com/facebook/rocksdb/issues/21 0x7f0fc3108dc4 in __libc_start_main ../csu/libc-start.c:308 https://github.com/facebook/rocksdb/issues/22 0x5445fd in _start (/data/users/andrewkr/rocksdb/db_merge_operand_test+0x5445fd) ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9805 Test Plan: following the fix in this PR, the new unit test passes Reviewed By: jay-zhuang Differential Revision: D35388415 Pulled By: ajkr fbshipit-source-id: b39c5d002155906c8abc4a3429eca696dbf916d0	2022-04-05 12:26:36 -07:00
Yanqin Jin	1a1c5bda23	Disallow commit-time-batch for write-prepared/write-unprepared txn conditionally (#9794 ) Summary: For write-prepared/write-unprepared transactions, GetCommitTimeWriteBatch() can be used only if the transaction is started with `TransactionOptions::use_only_the_last_commit_time_batch_for_recovery` set to true. Otherwise, it is possible that multiple uncommitted versions of the same key exist in the database. During bottommost compaction, RocksDB may set the sequence numbers of both to zero once they become committed, causing output SST file to have two identical internal keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9794 Test Plan: make check pay special attention to the following ``` transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/* ``` Reviewed By: lth Differential Revision: D35327214 Pulled By: riversand963 fbshipit-source-id: 3bae00a28359c10e96e4c6f676d20de5610d8a0f	2022-04-05 11:10:20 -07:00
sdong	bbcf7b192c	Fix DB::Open() error logging (#9784 ) Summary: Right now we log a wrong error when DB::Open() fails. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9784 Test Plan: CI runs should pass Reviewed By: ajkr, riversand963 Differential Revision: D35290203 fbshipit-source-id: ffc640afa27f6b0a2382ee153dc43f28d9e242be	2022-03-31 15:52:01 -07:00
Yanqin Jin	de9df6e818	Do not release and re-acquire dbmutex on memtable-switch if no listener (#9758 ) Summary: There is no need to release-and-acquire immediately when no listener is registered. This is what we have been doing for `NotifyOnFlushBegin()`, `NotifyOnFlushCompleted()`, `NotifyOnCompactionBegin()`, `NotifyOnCompactionCompleted()`, and some other `NotifyOnXX` methods in event_helpers.cc. Do the same for `NotifyOnMemTableSealed ()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9758 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D35159552 Pulled By: riversand963 fbshipit-source-id: 6e0aac50bd5c8f506d809b6638c33a7a28d1e87f	2022-03-30 20:48:23 -07:00
Peter Dillinger	105d7f0c7c	Document SetOptions API (#9778 ) Summary: much needed Some other minor tweaks also Pull Request resolved: https://github.com/facebook/rocksdb/pull/9778 Test Plan: existing tests Reviewed By: ajkr Differential Revision: D35258195 Pulled By: pdillinger fbshipit-source-id: 974ddafc23a540aacceb91da72e81593d818f99c	2022-03-30 14:51:12 -07:00
Peter Dillinger	cad809978a	Fix heap use-after-free race with DropColumnFamily (#9730 ) Summary: Although ColumnFamilySet comments say that DB mutex can be freed during iteration, as long as you hold a ref while releasing DB mutex, this is not quite true because UnrefAndTryDelete might delete cfd right before it is needed to get ->next_ for the next iteration of the loop. This change solves the problem by making a wrapper class that makes such iteration easier while handling the tricky details of UnrefAndTryDelete on the previous cfd only after getting next_ in operator++. FreeDeadColumnFamilies should already have been obsolete; this removes it for good. Similarly, ColumnFamilySet::iterator doesn't need to check for cfd with 0 refs, because those are immediately deleted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9730 Test Plan: was reported with ASAN on unit tests like DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching Reviewed By: ltamasi Differential Revision: D35038143 Pulled By: pdillinger fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9	2022-03-24 13:05:17 -07:00
Yanqin Jin	e0c84aa0dc	Fix a race condition in WAL tracking causing DB open failure (#9715 ) Summary: There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC. The race condition is between two background flush threads trying to install flush results to the MANIFEST. Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially, both column families have one mutable (active) memtable whose data backed by 6.log. 1. Trigger a manual flush for "cf1", creating a 7.log 2. Insert another key to "default", and trigger flush for "default", creating 8.log 3. BgFlushThread1 finishes writing 9.sst 4. BgFlushThread2 finishes writing 10.sst ``` Time BgFlushThread1 BgFlushThread2 \| mutex_.Lock() \| precompute min_wal_to_keep as 6 \| mutex_.Unlock() \| mutex_.Lock() \| precompute min_wal_to_keep as 6 \| join MANIFEST write queue and mutex_.Unlock() \| write to MANIFEST \| mutex_.Lock() \| cfd1->log_number = 7 \| Signal bg_flush_2 and mutex_.Unlock() \| wake up and mutex_.Lock() \| cfd0->log_number = 8 \| FindObsoleteFiles() with job_context->log_number == 7 \| mutex_.Unlock() \| PurgeObsoleteFiles() deletes 6.log V ``` As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6). Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6). No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`, due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514. The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e. the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist. If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true. We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know the correct min wal number until the other bg flush threads have finished committing to the manifest and updated the `cfd::log_number`. To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`, and use it to track WAL file deletion in non-2pc mode as well. This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread. `min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715 Test Plan: ``` make check ``` Also ran stress test below (with asan) to make sure it completes successfully. ``` TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \ CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \ make J=52 -j52 blackbox_asan_crash_test ``` Reviewed By: ltamasi Differential Revision: D34984412 Pulled By: riversand963 fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005	2022-03-23 19:41:31 -07:00
Yanqin Jin	29bec740f5	Return invalid argument if batch is null (#9744 ) Summary: Originally, a corruption will be returned by `DBImpl::WriteImpl(batch...)` if batch is null. This is inaccurate since there is no data corruption. Return `Status::InvalidArgument()` instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9744 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35086268 Pulled By: riversand963 fbshipit-source-id: 677397b007a53bc25210eac0178d49c9797b5951	2022-03-23 14:28:13 -07:00
duyuqi	cbe303c19b	fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471 ) Summary: c api release snapshot will core dump when enable inplace_update_support and create snapshot Pull Request resolved: https://github.com/facebook/rocksdb/pull/9471 Reviewed By: akankshamahajan15 Differential Revision: D34965103 Pulled By: riversand963 fbshipit-source-id: c3aeeb9ea7126c2eda1466102794fecf57b6ab77	2022-03-21 12:04:33 -07:00
Yanqin Jin	b2aacaf923	Fix assertion error by doing comparison with mutex (#9717 ) Summary: On CircleCI MacOS instances, we have been seeing the following assertion error: ``` Assertion failed: (alive_log_files_tail_ == alive_log_files_.rbegin()), function WriteToWAL, file /Users/distiller/project/db/db_impl/db_impl_write.cc, line 1213. Received signal 6 (Abort trap: 6) #0 0x1 https://github.com/facebook/rocksdb/issues/1 abort (in libsystem_c.dylib) + 120 https://github.com/facebook/rocksdb/issues/2 err (in libsystem_c.dylib) + 0 https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::WriteToWAL(rocksdb::WriteBatch const&, rocksdb::log::Writer, unsigned long long, unsigned long long, rocksdb::Env::IOPriority, bool, bool) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1213) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer, unsigned long long, bool, bool, unsigned long long) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:1251) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch, rocksdb::WriteCallback, unsigned long long, unsigned long long, bool, unsigned long long, unsigned long, rocksdb::PreReleaseCallback) (in librocksdb.7.0.0.dylib) (db_impl_ rite.cc:421) https://github.com/facebook/rocksdb/issues/6 rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:109) https://github.com/facebook/rocksdb/issues/7 rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:2159) https://github.com/facebook/rocksdb/issues/8 rocksdb::DBImpl::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db_impl_write.cc:37) https://github.com/facebook/rocksdb/issues/9 rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) (in librocksdb.7.0.0.dylib) (db.h:382) https://github.com/facebook/rocksdb/issues/10 rocksdb::DBBasicTestWithTimestampPrefixSeek_IterateWithPrefix_Test::TestBody() (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:2926) https://github.com/facebook/rocksdb/issues/11 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) (in db_with_timestamp_basic_test) (gtest-all.cc:3899) https://github.com/facebook/rocksdb/issues/12 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) (in db_with_timestamp_basic_test) (gtest-all.cc:3935) https://github.com/facebook/rocksdb/issues/13 testing::Test::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:3980) https://github.com/facebook/rocksdb/issues/14 testing::TestInfo::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4153) https://github.com/facebook/rocksdb/issues/15 testing::TestCase::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:4266) https://github.com/facebook/rocksdb/issues/16 testing::internal::UnitTestImpl::RunAllTests() (in db_with_timestamp_basic_test) (gtest-all.cc:6632) https://github.com/facebook/rocksdb/issues/17 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl, bool (testing::internal::UnitTestImpl::)(), char const) (in db_with_timestamp_basic_test) (gtest-all.cc:3899) https://github.com/facebook/rocksdb/issues/18 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl, bool (testing::internal::UnitTestImpl::)(), char const*) (in db_with_timestamp_basic_test) (gtest-all.cc:3935) https://github.com/facebook/rocksdb/issues/19 testing::UnitTest::Run() (in db_with_timestamp_basic_test) (gtest-all.cc:6242) https://github.com/facebook/rocksdb/issues/20 RUN_ALL_TESTS() (in db_with_timestamp_basic_test) (gtest.h:22110) https://github.com/facebook/rocksdb/issues/21 main (in db_with_timestamp_basic_test) (db_with_timestamp_basic_test.cc:3150) https://github.com/facebook/rocksdb/issues/22 start (in libdyld.dylib) + 1 ``` It's likely caused by concurrent, unprotected access to the deque, even though `back()` is never popped, and we are comparing `rbegin()` with a cached `riterator`. To be safe, do the comparison only if we have mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9717 Test Plan: One example Ssh to one CircleCI MacOS instance. ``` gtest-parallel -r 1000 -w 8 ./db_test --gtest_filter=DBTest.FlushesInParallelWithCompactRange ``` Reviewed By: pdillinger Differential Revision: D34990696 Pulled By: riversand963 fbshipit-source-id: 62dd48ae6fedbda53d0a64d73de9b948b4c26eee	2022-03-18 13:11:57 -07:00
anand76	a88d8795ec	Expand auto recovery to background read errors (#9679 ) Summary: Fix and enhance the background error recovery logic to handle the following situations - 1. Background read errors during flush/compaction (previously was resulting in unrecoverable state) 2. Fix auto recovery failure on read/write errors during atomic flush. It was failing due to a bug in setting the resuming_from_bg_err variable in AtomicFlushMemTablesToOutputFiles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9679 Test Plan: Add new unit tests in error_handler_fs_test Reviewed By: riversand963 Differential Revision: D34770097 Pulled By: anand1976 fbshipit-source-id: 136da973a28d684b9c74bdf668519b0cbbbe1742	2022-03-15 14:45:34 -07:00
Jay Zhuang	2c8100e60e	Fix a race condition when disable and enable manual compaction (#9694 ) Summary: In https://github.com/facebook/rocksdb/issues/9659, when `DisableManualCompaction()` is issued, the foreground manual compaction thread does not have to wait background compaction thread to finish. Which could be a problem that the user re-enable manual compaction with `EnableManualCompaction()`, it may re-enable the BG compaction which supposed be cancelled. This patch makes the FG compaction wait on `manual_compaction_state.done`, which either be set by BG compaction or Unschedule callback. Then when FG manual compaction thread returns, it should not have BG compaction running. So shared_ptr is no longer needed for `manual_compaction_state`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9694 Test Plan: a StressTest and unittest Reviewed By: ajkr Differential Revision: D34885472 Pulled By: jay-zhuang fbshipit-source-id: e6476175b43e8c59cd49f5c09241036a0716c274	2022-03-15 12:31:14 -07:00
Yanqin Jin	6a76008369	Fix TSAN caused by calling `rend()` and `pop_front()`. (#9698 ) Summary: PR9686 makes `WriteToWAL()` call `assert(...!=rend())` while not holding db mutex or log mutex. Another thread may concurrently call `pop_front()`, causing race condition. To fix, assert only if mutex is held. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9698 Test Plan: COMPILE_WITH_TSAN=1 make check Reviewed By: jay-zhuang Differential Revision: D34898535 Pulled By: riversand963 fbshipit-source-id: 1ddfa5bf1b6ae8d409cab6ff6e1b5321c6803da9	2022-03-15 12:16:40 -07:00
Yanqin Jin	bbdaf63d0f	Fix a TSAN-reported bug caused by concurrent accesss to std::deque (#9686 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9686 According to https://www.cplusplus.com/reference/deque/deque/back/, " The container is accessed (neither the const nor the non-const versions modify the container). The last element is potentially accessed or modified by the caller. Concurrently accessing or modifying other elements is safe. " Also according to https://www.cplusplus.com/reference/deque/deque/pop_front/, " The container is modified. The first element is modified. Concurrently accessing or modifying other elements is safe (although see iterator validity above). " In RocksDB, we never pop the last element of `DBImpl::alive_log_files_`. We have been exploiting this fact and the above two properties when ensuring correctness when `DBImpl::alive_log_files_` may be accessed concurrently. Specifically, it can be accessed in the write path when db mutex is released. Sometimes, the log_mute_ is held. It can also be accessed in `FindObsoleteFiles()` when db mutex is always held. It can also be accessed during recovery when db mutex is also held. Given the fact that we never pop the last element of alive_log_files_, we currently do not acquire additional locks when accessing it in `WriteToWAL()` as follows ``` alive_log_files_.back().AddSize(log_entry.size()); ``` This is problematic. Check source code of deque.h ``` back() _GLIBCXX_NOEXCEPT { __glibcxx_requires_nonempty(); ... } pop_front() _GLIBCXX_NOEXCEPT { ... if (this->_M_impl._M_start._M_cur != this->_M_impl._M_start._M_last - 1) { ... ++this->_M_impl._M_start._M_cur; } ... } ``` `back()` will actually call `__glibcxx_requires_nonempty()` first. If `__glibcxx_requires_nonempty()` is enabled and not an empty macro, it will call `empty()` ``` bool empty() { return this->_M_impl._M_finish == this->_M_impl._M_start; } ``` You can see that it will access `this->_M_impl._M_start`, racing with `pop_front()`. Therefore, TSAN will actually catch the bug in this case. To be able to use TSAN on our library and unit tests, we should always coordinate concurrent accesses to STL containers properly. We need to pass information about db mutex and log mutex into `WriteToWAL()`, otherwise it's impossible to know which mutex to acquire inside the function. To fix this, we can catch the tail of `alive_log_files_` by reference, so that we do not have to call `back()` in `WriteToWAL()`. Reviewed By: pdillinger Differential Revision: D34780309 fbshipit-source-id: 1def9821f0c437f2736c6a26445d75890377889b	2022-03-14 18:49:55 -07:00
Jay Zhuang	4dff279b19	DisableManualCompaction may fail to cancel an unscheduled task (#9659 ) Summary: https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction. make sure we only unschedule the task when it's scheduled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659 Reviewed By: ajkr Differential Revision: D34651820 Pulled By: jay-zhuang fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f	2022-03-12 20:07:04 -08:00
Jay Zhuang	09b0e8f2c7	Fix a timer crash caused by invalid memory management (#9656 ) Summary: Timer crash when multiple DB instances doing heavy DB open and close operations concurrently. Which is caused by adding a timer task with smaller timestamp than the current running task. Fix it by moving the getting new task timestamp part within timer mutex protection. And other fixes: - Disallow adding duplicated function name to timer - Fix a minor memory leak in timer when a running task is cancelled Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656 Reviewed By: ajkr Differential Revision: D34626296 Pulled By: jay-zhuang fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b	2022-03-12 11:45:56 -08:00
slk	95305c44a1	Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410 ) Summary: As disscussed in (https://github.com/facebook/rocksdb/issues/9223), Here added a new API named DB::OpenAndTrimHistory, this API will open DB and trim data to the timestamp specofied by trim_ts (The data with newer timestamp than specified trim bound will be removed). This API should only be used at a timestamp-enabled db instance recovery. And this PR implemented a new iterator named HistoryTrimmingIterator to support trimming history with a new API named DB::OpenAndTrimHistory. HistoryTrimmingIterator wrapped around the underlying InternalITerator such that keys whose timestamps newer than trim_ts should not be returned to the compaction iterator while trim_ts is not null. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410 Reviewed By: ltamasi Differential Revision: D34410207 Pulled By: riversand963 fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236	2022-03-11 16:13:23 -08:00
Yanqin Jin	3b6dc049f7	Support user-defined timestamps in write-committed txns (#9629 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629 Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are locked upon first operation that writes the key or has the intention of writing. For example, `PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`, `PessimisticTransaction::SingleDelete()` will write to or delete a key, while `PessimisticTransaction::GetForUpdate()` is used by application to indicate to RocksDB that the transaction has the intention of performing write operation later in the same transaction. Pessimistic transactions support two-phase commit (2PC). A transaction can be `Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once `Prepare()` succeeds, the transaction has acquired the necessary resources to commit. The resources include locks, persistence of WAL, etc. Write-committed transaction is the default pessimistic transaction implementation. In RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare section. `Commit()` will write a commit marker to the WAL and then write data to the memtables. While writing to the memtables, different keys in the transaction's write batch will be assigned different sequence numbers in ascending order. Until commit/rollback, the transaction holds locks on the keys so that no other transaction can write to the same keys. Furthermore, the keys' sequence numbers represent the order in which they are committed and should be made visible. This is convenient for us to implement support for user-defined timestamps. Since column families with and without timestamps can co-exist in the same database, a transaction may or may not involve timestamps. Based on this observation, we add two optional members to each `PessimisticTransaction`, `read_timestamp_` and `commit_timestamp_`. If no key in the transaction's write batch has timestamp, then setting these two variables do not have any effect. For the rest of this commit, we discuss only the cases when these two variables are meaningful. read_timestamp_ is used mainly for validation, and should be set before first call to `GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls `TryLock()` that can verify if another transaction has written the same key since `read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed written the same key, then validation fails, and RocksDB allows this transaction to refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()` with a different timestamp to read, but the result of the read should not be used to determine data that will be written later. commit_timestamp_ must be set after finishing writing and before transaction commit. This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after prepare phase succeeds. We currently require that the commit timestamp be chosen after all keys are locked. This means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`, `PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`, etc. will return non-ok status because they specify timestamps before locking the keys. Users are also prompted to use the `Transaction` APIs when they receive the non-ok status. Reviewed By: ltamasi Differential Revision: D31822445 fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43	2022-03-08 16:20:59 -08:00
Hui Xiao	ca0ef54f16	Rate-limit automatic WAL flush after each user write (#9607 ) Summary: Context: WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR is to provide rate-limiting to auto WAL flush, the one that automatically happen after each user write operation (i.e, `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`. Note that we are NOT rate-limiting WAL flush that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true + manual FlushWAL()` (rate-limiting multiple WAL flushes), for the benefits of: - being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515) - being able to turn off some WAL flush's rate-limiting but not all (e.g, turn off specific the WAL flush of a critical user write like a service's heartbeat) `WriteOptions::rate_limiter_options` only accept `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint. - The constraint is that we currently queue parallel writes (including WAL writes) based on FIFO policy which does not factor rate limiter priority into this layer's scheduling. If we allow lower priorities such as `Env::IO_HIGH/MID/LOW` and such writes specified with lower priorities occurs before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would have blocked the latter, leading to a "priority inversion" issue and contradictory to what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` right now before improving that scheduling. A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR. Summary: - Renamed test suite `DBRateLimiterTest to DBRateLimiterOnReadTest` for adding a new test suite - Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions - Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607 Test Plan: - Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`). - Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush` - `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100` - verified that WAL flush(i.e, system-call _write_) were chunked into 15 bytes and each _write_ was roughly 1 second apart - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0` - crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0 --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal Benchmarked on flush/compaction to ensure no performance regression: - compaction with rate-limiting (see table 1, avg over 1280-run): pre-change: 915635 micros/op; post-change: 907350 micros/op (improved by 0.106%) ``` #!/bin/bash TEST_TMPDIR=/dev/shm/testdb START=1 NUM_DATA_ENTRY=8 N=10 rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt for i in $(eval echo "{$START..$NUM_DATA_ENTRY}") do NUM_RUN=$(($N(2($i-1)))) for j in $(eval echo "{$START..$NUM_RUN}") do ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 \| egrep 'compact' done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt done ``` - compaction w/o rate-limiting (see table 2, avg over 640-run): pre-change: 822197 micros/op; post-change: 823148 micros/op (regressed by 0.12%)* ``` Same as above script, except that -rate_limiter_bytes_per_sec=0 ``` - flush with rate-limiting (see table 3, avg over 320-run, run on the [patch](`ee5c6023a9`) to augment current db_bench ): pre-change: 745752 micros/op; post-change: 745331 micros/op (regressed by 0.06 %) ``` #!/bin/bash TEST_TMPDIR=/dev/shm/testdb START=1 NUM_DATA_ENTRY=8 N=10 rm -f flush_bmk_output.txt flush_bmk_output_2.txt for i in $(eval echo "{$START..$NUM_DATA_ENTRY}") do NUM_RUN=$(($N(2($i-1)))) for j in $(eval echo "{$START..$NUM_RUN}") do ./db_bench -db=$TEST_TMPDIR -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush \| egrep 'flush' done > flush_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt done ``` - flush w/o rate-limiting (see table 4, avg over 320-run, run on the [patch](`ee5c6023a9`) to augment current db_bench): pre-change: 487512 micros/op, post-change: 485856 micors/ops (improved by 0.34%)* ``` Same as above script, except that -rate_limiter_bytes_per_sec=0 ``` \| table 1 - compact with rate-limiting\| #-run \| (pre-change) avg micros/op \| std micros/op \| (post-change) avg micros/op \| std micros/op \| change in avg micros/op (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 896978 \| 16046.9 \| 901242 \| 15670.9 \| 0.475373978 20 \| 893718 \| 15813 \| 886505 \| 17544.7 \| -0.8070778478 40 \| 900426 \| 23882.2 \| 894958 \| 15104.5 \| -0.6072681153 80 \| 906635 \| 21761.5 \| 903332 \| 23948.3 \| -0.3643141948 160 \| 898632 \| 21098.9 \| 907583 \| 21145 \| 0.9960695813 3.20E+02 \| 905252 \| 22785.5 \| 908106 \| 25325.5 \| 0.3152713278 6.40E+02 \| 905213 \| 23598.6 \| 906741 \| 21370.5 \| 0.1688000504 1.28E+03 \| 908316 \| 23533.1 \| 907350 \| 24626.8 \| -0.1063506533 average over #-run \| 901896.25 \| 21064.9625 \| 901977.125 \| 20592.025 \| 0.008967217682 \| table 2 - compact w/o rate-limiting\| #-run \| (pre-change) avg micros/op \| std micros/op \| (post-change) avg micros/op \| std micros/op \| change in avg micros/op (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 811211 \| 26996.7 \| 807586 \| 28456.4 \| -0.4468627768 20 \| 815465 \| 14803.7 \| 814608 \| 28719.7 \| -0.105093413 40 \| 809203 \| 26187.1 \| 797835 \| 25492.1 \| -1.404839082 80 \| 822088 \| 28765.3 \| 822192 \| 32840.4 \| 0.01265071379 160 \| 821719 \| 36344.7 \| 821664 \| 29544.9 \| -0.006693285661 3.20E+02 \| 820921 \| 27756.4 \| 821403 \| 28347.7 \| 0.05871454135 6.40E+02 \| 822197 \| 28960.6 \| 823148 \| 30055.1 \| 0.1156657103 average over #-run \| 8.18E+05 \| 2.71E+04 \| 8.15E+05 \| 2.91E+04 \| -0.25 \| table 3 - flush with rate-limiting\| #-run \| (pre-change) avg micros/op \| std micros/op \| (post-change) avg micros/op \| std micros/op \| change in avg micros/op (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 741721 \| 11770.8 \| 740345 \| 5949.76 \| -0.1855144994 20 \| 735169 \| 3561.83 \| 743199 \| 9755.77 \| 1.09226586 40 \| 743368 \| 8891.03 \| 742102 \| 8683.22 \| -0.1703059588 80 \| 742129 \| 8148.51 \| 743417 \| 9631.58\| 0.1735547324 160 \| 749045 \| 9757.21 \| 746256 \| 9191.86 \| -0.3723407806 3.20E+02 \| 745752 \| 9819.65 \| 745331 \| 9840.62 \| -0.0564530836 6.40E+02 \| 749006 \| 11080.5 \| 748173 \| 10578.7 \| -0.1112140624 average over #-run \| 743741.4286 \| 9004.218571 \| 744117.5714 \| 9090.215714 \| 0.05057441238 \| table 4 - flush w/o rate-limiting\| #-run \| (pre-change) avg micros/op \| std micros/op \| (post-change) avg micros/op \| std micros/op \| change in avg micros/op (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 477283 \| 24719.6 \| 473864 \| 12379 \| -0.7163464863 20 \| 486743 \| 20175.2 \| 502296 \| 23931.3 \| 3.195320734 40 \| 482846 \| 15309.2 \| 489820 \| 22259.5 \| 1.444352858 80 \| 491490 \| 21883.1 \| 490071 \| 23085.7 \| -0.2887139108 160 \| 493347 \| 28074.3 \| 483609 \| 21211.7 \| -1.973864238 3.20E+02 \| 487512 \| 21401.5 \| 485856 \| 22195.2 \| -0.3396839462 6.40E+02 \| 490307 \| 25418.6 \| 485435 \| 22405.2 \| -0.9936631539 average over #-run \| 4.87E+05 \| 2.24E+04 \| 4.87E+05 \| 2.11E+04 \| 0.00E+00 Reviewed By: ajkr Differential Revision: D34442441 Pulled By: hx235 fbshipit-source-id: 4790f13e1e5c0a95ae1d1cc93ffcf69dc6e78bdd	2022-03-08 13:19:39 -08:00
Ezgi Çiçek	27d6ef8e60	Rename mutable_cf_options to signify explicity copy (#9666 ) Summary: Signify explicit copy with comment and better name for variable `mutable_cf_options` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9666 Reviewed By: riversand963 Differential Revision: D34680934 Pulled By: ezgicicek fbshipit-source-id: b64ef18725fe523835d14ceb4b29bcdfe493f8ed	2022-03-08 11:26:40 -08:00
Jay Zhuang	36aec94d85	`compression_per_level` should be used for flush and changeable (#9658 ) Summary: - Make `compression_per_level` dynamical changeable with `SetOptions`; - Fix a bug that `compression_per_level` is not used for flush; Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658 Test Plan: CI Reviewed By: ajkr Differential Revision: D34700749 Pulled By: jay-zhuang fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c	2022-03-07 18:06:19 -08:00
Yanqin Jin	659a16d52b	Fix bug causing incorrect data returned by snapshot read (#9648 ) Summary: This bug affects use cases that meet the following conditions - (has only the default column family or disables WAL) and - has at least one event listener - atomic flush is NOT affected. If the above conditions meet, then RocksDB can release the db mutex before picking all the existing memtables to flush. In the meantime, a snapshot can be created and db's sequence number can still be incremented. The upcoming flush will ignore this snapshot. A later read using this snapshot can return incorrect result. To fix this issue, we call the listeners callbacks after picking the memtables so that we avoid creating snapshots during this interval. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9648 Test Plan: make check Reviewed By: ajkr Differential Revision: D34555456 Pulled By: riversand963 fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee	2022-03-02 21:03:14 -08:00
Jay Zhuang	db8647969d	Unschedule manual compaction from thread-pool queue (#9625 ) Summary: PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between manual compaction foreground thread and background compaction thread. This PR adds the ability to really unschedule manual compaction from thread-pool queue by differentiate tag name for manual compaction and other tasks. Also fix an issue that db `close()` didn't cancel the manual compaction thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625 Test Plan: unittest not hang Reviewed By: ajkr Differential Revision: D34410811 Pulled By: jay-zhuang fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c	2022-03-02 13:43:00 -08:00
Andrew Kryczka	9983eecdfb	Dedicate cacheline for DB mutex (#9637 ) Summary: We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by https://github.com/facebook/rocksdb/issues/9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9637 Reviewed By: riversand963 Differential Revision: D34502233 Pulled By: ajkr fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07	2022-02-27 11:36:54 -08:00
Hui Xiao	87a8b3c8af	Deflake DBErrorHandlingFSTest.MultiCFWALWriteError (#9496 ) Summary: Context: As part of https://github.com/facebook/rocksdb/pull/6949, file deletion is disabled for faulty database on the IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](`e66199d848 (diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396)`). Before re-enabling file deletion, it `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is the `version_` in the recovery process. However, this is not necessarily true due to `s = error_handler_.ClearBGError();` happening before that assertion can unblock some foreground thread by [`EventHelpers::NotifyOnErrorRecoveryEnd()`](`3122cb4358/db/error_handler.cc (L552-L553)`) as part of the `ClearBGError()`. That foreground thread can do whatever it wants including closing/reopening the db and clean up that same `versions_`. As a consequence, `assert(versions_->io_status().ok());`, will access `io_status()` of a nullptr and test like `DBErrorHandlingFSTest.MultiCFWALWriteError` becomes flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok()); ` gets excuted in the background thread, then we can see error like ``` db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet' assert(versions_->io_status().ok()); ``` Summary: - I proposed to call `s = error_handler_.ClearBGError();` after we know it's fine to wake up foreground, which I think is right before we LOG `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` - As the context, the orignal https://github.com/facebook/rocksdb/pull/3997 introducing `DBImpl::Resume()` calls `s = error_handler_.ClearBGError();` very close to calling `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` while the later https://github.com/facebook/rocksdb/pull/6949 distances these two calls a bit. - And it seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/force=/true);` at least syntax-wise since these two functions are orthogonal. And it also seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which basically is resetting some state variables. - In addition, to preserve the previous behavior of https://github.com/facebook/rocksdb/pull/6949 where status of re-enabling file deletion is not taken account into the general status of resuming the db, I separated `enable_file_deletion_s` from the general `s` - In addition, to make `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` more clear, I separated it into its own if-block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9496 Test Plan: - Manually reproduce the assertion failure in`DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting sleep like below so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e, testing) thread ``` sleep(1); assert(versions_->io_status().ok()); ``` `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError` ``` [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBErrorHandlingFSTest [ RUN ] DBErrorHandlingFSTest.MultiCFWALWriteError Received signal 11 (Segmentation fault) #0 rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext) /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421 https://github.com/facebook/rocksdb/issues/1 rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600 https://github.com/facebook/rocksdb/issues/2 rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError() /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310 https://github.com/facebook/rocksdb/issues/3 rocksdb/error_handler_fs_test() ``` - The assertion failure does not happen with PR `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError` `[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms) ` Reviewed By: riversand963, anand1976 Differential Revision: D33990099 Pulled By: hx235 fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab	2022-02-25 14:44:46 -08:00
Siddhartha Roychowdhury	39b0d92153	Add record to set WAL compression type if enabled (#9556 ) Summary: When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs. Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556 Reviewed By: anand1976 Differential Revision: D34308216 Pulled By: sidroyc fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705	2022-02-17 16:19:31 -08:00
Andrew Kryczka	babe56ddba	Add rate limiter priority to ReadOptions (#9424 ) Summary: Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working. `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`. There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads). The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424 Test Plan: - new unit tests - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart. - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true` - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true` - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10` Reviewed By: hx235 Differential Revision: D33747386 Pulled By: ajkr fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c	2022-02-16 23:18:14 -08:00
Yanqin Jin	1cda273dc3	Fix a silent data loss for write-committed txn (#9571 ) Summary: The following sequence of events can cause silent data loss for write-committed transactions. ``` Time thread 1 bg flush \| db->Put("a") \| txn = NewTxn() \| txn->Put("b", "v") \| txn->Prepare() // writes only to 5.log \| db->SwitchMemtable() // memtable 1 has "a" \| // close 5.log, \| // creates 8.log \| trigger flush \| pick memtable 1 \| unlock db mutex \| write new sst \| txn->ctwb->Put("gtid", "1") // writes 8.log \| txn->Commit() // writes to 8.log \| // writes to memtable 2 \| compute min_log_number_to_keep_2pc, this \| will be 8 (incorrect). \| \| Purge obsolete wals, including 5.log \| V ``` At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"}, only the commit marker of txn is in 8.log The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`. In the above example, when bg flush thread tries to find obsolete wals, it uses the information computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()` depends on three components - `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit. - `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed. - `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data. The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log. We should return this information via `FindMinPrepLogReferencedByMemTable()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571 Test Plan: ``` ./transaction_test --gtest_filter=/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/ make check ``` Reviewed By: siying Differential Revision: D34235236 Pulled By: riversand963 fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1	2022-02-16 23:08:58 -08:00
Jay Zhuang	a0c569ee1d	Cancel manual compaction in thread-pool queue (#9557 ) Summary: Fix `DisableManualCompaction()` has to wait scheduled manual compaction to start the execution to cancel the job. When a manual compaction in thread-pool queue is cancel, set the job is_canceled to true and clean the resource. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9557 Test Plan: added unittest that will hang without the change Reviewed By: ajkr Differential Revision: D34214910 Pulled By: jay-zhuang fbshipit-source-id: 89dbaee78ddf26eb13ce862c2b15f4a098b36a78	2022-02-15 19:23:01 -08:00
Hui Xiao	443d8ef094	Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507 ) Summary: Context: Running the new test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` prior to this fix surfaces the read-after-free bug of PinSef() as below: ``` READ of size 8 at 0x60400002529d thread T0 https://github.com/facebook/rocksdb/issues/5 0x7f199a in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171 https://github.com/facebook/rocksdb/issues/6 0x7f199a in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1919 https://github.com/facebook/rocksdb/issues/7 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle, rocksdb::Slice const&, rocksdb::PinnableSlice, rocksdb::GetMergeOperandsOptions, int) db/db_impl/db_impl.h:203 freed by thread T0 here: https://github.com/facebook/rocksdb/issues/3 0x1191399 in rocksdb::cache_entry_roles_detail::RegisteredDeleter<rocksdb::Block, (rocksdb::CacheEntryRole)0>::Delete(rocksdb::Slice const&, void) cache/cache_entry_roles.h:99 https://github.com/facebook/rocksdb/issues/4 0x719348 in rocksdb::LRUHandle::Free() cache/lru_cache.h:205 https://github.com/facebook/rocksdb/issues/5 0x71047f in rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle, bool) cache/lru_cache.cc:547 https://github.com/facebook/rocksdb/issues/6 0xa78f0a in rocksdb::Cleanable::DoCleanup() include/rocksdb/cleanable.h:60 https://github.com/facebook/rocksdb/issues/7 0xa78f0a in rocksdb::Cleanable::Reset() include/rocksdb/cleanable.h:38 https://github.com/facebook/rocksdb/issues/8 0xa78f0a in rocksdb::PinnedIteratorsManager::ReleasePinnedData() db/pinned_iterators_manager.h:71 https://github.com/facebook/rocksdb/issues/9 0xd0c21b in rocksdb::PinnedIteratorsManager::~PinnedIteratorsManager() db/pinned_iterators_manager.h:24 https://github.com/facebook/rocksdb/issues/10 0xd0c21b in rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, rocksdb::Status, rocksdb::MergeContext, unsigned long, bool, bool, unsigned long, rocksdb::ReadCallback, bool, bool) db/pinned_iterators_manager.h:22 https://github.com/facebook/rocksdb/issues/11 0x7f0fdf in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1886 https://github.com/facebook/rocksdb/issues/12 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle, rocksdb::Slice const&, rocksdb::PinnableSlice, rocksdb::GetMergeOperandsOptions, int) db/db_impl/db_impl.h:203 previously allocated by thread T0 here: https://github.com/facebook/rocksdb/issues/1 0x1239896 in rocksdb::AllocateBlock(unsigned long, *rocksdb::MemoryAllocator)** memory/memory_allocator.h:35 https://github.com/facebook/rocksdb/issues/2 0x1239896 in rocksdb::BlockFetcher::CopyBufferToHeapBuf() table/block_fetcher.cc:171 https://github.com/facebook/rocksdb/issues/3 0x1239896 in rocksdb::BlockFetcher::GetBlockContents() table/block_fetcher.cc:206 https://github.com/facebook/rocksdb/issues/4 0x122eae5 in rocksdb::BlockFetcher::ReadBlockContents() table/block_fetcher.cc:325 https://github.com/facebook/rocksdb/issues/5 0x11b1f45 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, rocksdb::CachableEntry<rocksdb::Block>, rocksdb::BlockType, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::BlockContents) const table/block_based/block_based_table_reader.cc:1503 ``` Here is the analysis: - We have [PinnedIteratorsManager](https://github.com/facebook/rocksdb/blob/6.28.fb/db/version_set.cc#L1980) with `Cleanable` capability in our `Version::Get()` path. It's responsible for managing the life-time of pinned iterator and invoking registered cleanup functions during its own destruction. - For example in case above, the merge operands's clean-up gets associated with this manger in [GetContext::push_operand](https://github.com/facebook/rocksdb/blob/6.28.fb/table/get_context.cc#L405). During PinnedIteratorsManager's [destruction](https://github.com/facebook/rocksdb/blob/6.28.fb/db/pinned_iterators_manager.h#L67), the release function associated with those merge operand data is invoked. And that's what we see in "freed by thread T955 here" in ASAN.* - Bug 🐛: `PinnedIteratorsManager` is local to `Version::Get()` while the data of merge operands need to outlive `Version::Get` and stay till they get [PinSelf()](https://github.com/facebook/rocksdb/blob/6.28.fb/db/db_impl/db_impl.cc#L1905), which is the read-after-free in ASAN. - This bug is likely to be an overlook of `PinnedIteratorsManager` when developing the API `DB::GetMergeOperands` cuz the current logic works fine with the existing case of getting the merged value where the operands do not need to live that long. - This bug was not surfaced much (even in its unit test) due to the release function associated with the merge operands (which are actually blocks put in cache as you can see in `BlockBasedTable::MaybeReadBlockAndLoadToCache` in "previously allocated by" in ASAN report) is a cache entry deleter. The deleter will call `Cache::Release()` which, for LRU cache, won't immediately deallocate the block based on LRU policy [unless the cache is full or being instructed to force erase](https://github.com/facebook/rocksdb/blob/6.28.fb/cache/lru_cache.cc#L521-L531) - `DBMergeOperandTest.MergeOperandReadAfterFreeBug` makes the cache extremely small to force cache full. Summary: - Fix the bug by align `PinnedIteratorsManager`'s lifetime with the merge operands Pull Request resolved: https://github.com/facebook/rocksdb/pull/9507 Test Plan: - New test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` - db bench on read path - Setup (LSM tree with several levels, cache the whole db to avoid read IO, warm cache with readseq to avoid read IO): `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1``TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="readrandom" -num=1000000 -cache_size=100000000 ` - Actual command run (run 20-run for 20 times and then average the 20-run's average micros/op) - `for j in {1..20}; do (for i in {1..20}; do rm -rf /dev/shm/rocksdb/ && TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq,readrandom" -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1 \| egrep 'readrandom'; done > rr_output_pre.txt && (awk '{sum+=$3; sum_sqrt+=$3^2}END{print sum/20, sqrt(sum_sqrt/20-(sum/20)^2)}' rr_output_pre.txt) >> rr_output_pre_2.txt); done` - Result: Pre-change: 3.79193 micros/op; Post-change: 3.79528 micros/op (+0.09%) (pre-change)sorted avg micros/op of each 20-run \| std of micros/op of each 20-run \| (post-change) sorted avg micros/op of each 20-run \| std of micros/op of each 20-run -- \| -- \| -- \| -- 3.58355 \| 0.265209 \| 3.48715 \| 0.382076 3.58845 \| 0.519927 \| 3.5832 \| 0.382726 3.66415 \| 0.452097 \| 3.677 \| 0.563831 3.68495 \| 0.430897 \| 3.68405 \| 0.495355 3.70295 \| 0.482893 \| 3.68465 \| 0.431438 3.719 \| 0.463806 \| 3.71945 \| 0.457157 3.7393 \| 0.453423 \| 3.72795 \| 0.538604 3.7806 \| 0.527613 \| 3.75075 \| 0.444509 3.7817 \| 0.426704 \| 3.7683 \| 0.468065 3.809 \| 0.381033 \| 3.8086 \| 0.557378 3.80985 \| 0.466011 \| 3.81805 \| 0.524833 3.8165 \| 0.500351 \| 3.83405 \| 0.529339 3.8479 \| 0.430326 \| 3.86285 \| 0.44831 3.85125 \| 0.434108 \| 3.8717 \| 0.544098 3.8556 \| 0.524602 \| 3.895 \| 0.411679 3.8656 \| 0.476383 \| 3.90965 \| 0.566636 3.8911 \| 0.488477 \| 3.92735 \| 0.608038 3.898 \| 0.493978 \| 3.9439 \| 0.524511 3.97235 \| 0.515008 \| 3.9623 \| 0.477416 3.9768 \| 0.519993 \| 3.98965 \| 0.521481 - CI Reviewed By: ajkr Differential Revision: D34030519 Pulled By: hx235 fbshipit-source-id: a99ac585c11704c5ed93af033cb29ba0a7b16ae8	2022-02-15 12:25:18 -08:00
Ezgi Çiçek	95d9cb8357	Avoid unnecessary copy of sample_slice map (#9551 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9551 Reviewed By: riversand963 Differential Revision: D34169574 Pulled By: ezgicicek fbshipit-source-id: 2e88db59b65bda269917a9b0bed17181a4afd281	2022-02-11 09:15:27 -08:00
Hui Xiao	c5cd31c12b	Fix TSAN data race in EventListenerTest.MultiCF (#9528 ) Summary: Context: `EventListenerTest.MultiCF` occasionally failed on TSAN data race as below: ``` WARNING: ThreadSanitizer: data race (pid=2047633) Read of size 8 at 0x7b6000001440 by main thread: #0 std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:916:40 (listener_test+0x52337c) https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:384:7 (listener_test+0x52337c) Previous write of size 8 at 0x7b6000001440 by thread T2: #0 void std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB*, std::vector<rocksdb::DB, std::allocator<rocksdb::DB> > >, rocksdb::DB const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/vector.tcc:503:31 (listener_test+0x550654) https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:1195:4 (listener_test+0x550654) https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18 (listener_test+0x550654) ``` After investigation, it is due to the following: (1) `ASSERT_OK(Flush(i));` before the read `std::vector::size()` is supposed to be [blocked on `DB::Impl::bg_cv_` for memtable flush to finish](`320d9a8e8a/db/db_impl/db_impl_compaction_flush.cc (L2319)`) and get signaled [at the end of background flush ](`320d9a8e8a/db/db_impl/db_impl_compaction_flush.cc (L2830)`), which happens after the write `std::vector::push_back()` . So the sequence of execution should have been synchronized as `call flush() -> write -> return from flush() -> read` and would not cause any TSAN data race. - The subsequent `ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());` serves a similar purpose based on [the previous attempt to deflake the test.](https://github.com/facebook/rocksdb/pull/9084) (2) However, there are multiple places in the code can signal this `DB::Impl::bg_cv_` and mistakenly wake up `ASSERT_OK(Flush(i));` (or `ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());`) too early (and with the lock available to them), resulting in non-synchronized read and write thus a TSAN data race. - Reproduced by the following, suggested by ajkr: ``` diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 4ff87c1e4..52492e9cf 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -22,7 +22,7 @@ #include "test_util/sync_point.h" #include "util/cast_util.h" #include "util/concurrent_task_limiter_impl.h" namespace ROCKSDB_NAMESPACE { bool DBImpl::EnoughRoomForCompaction( @@ -855,6 +855,7 @@ void DBImpl::NotifyOnFlushCompleted( mutable_cf_options.level0_stop_writes_trigger); // release lock while notifying events mutex_.Unlock(); + bg_cv_.SignalAll(); ``` Summary:* - Added synchornization between read and write by ` ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency()` mechanism Pull Request resolved: https://github.com/facebook/rocksdb/pull/9528 Test Plan: `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10` - pre-fix: ``` Repeating all tests (iteration 3) Note: Google Test filter = EventListenerTest.MultiCF [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from EventListenerTest [ RUN ] EventListenerTest.MultiCF ================== WARNING: ThreadSanitizer: data race (pid=3377137) Read of size 8 at 0x7b6000000840 by main thread: #0 std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::size() https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() db/listener_test.cc:384 (listener_test+0x4bb300) Previous write of size 8 at 0x7b6000000840 by thread T2: #0 void std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB*, std::vector<rocksdb::DB, std::allocator<rocksdb::DB> > >, rocksdb::DB const&) https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::push_back(rocksdb::DB* const&) https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) db/listener_test.cc:255 (listener_test+0x4e820f) ``` - post-fix: `All passed` Reviewed By: ajkr Differential Revision: D34085791 Pulled By: hx235 fbshipit-source-id: f877aa687ea1d5cb6f31ef8c4772625d22868e8b	2022-02-10 10:21:25 -08:00
Levi Tamasi	320d9a8e8a	Use a sorted vector instead of a map to store blob file metadata (#9526 ) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc	2022-02-09 12:36:43 -08:00
Akanksha Mahajan	9745c68eb1	Remove deprecated option new_table_reader_for_compaction_inputs (#9443 ) Summary: In RocksDB option new_table_reader_for_compaction_inputs has not effect on Compaction or on the behavior of RocksDB library. Therefore, we are removing it in the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443 Test Plan: CircleCI Reviewed By: ajkr Differential Revision: D33788508 Pulled By: akankshamahajan15 fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d	2022-02-08 19:31:28 -08:00
Yanqin Jin	629e3e1d77	Fix spelling in public API (#9490 ) Summary: I feel it would be nice if we can fix this spelling error. In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490 Test Plan: make check Reviewed By: hx235 Differential Revision: D33949862 Pulled By: riversand963 fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466	2022-02-03 15:15:23 -08:00
Yanqin Jin	3122cb4358	Revise APIs related to user-defined timestamp (#8946 ) Summary: ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`. Namely, `WriteOptions` should not include information about "what-to-write", but should just include information about "how-to-write". According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore, this PR removes `WriteOptions::timestamp` for compliance. After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and `SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity made me reconsider doing it in another PR (maybe). For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take extra `timestamp` information when writing to `WriteBatch`es. These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list. Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to `WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps allocated already and multiple timestamps can be updated. The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp size of the default column family. This will be used to allocate space when calling APIs that do not specify a column family handle. Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing some assertions about timestamp to returning Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946 Test Plan: make check ./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8 ./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0 Make sure there is no perf regression by running the following ``` ./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom ``` Before this PR ``` DB path: [/dev/shm/rocksdb] fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s ``` After this PR ``` DB path: [/dev/shm/rocksdb] fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s ``` Reviewed By: ltamasi Differential Revision: D33721359 Pulled By: riversand963 fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09	2022-02-01 22:19:01 -08:00
Jay Zhuang	980b9ff385	Add more micro-benchmark tests (#9436 ) Summary: * Add more micro-benchmark tests * Expose an API in DBImpl for waiting for compactions (still not visible to the user) * Add argument name for ribbon_bench * remove benchmark run from CI, as it runs too long. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9436 Test Plan: CI Reviewed By: riversand963 Differential Revision: D33777836 Pulled By: jay-zhuang fbshipit-source-id: c05de3bc082cc05b5d019f00b324e774bf4bbd96	2022-02-01 09:01:55 -08:00
Yanqin Jin	d10c5c08d3	Remove iter_start_seqnum and preserve_deletes (#9430 ) Summary: According to https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl.cc#L2896:L2911 and https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl_open.cc#L203:L208, we are going to remove `iter_start_seqnum` and `preserve_deletes` starting from RocksDB 7.0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9430 Test Plan: make check and CI Reviewed By: ajkr Differential Revision: D33753639 Pulled By: riversand963 fbshipit-source-id: c80aab8e8d8fc33e52472fed524ed703d0ffc8b6	2022-01-28 13:28:38 -08:00
Yanqin Jin	dd203ed604	Disallow a combination of options (#9348 ) Summary: Disallow `immutable_db_opts.use_direct_io_for_flush_and_compaction == true` and `mutable_db_opts.writable_file_max_buffer_size == 0`, since it causes `WritableFileWriter::Append()` to loop forever and does not make much sense in direct IO. This combination of options itself does not make much sense: asking RocksDB to do direct IO but not allowing RocksDB to allocate a buffer. We should detect this false combination and warn user early, no matter whether the application is running on a platform that supports direct IO or not. In the case of platform not supporting direct IO, it's ok if the user learns about this and then finds that direct IO is not supported. One tricky thing: the constructor of `WritableFileWriter` is being used in our unit tests, and it's impossible to return status code from constructor. Since we do not throw, I put an assertion for now. Fortunately, the constructor is not exposed to external applications. Closing https://github.com/facebook/rocksdb/issues/7109 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9348 Test Plan: make check Reviewed By: ajkr Differential Revision: D33371924 Pulled By: riversand963 fbshipit-source-id: 2a3701ab541cee23bffda8a36cdf37b2d235edfa	2022-01-27 19:30:24 -08:00
Siddhartha Roychowdhury	c27ca23644	Add option for WAL compression algorithm (#9432 ) Summary: Add an option to set the WAL compression algorithm - wal_compression. TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432 Reviewed By: pdillinger Differential Revision: D33797275 Pulled By: sidroyc fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7	2022-01-26 14:23:00 -08:00
Peter Dillinger	8064a3ac31	Fix flaky EventListenerTest.DisableBGCompaction (#9400 ) Summary: Wasn't able to easily reproduce error, but easy to see a race condition between TestFlushListener::OnFlushCompleted and DBTestBase::Close(), which frees CF handles before closing DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9400 Test Plan: CI etc. Reviewed By: riversand963 Differential Revision: D33645134 Pulled By: pdillinger fbshipit-source-id: d0ec914cc43c9e14f53da633876b95b61995138d	2022-01-21 08:25:09 -08:00
zhuchong0329	5f2b661f54	FlushMemTable return ok but memtable does not synchronize flush (#8173 ) Summary: Fix https://github.com/facebook/rocksdb/issues/8046 : FlushMemTable return ok but memtable does not synchronize flush. The way to fix it is to expose RecoveryError. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8173 Reviewed By: ajkr Differential Revision: D31674552 Pulled By: jay-zhuang fbshipit-source-id: 9d16b69ba12a196bb429332ec8224754de97773d	2022-01-12 13:21:49 -08:00
Yanqin Jin	b2e53ab2d8	Add checking for `DB::DestroyColumnFamilyHandle()` (#9347 ) Summary: Closing https://github.com/facebook/rocksdb/issues/5006 Calling `DB::DestroyColumnFamilyHandle(column_family)` with `column_family` being the return value of `DB::DefaultColumnFamily()` will return `Status::InvalidArgument()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9347 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D33369675 Pulled By: riversand963 fbshipit-source-id: a8266a4daddf2b7a773c2dc7f3eb9a4adfb6b6dd	2022-01-05 20:26:53 -08:00
Andrew Kryczka	aa2b3bf675	Added `TraceOptions::preserve_write_order` (#9334 ) Summary: This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB. By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334 Test Plan: - See if below unsynced data loss crash test can run for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering. ``` DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400 ``` Reviewed By: zhichao-cao Differential Revision: D33301990 Pulled By: ajkr fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3	2021-12-28 15:04:26 -08:00
slk	2e5f764294	Make IncreaseFullHistoryTsLow to a public API (#9221 ) Summary: As (https://github.com/facebook/rocksdb/issues/9210) discussed, the full_history_ts_low is a member of CompactRangeOptions currently, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request. However, users may want to advance the fllHistoryTsLow without an immediate compact. This merge make IncreaseFullHistoryTsLow to a public API so users can advance each CF's fullHistoryTsLow seperately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221 Reviewed By: akankshamahajan15 Differential Revision: D33201106 Pulled By: riversand963 fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee	2021-12-23 11:03:51 -08:00
Andrew Kryczka	538d2365e9	Fix race condition in BackupEngineTest.ChangeManifestDuringBackupCreation (#9327 ) Summary: The failure looked like this: ``` utilities/backupable/backupable_db_test.cc:3161: Failure Value of: db_chroot_env_->FileExists(prev_manifest_path).IsNotFound() Actual: false Expected: true ``` The failure could be coerced consistently with the following patch: ``` diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 80410f671..637636791 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -2772,6 +2772,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) { if (job_context.HaveSomethingToClean() \|\| job_context.HaveSomethingToDelete() \|\| !log_buffer.IsEmpty()) { mutex_.Unlock(); + bg_cv_.SignalAll(); + sleep(1); TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound"); // Have to flush the info logs before bg_flush_scheduled_-- // because if bg_flush_scheduled_ becomes 0 and the lock is ``` The cause was a familiar problem, which is manual flush/compaction may return before files they obsoleted are removed. The solution is just to wait for "scheduled" work to complete, which includes all phases including cleanup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9327 Test Plan: after this PR, even the above patch to coerce the bug cannot cause the test to fail. Reviewed By: riversand963 Differential Revision: D33252208 Pulled By: ajkr fbshipit-source-id: 720a7eaca58c7247d221911fffe3d5e1dbf581e9	2021-12-22 21:59:53 -08:00
mrambacher	423538a816	Make MemoryAllocator into a Customizable class (#8980 ) Summary: - Make MemoryAllocator and its implementations into a Customizable class. - Added a "DefaultMemoryAllocator" which uses new and delete - Added a "CountedMemoryAllocator" that counts the number of allocs and free - Updated the existing tests to use these new allocators - Changed the memkind allocator test into a generic test that can test the various allocators. - Added tests for creating all of the allocators - Added tests to verify/create the JemallocNodumpAllocator using its options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980 Reviewed By: zhichao-cao Differential Revision: D32990403 Pulled By: mrambacher fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76	2021-12-17 04:20:47 -08:00
Peter Dillinger	0050a73a4f	New stable, fixed-length cache keys (#9126 ) Summary: This change standardizes on a new 16-byte cache key format for block cache (incl compressed and secondary) and persistent cache (but not table cache and row cache). The goal is a really fast cache key with practically ideal stability and uniqueness properties without external dependencies (e.g. from FileSystem). A fixed key size of 16 bytes should enable future optimizations to the concurrent hash table for block cache, which is a heavy CPU user / bottleneck, but there appears to be measurable performance improvement even with no changes to LRUCache. This change replaces a lot of disjointed and ugly code handling cache keys with calls to a simple, clean new internal API (cache_key.h). (Preserving the old cache key logic under an option would be very ugly and likely negate the performance gain of the new approach. Complete replacement carries some inherent risk, but I think that's acceptable with sufficient analysis and testing.) The scheme for encoding new cache keys is complicated but explained in cache_key.cc. Also: EndianSwapValue is moved to math.h to be next to other bit operations. (Explains some new include "math.h".) ReverseBits operation added and unit tests added to hash_test for both. Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126 Test Plan: ### Basic correctness Several tests needed updates to work with the new functionality, mostly because we are no longer relying on filesystem for stable cache keys so table builders & readers need more context info to agree on cache keys. This functionality is so core, a huge number of existing tests exercise the cache key functionality. ### Performance Create db with `TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters` And test performance with `TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4` using DEBUG_LEVEL=0 and simultaneous before & after runs. Before ops/sec, avg over 100 runs: 121924 After ops/sec, avg over 100 runs: 125385 (+2.8%) ### Collision probability I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity over many months, by making some pessimistic simplifying assumptions: * Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys) * All of every file is cached for its entire lifetime We use a simple table with skewed address assignment and replacement on address collision to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output with `./cache_bench -stress_cache_key -sck_keep_bits=40`: ``` Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached) ``` These come from default settings of 2.5M files per day of 32 MB each, and `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of the 128-bit cache key. With file size of 2\\25 contiguous keys (pessimistic), our simulation is about 2\\(128-40-25) or about 9 billion billion times more prone to collision than reality. More default assumptions, relatively pessimistic: * 100 DBs in same process (doesn't matter much) * Re-open DB in same process (new session ID related to old session ID) on average every 100 files generated * Restart process (all new session IDs unrelated to old) 24 times per day After enough data, we get a result at the end: ``` (keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected) ``` If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data: ``` (keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected) (keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected) ``` The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases: ``` 197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected) ``` I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data. Reviewed By: zhichao-cao Differential Revision: D33171746 Pulled By: pdillinger fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f	2021-12-16 17:15:13 -08:00
Yanqin Jin	bd513fd075	Add commit marker with timestamp (#9266 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266 This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing this tag to WAL, thus it is unavailable to users. This is an ongoing effort to add user-defined timestamp support to write-committed transactions. This diff also indicates all column families that may potentially participate in the same transaction must either disable timestamp or have the same timestamp format, since `CommitWithTimestamp` tag is followed by a single byte-array denoting the commit timestamp of the transaction. We will enforce this checking in a future diff. We keep this diff small. Reviewed By: ltamasi Differential Revision: D31721350 fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6	2021-12-10 11:05:35 -08:00
Si Ke	79f4a04ee3	Get DBTest passing Assert Status Checked (#7737 ) Summary: Closes https://github.com/facebook/rocksdb/pull/7737 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9231 Reviewed By: hx235 Differential Revision: D32978332 Pulled By: pdillinger fbshipit-source-id: b28900b685d60c668529a90dbaa8e1b357b28f76	2021-12-09 11:00:17 -08:00
sdong	88875df821	File temperature information should be preserved when restart the DB (#9242 ) Summary: Fix a bug that causes file temperature not preserved after DB is restarted, or options.max_manifest_file_size is hit. Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242 Test Plan: Add a unit test that would fail without the fix. Reviewed By: jay-zhuang Differential Revision: D32818150 fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec	2021-12-03 14:43:14 -08:00
lgqss	77c7085594	MemTableList::TrimHistory now use allocated bytes (#9020 ) Summary: Fix a bug when both max_write_buffer_size_to_maintain and max_write_buffer_number_to_maintain are 0. The bug was introduced in 6.5.0 and https://github.com/facebook/rocksdb/issues/5022. Fix https://github.com/facebook/rocksdb/issues/8371 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9020 Reviewed By: pdillinger Differential Revision: D32767084 Pulled By: ajkr fbshipit-source-id: c401ee6e2557230e892d0fe8abb4966cbd18e85f	2021-12-02 11:45:39 -08:00
Peter Dillinger	2a67d475f1	Fix bug affecting GetSortedWalFiles, Backups, Checkpoint (#9208 ) Summary: Saw error like this: `Backup failed -- IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or directory` Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.) relies on no file deletions happening while its operating, which means not only disabling (more) deletions, but ensuring any pending deletions are completed. Two fixes related to this: * There was a gap in several places between decrementing pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where the db mutex would be released and GetSortedWalFiles (and others) could get false information that no deletions are pending. * The fix to https://github.com/facebook/rocksdb/issues/8591 (disabling deletions in GetSortedWalFiles) seems incomplete because it doesn't prevent pending deletions from occuring during the operation (if deletions not already disabled, the case that was to be fixed by the change). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9208 Test Plan: existing tests (it's hard to write a test for interleavings that are now excluded - this is what stress test is for) Reviewed By: ajkr Differential Revision: D32630675 Pulled By: pdillinger fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178	2021-11-24 14:52:00 -08:00
Jay Zhuang	6cde8d2190	Deprecating `iter_start_seqnum` and `preserve_deletes` (#9091 ) Summary: `ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are deprecated, please try using user defined timestamp feature instead. The feature is used to support differential snapshots, but not well maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which returns an internal key from the iterator. The user defined timestamp feature is a more flexible feature to support similar usecase, please switch to that if you have such usecase. The deprecated feature will be removed in a future release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091 Test Plan: check LOG Fix https://github.com/facebook/rocksdb/issues/9090 Reviewed By: ajkr Differential Revision: D32071750 Pulled By: jay-zhuang fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104	2021-11-19 16:55:45 -08:00
Yanqin Jin	1e8322c0f5	Fix a bug in FlushJob picking more memtables beyond synced WALs (#9142 ) Summary: After RocksDB 6.19 and before this PR, RocksDB FlushJob may pick more memtables to flush beyond synced WALs. This can be problematic if there are multiple column families, since it can prematurely advance the flushed column family's log_number. Should subsequent attempts fail to sync the latest WALs and the database goes through a recovery, it may detect corrupted WAL number below the flushed column family's log number and complain about column family inconsistency. To fix, we record the maximum memtable ID of the column family being flushed. Then we call SyncClosedLogs() so that all closed WALs at the time when memtable ID is recorded will be synced. I also disabled a unit test temporarily due to reasons described in https://github.com/facebook/rocksdb/issues/9151 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9142 Test Plan: make check Reviewed By: ajkr Differential Revision: D32299956 Pulled By: riversand963 fbshipit-source-id: 0da75888177d91905cf8c9d00605b73afb5970a7	2021-11-19 09:56:00 -08:00
Andrew Kryczka	8cf4294e25	Adhere to per-DB concurrency limit when bottom-pri compactions exist (#9179 ) Summary: - Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect. - Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179 Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency. Reviewed By: riversand963 Differential Revision: D32488048 Pulled By: ajkr fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310	2021-11-18 17:31:50 -08:00
Yanqin Jin	2035798834	Update TransactionUtil::CheckKeyForConflict to also use timestamps (#9162 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162 Existing TransactionUtil::CheckKeyForConflict() performs only seq-based conflict checking. If user-defined timestamp is enabled, it should perform conflict checking based on timestamps too. Update TransactionUtil::CheckKey-related methods to verify the timestamp of the latest version of a key is smaller than the read timestamp. Note that CheckKeysForConflict() is not updated since it's used only by optimistic transaction, and we do not plan to update it in this upcoming batch of diffs. Existing GetLatestSequenceForKey() returns the sequence of the latest version of a specific user key. Since we support user-defined timestamp, we need to update this method to also return the timestamp (if enabled) of the latest version of the key. This will be needed for snapshot validation. Reviewed By: ltamasi Differential Revision: D31567960 fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44	2021-11-15 12:52:18 -08:00
slk	937fbcbddc	Track per-SST user-defined timestamp information in MANIFEST (#9092 ) Summary: Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957 Rockdb has supported user-defined timestamp feature. Application can specify a timestamp when writing each k-v pair. When data flush from memory to disk file called SST files, file creation activity will commit to MANIFEST. This commit is for tracking timestamp info in the MANIFEST for each file. The changes involved are as follows: 1) Track max/min timestamp in FileMetaData, and fix invoved codes. 2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMinTimestamp in NewFileCustomTag ( in the kNewFile4 part ), and support invoved codes such as VersionEdit Encode and Decode etc. 3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix invoved test codes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092 Reviewed By: ajkr, akankshamahajan15 Differential Revision: D32252323 Pulled By: riversand963 fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162	2021-11-10 10:49:04 -08:00
Jay Zhuang	29102641dd	Skip directory fsync for filesystem btrfs (#8903 ) Summary: Directory fsync might be expensive on btrfs and it may not be needed. Here are 4 directory fsync cases: 1. creating a new file: dir-fsync is not needed on btrfs, as long as the new file itself is synced. 2. renaming a file: dir-fsync is not needed if the renamed file is synced. So an API `FsyncAfterFileRename(filename, ...)` is provided to sync the file on btrfs. By default, it just calls dir-fsync. 3. deleting files: dir-fsync is forced by set `IOOptions.force_dir_fsync = true` 4. renaming multiple files (like backup and checkpoint): dir-fsync is forced, the same as above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903 Test Plan: run tests on btrfs and non btrfs Reviewed By: ajkr Differential Revision: D30885059 Pulled By: jay-zhuang fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a	2021-11-03 12:21:27 -07:00
leipeng	230c98f3ce	fix histogram NUM_FILES_IN_SINGLE_COMPACTION (#9026 ) Summary: currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026 Reviewed By: ajkr Differential Revision: D31668241 Pulled By: jay-zhuang fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af	2021-11-01 12:57:27 -07:00
leipeng	2b70224f82	remove bad extra RecordTick(stats_, WRITE_WITH_WAL) (#9064 ) Summary: This PR fix wrong ticker `WRITE_WITH_WAL`. `RecordTick(WRITE_WITH_WAL)` will be called later in `WriteToWAL` and `ConcurrentWriteToWAL`. Fixes: 1. Delete these two extra `RecordTick(WRITE_WITH_WAL)` 2. Fix corresponding test case Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064 Reviewed By: ajkr Differential Revision: D31944459 Pulled By: riversand963 fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086	2021-11-01 11:43:14 -07:00
Andrew Kryczka	f24c39ab3d	Prevent corruption with parallel manual compactions and `change_level == true` (#9077 ) Summary: The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well. Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction. The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps. Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default). The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled. Thanks to siying for finding the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077 Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions. Reviewed By: siying Differential Revision: D31921908 Pulled By: ajkr fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb	2021-10-27 23:08:56 -07:00
Yanqin Jin	f72fd58565	Fix atomic flush waiting forever for MANIFEST write (#9034 ) Summary: In atomic flush, concurrent background flush threads will commit to the MANIFEST one by one, in the order of the IDs of their picked memtables for all included column families. Each time, a background flush thread decides whether to wait based on two criteria: - Is db stopped? If so, don't wait. - Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go. When atomic flush was implemented, error writing to or syncing the MANIFEST would cause the db to be stopped. Therefore, this background thread does not have to check for the background error while waiting. If there has been such an error, `DBStopped()` would have been true, and this thread will not wait forever. After we improved error handling, RocksDB may map an IOError while writing to MANIFEST to a soft error, if there is no WAL. This requires the background threads to check for background error while waiting. Otherwise, a background flush thread may wait forever. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D31639225 Pulled By: riversand963 fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd	2021-10-20 21:34:47 -07:00
leipeng	0a73ada7b5	remove unused local obj and simpilify comple code (#9052 ) Summary: This PR does not change code sematics, it just changes for: 1. local obj `nonmem_w` and `lfile` are unused 2. null check for `delete ptr` is unnecessary 3. use `unique_ptr::reset` instead of `release` + `delete` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9052 Reviewed By: zhichao-cao Differential Revision: D31801661 Pulled By: anand1976 fbshipit-source-id: 16a77d45da8c8833bf5bf3bce546bb3711b335df	2021-10-20 14:08:05 -07:00
leipeng	0c53b41856	db_impl_write.cc: use stats_ instead of immutable_db_options_.stats (#9053 ) Summary: This PR has no semantic changes, just to make code shorter. `stats_` has value same with `immutable_db_options_.stats`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9053 Reviewed By: zhichao-cao Differential Revision: D31801603 Pulled By: anand1976 fbshipit-source-id: cbd8fe478d3e90ae078ace49b4f2eb9bb028ccf6	2021-10-20 14:04:59 -07:00
Peter Dillinger	ad5325a736	Experimental support for SST unique IDs (#8990 ) Summary: * New public header unique_id.h and function GetUniqueIdFromTableProperties which computes a universally unique identifier based on table properties of table files from recent RocksDB versions. * Generation of DB session IDs is refactored so that they are guaranteed unique in the lifetime of a process running RocksDB. (SemiStructuredUniqueIdGen, new test included.) Along with file numbers, this enables SST unique IDs to be guaranteed unique among SSTs generated in a single process, and "better than random" between processes. See https://github.com/pdillinger/unique_id * In addition to public API producing 'external' unique IDs, there is a function for producing 'internal' unique IDs, with functions for converting between the two. In short, the external ID is "safe" for things people might do with it, and the internal ID enables more "power user" features for the future. Specifically, the external ID goes through a hashing layer so that any subset of bits in the external ID can be used as a hash of the full ID, while also preserving uniqueness guarantees in the first 128 bits (bijective both on first 128 bits and on full 192 bits). Intended follow-up: * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into the third 64-bit value of the unique ID.) * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990 Test Plan: Unit tests added, and checking of unique ids in stress test. NOTE in stress test we do not generate nearly enough files to thoroughly stress uniqueness, but the test trims off pieces of the ID to check for uniqueness so that we can infer (with some assumptions) stronger properties in the aggregate. Reviewed By: zhichao-cao, mrambacher Differential Revision: D31582865 Pulled By: pdillinger fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243	2021-10-18 23:32:01 -07:00
Jay Zhuang	314de7e7de	Make `DB::Close()` thread-safe (#8970 ) Summary: If `DB::Close()` is called in multi-thread env, the resource could be double released, which causes exception or assert. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970 Test Plan: Test with multi-thread benchmark, with each thread try to close the DB at the end. Reviewed By: pdillinger Differential Revision: D31242042 Pulled By: jay-zhuang fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb	2021-10-18 20:32:35 -07:00
Peter Dillinger	3ffb3baa0b	Add (Live)FileStorageInfo API (#8968 ) Summary: New classes FileStorageInfo and LiveFileStorageInfo and 'experimental' function DB::GetLiveFilesStorageInfo, which is intended to largely replace several fragmented DB functions needed to create checkpoints and backups. This function is now used to create checkpoints and backups, because it fixes many (probably not all) of the prior complexities of checkpoint not having atomic access to DB metadata. This also ensures strong functional test coverage of the new API. Specifically, much of the old CheckpointImpl::CreateCustomCheckpoint has been migrated to and updated in DBImpl::GetLiveFilesStorageInfo, with the former now calling the latter. Also, the class FileStorageInfo in metadata.h compatibly replaces BackupFileInfo and serves as a new base class for SstFileMetaData. Some old fields of SstFileMetaData are still provided (for now) but deprecated. Although FileStorageInfo::directory is accurate when using db_paths and/or cf_paths, these have never been supported by Checkpoint nor BackupEngine and still are not. This change does now detect these cases and return NotSupported when appropriate. (More work needed for support.) Somehow this change broke ProgressCallbackDuringBackup, but the progress_callback logic was dubious to begin with because it would call the callback based on copy buffer size, not size actually copied. Logic and test updated to track size actually copied per-thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968 Test Plan: tests updated. DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl. DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo, including reading the data after DB close. Added CheckpointTest.CheckpointWithDbPath (NotSupported). Reviewed By: siying Differential Revision: D31242045 Pulled By: pdillinger fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0	2021-10-16 10:04:32 -07:00
Giuseppe Ottaviano	4bfd415e34	Fix sequence number bump logic in multi-CF SST ingestion (#9005 ) Summary: The code in `IngestExternalFiles()` that bumps the DB's sequence number depending on what seqnos were assigned to the files has 3 bugs: 1) There is an assertion that the sequence number is increased in all the affected column families, but this is unnecessary, it is fine if some files can stick to a lower sequence number. It is very easy to hit the assertion: it is sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that doesn't (for example the CF is empty). The line added in the `IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail. 2) SetLastSequence() is called with the sum of all the bumps across CFs, but we should take the maximum instead, as all CFs start with the current seqno and bump it independently. 3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in optimized builds, so some files may be assigned seqnos from the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005 Test Plan: Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that triggers the assertion, verified that the test (and all the others) pass after the fix. Reviewed By: ajkr Differential Revision: D31597892 Pulled By: ot fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770	2021-10-12 20:39:52 -07:00
Giuseppe Ottaviano	22d4dc5066	Fix race in WriteBufferManager (#9009 ) Summary: EndWriteStall has a data race: `queue_.empty()` is checked outside of the mutex, so once we enter the critical section another thread may already have cleared the list, and accessing the `front()` is undefined behavior (and causes interesting crashes under high concurrency). This PR fixes the bug, and also rewrites the logic to make it easier to reason about it. It also fixes another subtle bug: if some writers are stalled and `SetBufferSize(0)` is called, which disables the WBM, the writer are not unblocked because of an early `enabled()` check in `EndWriteStall()`. It doesn't significantly change the locking behavior, as before writers won't lock unless entering a stall condition, and `FreeMem` almost always locks if stalling is allowed, but that is inevitable with the current design. Liveness is guaranteed by the fact that if some writes are blocked, eventually all writes will be blocked due to `stall_active_`, and eventually all memory is freed. While at it, do a couple of optimizations: - In `WBMStallInterface::Signal()` signal the CV only after releasing the lock. Signaling under the lock is a common pitfall, as it causes the woken-up thread to immediately go back to sleep because the mutex is still locked by the awaker. - Move all allocations and deallocations outside of the lock. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9009 Test Plan: ``` USE_CLANG=1 make -j64 all check ``` Reviewed By: akankshamahajan15 Differential Revision: D31550668 Pulled By: ot fbshipit-source-id: 5125387c3dc7ecaaa2b8bbc736e58c4156698580	2021-10-12 00:16:21 -07:00
Zhichao Cao	bcd049cd2d	Ingest external SST files with Temperature hints (#8949 ) Summary: Add the file temperature to `IngestExternalFileArg` such that when SST files are ingested, user is able to assign the temperature to each SST file. If the temperature vector is empty or its size does not match the file name vector size, all ingested SST files will be assigned with `Temperature::unKnown`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8949 Test Plan: add the new test and make check Reviewed By: siying Differential Revision: D31127852 Pulled By: zhichao-cao fbshipit-source-id: 141a81f0f7b473d88f4ab0cb2a21a114cbc6f83c	2021-10-08 10:32:24 -07:00
Andrew Kryczka	fcaa7ff638	Cancel manual compactions waiting on automatic compactions to drain (#8991 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991 Test Plan: the new test hangs forever without this fix and passes with this fix. Reviewed By: hx235 Differential Revision: D31456419 Pulled By: ajkr fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff	2021-10-07 15:23:55 -07:00
anand76	532ff334d9	Don't ignore deletion rate limit if WAL dir is different (#8967 ) Summary: If WAL dir is different from the DB dir, we should still honor the SstFileManager deletion rate limit for SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8967 Test Plan: Add a new unit test in db_sst_test Reviewed By: pdillinger Differential Revision: D31220116 Pulled By: anand1976 fbshipit-source-id: bcde8a53a7d728e15e597fb5d07ee86c1b38bd28	2021-09-30 13:26:31 -07:00
mrambacher	13ae16c315	Cleanup includes in dbformat.h (#8930 ) Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d	2021-09-29 04:04:40 -07:00
Akanksha Mahajan	78afb4d81e	Support SingleDelete for user-defined timestamps (#8921 ) Summary: Added support for SingleDelete for user-defined timestamps. Users can now Get and Iterate over keys deleted with SingleDelete. It also includes changes in CompactionIterator which preserves the same user key with different timestamps, unless the timestamp is below a certain threshold full_history_ts_low. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8921 Test Plan: Added new unit tests Reviewed By: riversand963 Differential Revision: D31098191 Pulled By: akankshamahajan15 fbshipit-source-id: 78a59ef4b4884ae324fcd10f56e62a27d5ee2f49	2021-09-27 11:51:07 -07:00
Yanqin Jin	b512f4bc76	Batch blob read IO for MultiGet (#8699 ) Summary: In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()` to read the blobs instead of issuing multiple `Read()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8699 Test Plan: ``` make check ``` Reviewed By: ltamasi Differential Revision: D31030861 Pulled By: riversand963 fbshipit-source-id: a0df6060cbfd54cff9515a4eee08807b1dbcb0c8	2021-09-17 19:23:13 -07:00
Akanksha Mahajan	d6aa8c49f8	Expose blob file information through the EventListener interface (#8675 ) Summary: 1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files. 2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted. 3. Test OnFile*Finish operations notifications with Blob Files. 4. Log the blob file creation/deletion events through EventLogger in Log file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675 Test Plan: Add new unit tests in listener_test Reviewed By: ltamasi Differential Revision: D30412613 Pulled By: akankshamahajan15 fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07	2021-09-16 17:23:36 -07:00
eharry	0b6be7eb68	Fix WAL log data corruption #8723 (#8746 ) Summary: Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together (https://github.com/facebook/rocksdb/issues/8723) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8746 Reviewed By: ajkr Differential Revision: D30758468 Pulled By: riversand963 fbshipit-source-id: 07c20899d5f2447dc77861b4845efc68a59aa4e8	2021-09-13 20:15:59 -07:00
Yanqin Jin	2a2b3e03a5	Allow WriteBatch to have keys with different timestamp sizes (#8725 ) Summary: In the past, we unnecessarily requires all keys in the same write batch to be from column families whose timestamps' formats are the same for simplicity. Specifically, we cannot use the same write batch to write to two column families, one of which enables timestamp while the other disables it. The limitation is due to the member `timestamp_size_` that used to exist in each `WriteBatch` object. We pass a timestamp_size to the constructor of `WriteBatch`. Therefore, users can simply use the old `WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while the internal implementation of `WriteBatch` will take care of memory allocation for timestamps. The above is not necessary. One the one hand, users can set up a memory buffer to store user key and then contiguously append the timestamp to the user key. Then the user can pass this buffer to the `WriteBatch::Put(Slice&)` API. On the other hand, users can set up a SliceParts object which is an array of Slices and let the last Slice to point to the memory buffer storing timestamp. Then the user can pass the SliceParts object to the `WriteBatch::Put(SliceParts&)` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725 Test Plan: make check Reviewed By: ltamasi Differential Revision: D30654499 Pulled By: riversand963 fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2	2021-09-12 15:34:26 -07:00
mrambacher	beed86473a	Make MemTableRepFactory into a Customizable class (#8419 ) Summary: This PR does the following: -> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString -> Makes the existing implementations compatible with configurations -> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419 Reviewed By: zhichao-cao Differential Revision: D29558961 Pulled By: mrambacher fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c	2021-09-08 07:46:44 -07:00
Andrew Kryczka	941543721d	Bytes read stat for `VerifyChecksum()` and `VerifyFileChecksums()` APIs (#8741 ) Summary: - Clarified some comments on compatibility for adding new ticker stats - Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741 Test Plan: new unit test Reviewed By: zhichao-cao Differential Revision: D30708578 Pulled By: ajkr fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449	2021-09-07 13:28:29 -07:00
Peter Dillinger	4750421ece	Replace most typedef with using= (#8751 ) Summary: Old typedef syntax is confusing Most but not all changes with perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files make format Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751 Test Plan: existing Reviewed By: zhichao-cao Differential Revision: D30745277 Pulled By: pdillinger fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9	2021-09-07 11:31:59 -07:00
Peter Dillinger	c9cd5d25a8	Remove some unneeded code (#8736 ) Summary: * FullKey and ParseFullKey appear to serve no purpose in the public API (or anything else) so removed. Only use in one test updated. * NumberToString serves no purpose vs. ToString so removed, numerous calls updated * Remove unnecessary forward declarations in metadata.h by re-arranging class definitions. * Remove some unneeded semicolons Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736 Test Plan: existing tests Reviewed By: mrambacher Differential Revision: D30700039 Pulled By: pdillinger fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2	2021-09-01 14:28:58 -07:00
Peter Dillinger	13ded69484	Built-in support for generating unique IDs, bug fix (#8708 ) Summary: Env::GenerateUniqueId() works fine on Windows and on POSIX where /proc/sys/kernel/random/uuid exists. Our other implementation is flawed and easily produces collision in a new multi-threaded test. As we rely more heavily on DB session ID uniqueness, this becomes a serious issue. This change combines several individually suitable entropy sources for reliable generation of random unique IDs, with goal of uniqueness and portability, not cryptographic strength nor maximum speed. Specifically: * Moves code for getting UUIDs from the OS to port::GenerateRfcUuid rather than in Env implementation details. Callers are now told whether the operation fails or succeeds. * Adds an internal API GenerateRawUniqueId for generating high-quality 128-bit unique identifiers, by combining entropy from three "tracks": * Lots of info from default Env like time, process id, and hostname. * std::random_device * port::GenerateRfcUuid (when working) * Built-in implementations of Env::GenerateUniqueId() will now always produce an RFC 4122 UUID string, either from platform-specific API or by converting the output of GenerateRawUniqueId. DB session IDs now use GenerateRawUniqueId while DB IDs (not as critical) try to use port::GenerateRfcUuid but fall back on GenerateRawUniqueId with conversion to an RFC 4122 UUID. GenerateRawUniqueId is declared and defined under env/ rather than util/ or even port/ because of the Env dependency. Likely follow-up: enhance GenerateRawUniqueId to be faster after the first call and to guarantee uniqueness within the lifetime of a single process (imparting the same property onto DB session IDs). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708 Test Plan: A new mini-stress test in env_test checks the various public and internal APIs for uniqueness, including each track of GenerateRawUniqueId individually. We can't hope to verify anywhere close to 128 bits of entropy, but it can at least detect flaws as bad as the old code. Serial execution of the new tests takes about 350 ms on my machine. Reviewed By: zhichao-cao, mrambacher Differential Revision: D30563780 Pulled By: pdillinger fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364	2021-08-30 15:20:41 -07:00
James Yin	7ddc096d7d	Fix typo in the comment of log_empty_ (#8711 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8711 Reviewed By: riversand963 Differential Revision: D30566761 Pulled By: jay-zhuang fbshipit-source-id: dd4690f5e2af2d263ed75ea1b9ed24692fe81362	2021-08-27 12:10:29 -07:00
anand76	ebaa3c8a59	Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714 ) Summary: DumpStats() iterates through the ColumnFamilySet. There is a potential race condition because it does Ref the cfd, and the cfd could get destroyed during the iteration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714 Test Plan: make check Reviewed By: ltamasi Differential Revision: D30580199 Pulled By: anand1976 fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850	2021-08-26 15:40:26 -07:00
Yanqin Jin	229350ef48	Allow iterate refresh for secondary instance (#8700 ) Summary: Test plan make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700 Reviewed By: zhichao-cao Differential Revision: D30523907 Pulled By: riversand963 fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91	2021-08-24 15:40:56 -07:00
Akanksha Mahajan	5efec84c60	Fix blob callback in compaction and atomic flush (#8681 ) Summary: Pass BlobFileCompletionCallback in case of atomic flush and compaction job which is currently nullptr(default parameter). BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681 Test Plan: CircleCI jobs Reviewed By: ltamasi Differential Revision: D30445998 Pulled By: akankshamahajan15 fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021	2021-08-20 11:41:14 -07:00
Baptiste Lemaire	c625b8d017	Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event listeners to mempurge tests. (#8672 ) Summary: Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was called, which created a series of issues because the JobInfo was not correctly collected from the memtables. This diff aims at correcting these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge` , no call to `DBImpl::NotifyOnFlushComplete` after successful mempurge). Event listeners were added to the unit tests to handle these situations. Surprisingly none of the crashtests caught this issue, I will try to add event listeners to crash tests in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672 Reviewed By: akankshamahajan15 Differential Revision: D30383109 Pulled By: bjlemaire fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73	2021-08-18 17:40:01 -07:00
Peter Dillinger	b6269b078a	Stable cache keys on ingested SST files (#8669 ) Summary: Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even the same file ingested into different DBs sharing a block cache. Note: These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669 Test Plan: Extended unit test Reviewed By: zhichao-cao Differential Revision: D30398532 Pulled By: pdillinger fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7	2021-08-18 11:33:03 -07:00
Adam Retter	5de333fd99	Add db_test2 to to ASSERT_STATUS_CHECKED (#8640 ) Summary: This is the `db_test2` parts of https://github.com/facebook/rocksdb/pull/7737 reworked on the latest HEAD. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8640 Reviewed By: akankshamahajan15 Differential Revision: D30303684 Pulled By: mrambacher fbshipit-source-id: 263e2f82d849bde4048b60aed8b31e7deed4706a	2021-08-16 08:10:32 -07:00
Merlin Mao	f58d276764	Make TraceRecord and Replayer public (#8611 ) Summary: New public interfaces: `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h". `Replayer`, available in `rocksdb/utilities/replayer.h`. User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file. Unit tests: - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes. - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611 Reviewed By: ajkr Differential Revision: D30266329 Pulled By: autopear fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8	2021-08-11 19:32:46 -07:00
Baptiste Lemaire	e3a96c4823	Memtable sampling for mempurge heuristic. (#8628 ) Summary: Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option. This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value. Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate. At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation. Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries. The unit tests have been readapted to support this new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628 Reviewed By: pdillinger Differential Revision: D30149315 Pulled By: bjlemaire fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27	2021-08-10 18:09:03 -07:00
Levi Tamasi	87882736ef	Fix the sorting of KeyContexts for batched MultiGet (#8633 ) Summary: `CompareKeyContext::operator()` on the trunk has a bug: when comparing column family IDs, `lhs` is used for both sides of the comparison. This results in the `KeyContext`s getting sorted solely based on key, which in turn means that keys with the same column family do not necessarily form a single range in the sorted list. This violates an assumption of the batched `MultiGet` logic, leading to the same column family showing up multiple times in the list of `MultiGetColumnFamilyData`. The end result is the code attempting to check out the thread-local `SuperVersion` for the same CF multiple times, causing an assertion violation in debug builds and memory corruption/crash in release builds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D30169182 Pulled By: ltamasi fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023	2021-08-06 16:27:42 -07:00
Levi Tamasi	9b25d26dc8	Attempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624 ) Summary: We've been seeing occasional crashes on CI while inserting into the vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes don't reproduce locally (could be either a race or an object lifecycle issue) but the good news is that the vectors in question are not really used for anything meaningful by the test. (The assertion about the sizes of the two vectors being equal is guaranteed to hold, since the two sync points where they are populated are right after each other.) The patch simply removes the vectors from the test, alongside the associated callbacks and sync points. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D30118485 Pulled By: ltamasi fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647	2021-08-05 18:36:16 -07:00
Yanqin Jin	0879c24040	Fix NotifyOnFlushCompleted() for atomic flush (#8585 ) Summary: PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure `OnFlushCompleted()` is called after committing flush results to MANIFEST. However, `flush_jobs_info_` is not updated in atomic flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`. This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D29913720 Pulled By: riversand963 fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda	2021-08-03 13:31:10 -07:00
Levi Tamasi	3f7e929865	Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605 ) Summary: The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk unlocks the DB mutex before destroying the `ThreadLocalPtr` holding the per-thread `SuperVersion` pointers when the only remaining reference is the back reference from `super_version_`. The idea behind this was to break the circular dependency between `ColumnFamilyData` and `SuperVersion`: when the penultimate reference goes away, `ColumnFamilyData` can clean up the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there is a `SuperVersion` and it is not referenced by anything else.) However, unlocking the mutex throws a wrench in this plan by making it possible for another thread to jump in and take another reference to the `ColumnFamilyData`, keeping the object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like https://github.com/facebook/rocksdb/issues/8440 , https://github.com/facebook/rocksdb/issues/8382 , and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet` destructor we sometimes observe during close in our stress tests. Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call `SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone. https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced the circular dependency and associated cleanup logic described above (in order to enable iterators to keep the `ColumnFamilyData` for dropped column families alive), and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`. Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete` in `SuperVersion::Cleanup`. The patch simply eliminates the unlocking and relocking, which has been unnecessary ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free. This closes the window during which another thread could increase the reference count, and hopefully fixes the issues above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605 Test Plan: Ran `make check` and stress tests locally. Reviewed By: pdillinger Differential Revision: D30051035 Pulled By: ltamasi fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528	2021-08-02 18:12:11 -07:00
mrambacher	ab7f7c9e49	Allow WAL dir to change with db dir (#8582 ) Summary: Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file. After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path). Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582 Reviewed By: pdillinger, autopear Differential Revision: D29881122 Pulled By: mrambacher fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb	2021-07-30 12:16:44 -07:00
Yanqin Jin	066b51126d	Several simple local code clean-ups (#8565 ) Summary: This PR tries to remove some unnecessary checks as well as unreachable code blocks to improve readability. An obvious non-public API method naming typo is also corrected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565 Test Plan: make check Reviewed By: lth Differential Revision: D29963984 Pulled By: riversand963 fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a	2021-07-30 12:07:49 -07:00
Peter Dillinger	1d34cd797e	Fix insecure internal API for GetImpl (#8590 ) Summary: Calling the GetImpl function could leave reference to a local callback function in a field of a parameter struct. As this is performance-critical code, I'm not going to attempt to sanitize this code too much, but make the existing hack a bit cleaner by reverting what it overwrites in the input struct. Added SaveAndRestore utility class to make that easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590 Test Plan: added unit test for SaveAndRestore; existing tests for GetImpl Reviewed By: riversand963 Differential Revision: D29947983 Pulled By: pdillinger fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb	2021-07-29 17:23:01 -07:00
sdong	e8f218cb68	DB::GetSortedWalFiles() to ensure file deletion is disabled (#8591 ) Summary: If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done. Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591 Test Plan: Run all existing tests Reviewed By: pdillinger Differential Revision: D29969412 fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de	2021-07-29 11:51:08 -07:00
Baptiste Lemaire	c521a9ab2b	Retire superfluous functions introduced in earlier mempurge PRs. (#8558 ) Summary: The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge. By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558 Test Plan: Already included in `db_flush_test.cc`. Reviewed By: anand1976 Differential Revision: D29764351 Pulled By: bjlemaire fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437	2021-07-22 18:29:13 -07:00
Yanqin Jin	2e5388178f	Return error if trying to open secondary on missing or inaccessible primary (#8200 ) Summary: If the primary's CURRENT file is missing or inaccessible, the secondary should not hang trying repeatedly to switch to the next MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D27840627 Pulled By: riversand963 fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1	2021-07-22 15:48:58 -07:00
Peter Dillinger	84eef260de	Remove TaskLimiterToken::ReleaseOnce for fix (#8567 ) Summary: Rare TSAN and valgrind failures are caused by unnecessary reading of a field on the TaskLimiterToken::limiter_ for an assertion after the token has been released and the limiter destroyed. To simplify we can simply destroy the token before triggering DB shutdown (potentially destroying the limiter). This makes the ReleaseOnce logic unnecessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567 Test Plan: watch for more failures in CI Reviewed By: ajkr Differential Revision: D29811795 Pulled By: pdillinger fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3	2021-07-21 17:37:53 -07:00
Jay Zhuang	42eaa45c1b	Avoid updating option if there's no value updated (#8518 ) Summary: Try avoid expensive updating options operation if `SetDBOptions()` does not change any option value. Skip updating is not guaranteed, for example, changing `bytes_per_sync` to `0` may still trigger updating, as the value could be sanitized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D29672639 Pulled By: jay-zhuang fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4	2021-07-21 13:45:59 -07:00
sdong	9e885939a3	Change to code for trimmed memtable history is to released outside DB mutex (#8530 ) Summary: Currently, the code shows that we delete memtables immedately after it is trimmed from history. Although it should never happen as the super version still holds the memtable, which is only switched after it, it feels a good practice not to do it, but use clean it up in the standard way: put it to WriteContext and clean it after DB mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8530 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D29703410 fbshipit-source-id: 21d8068ac6377de4b6fa7a89697195742659fde4	2021-07-16 19:28:48 -07:00
Peter Dillinger	df5dc73bec	Don't hold DB mutex for block cache entry stat scans (#8538 ) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b	2021-07-16 14:13:08 -07:00
Baptiste Lemaire	206845c057	Mempurge support for wal (#8528 ) Summary: In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables. The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables. This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528 Reviewed By: pdillinger Differential Revision: D29701097 Pulled By: bjlemaire fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9	2021-07-15 17:49:13 -07:00
longlijian	803a40d412	Delete legacy code not used any more. (#8508 ) Summary: The removed function in this PR, just only have declared and dose not have any reference used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8508 Reviewed By: mrambacher Differential Revision: D29649033 Pulled By: jay-zhuang fbshipit-source-id: df98143b73d6c184a2a60c9f7ea2548a065ee35d	2021-07-14 16:04:56 -07:00
Dmitry Vorobev	0b75b22321	Implement missing Handler methods in ColumnFamilyCollector. (#8456 ) Summary: When db is open as secondary, there are basically 2 step process: 1) Collect column families from wal log 2) Apply changes to Memtable In case primary db is TransactionDB instance, wal log will contain some additional data, like noop, etc. ColumnFamilyCollector doesn't implement methods to handle these, so it fails to open a wal log written by TransactionDB. (Everything works fine with standard DB::Open). Memtable recovery process knows how to handle such wal logs, so only missing piece seems to be ColumnFamilyCollector. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8456 Reviewed By: ajkr Differential Revision: D29455945 Pulled By: mrambacher fbshipit-source-id: 5b29560fcbc008e17e95d0dc4b07558f3d63e26f	2021-07-12 09:23:45 -07:00
anand76	d1b70b05a6	Avoid passing existing BG error to WriteStatusCheck (#8511 ) Summary: In ```DBImpl::WriteImpl()```, we call ```PreprocessWrite()``` which, among other things, checks the BG error and returns it set. This return status is later on passed to ```WriteStatusCheck()```, which calls ```SetBGError()```. This results in a spurious call, and info logs, on every user write request. We should avoid passing the ```PreprocessWrite()``` return status to ```WriteStatusCheck()```, as the former would have called ```SetBGError()``` already if it encountered any new errors, such as error when creating a new WAL file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8511 Test Plan: Run existing tests Reviewed By: zhichao-cao Differential Revision: D29639917 Pulled By: anand1976 fbshipit-source-id: 19234163969e1645dbeb273712aaf5cd9ea2b182	2021-07-11 22:37:52 -07:00
Baptiste Lemaire	837705ad80	Make mempurge a background process (equivalent to in-memory compaction). (#8505 ) Summary: In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype. In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...). Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge. MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`. The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505 Reviewed By: pdillinger Differential Revision: D29619283 Pulled By: bjlemaire fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95	2021-07-09 17:23:59 -07:00
longlijian	ac3f3f3719	Eliminate compiler complaining, which the return type of the function… (#8498 ) Summary: … should be uint64_t. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8498 Reviewed By: jay-zhuang Differential Revision: D29605064 Pulled By: ajkr fbshipit-source-id: e431448ac9d8a37ae83679c4cc5732e29fe49de4	2021-07-08 10:09:05 -07:00
Baptiste Lemaire	714ce5041d	Fix clang_analyzer failure (#8492 ) Summary: Previously, the following command: ```USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j$(nproc) analyze``` was raising an error/warning the new_mem could potentially be a `nullptr`. This error appeared due to code changes from https://github.com/facebook/rocksdb/issues/8454, including an if-statement containing "`... && new_mem != nullptr && ...`", which made the analyzer believe that past this `if`-statement, a `new_mem==nullptr` was a possible scenario. This code patch simply introduces `assert`s and removes this condition in the `if`-statement. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8492 Reviewed By: jay-zhuang Differential Revision: D29571275 Pulled By: bjlemaire fbshipit-source-id: 75d72246b70ebbbae7dea11ccb5778686d8bcbea	2021-07-06 18:48:56 -07:00
Baptiste Lemaire	9dc887ece0	Memtable "MemPurge" prototype (#8454 ) Summary: Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage. The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time . Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested). One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454 Reviewed By: anand1976 Differential Revision: D29433971 Pulled By: bjlemaire fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5	2021-07-02 05:23:02 -07:00
Akanksha Mahajan	c76778e2bd	Call OnCompactionCompleted API in case of DisableManualCompaction (#8469 ) Summary: Call OnCompactionCompleted API in case of DisableManualCompaction() with updated Status::Incomplete Pull Request resolved: https://github.com/facebook/rocksdb/pull/8469 Reviewed By: ajkr Differential Revision: D29475517 Pulled By: akankshamahajan15 fbshipit-source-id: a1726c5e6ee18c0b5097ea04f5e6975fbe108055	2021-07-01 19:18:55 -07:00
Jay Zhuang	93a7389442	Add statistics support on CompactionService remote side (#8368 ) Summary: Add statistics option on CompactionService remote side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8368 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D28944427 Pulled By: jay-zhuang fbshipit-source-id: 2a19217f4a69b6e511af87eed12391860ef00c5e	2021-06-29 11:48:14 -07:00
mrambacher	be219089ad	Add BlobMetaData retrieval methods (#8273 ) Summary: Added BlobMetaData to ColumnFamilyMetaData and LiveBlobMetaData and DB API GetLiveBlobMetaData to retrieve it. First pass at struct. More tests and maybe fields to come... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8273 Reviewed By: ltamasi Differential Revision: D29102400 Pulled By: mrambacher fbshipit-source-id: 8a2383a4446328be6b91dced9841fdd3dfc80b73	2021-06-28 08:13:29 -07:00
Zhichao Cao	a904c62d28	Using existing crc32c checksum in checksum handoff for Manifest and WAL (#8412 ) Summary: In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead. In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412 Test Plan: make check, add new testing cases. Reviewed By: anand1976 Differential Revision: D29151545 Pulled By: zhichao-cao fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772	2021-06-25 00:47:17 -07:00
Jay Zhuang	54d73d6429	Fix DeleteFilesInRange may cause inconsistent compaction error (#8434 ) Summary: `DeleteFilesInRange()` marks deleting files to `being_compacted` before deleting, which may cause ongoing compactions report corruption exception or ASSERT for debug build. Adding the missing `ComputeCompactionScore()` when `being_compacted` is set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8434 Test Plan: Unittest Reviewed By: ajkr Differential Revision: D29276127 Pulled By: jay-zhuang fbshipit-source-id: f5b223e3c1fc6d821e100e3f3442bc70c1d50cf7	2021-06-22 09:17:37 -07:00
Zhichao Cao	82a70e1470	Trace MultiGet Keys and CF_IDs to the trace file (#8421 ) Summary: Tracing the MultiGet information including timestamp, keys, and CF_IDs to the trace file for analyzing and replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8421 Test Plan: make check, add test to trace_analyzer_test Reviewed By: anand1976 Differential Revision: D29221195 Pulled By: zhichao-cao fbshipit-source-id: 30c677d6c39ab31ef4bbdf7e0d1fa1fd79f295ff	2021-06-18 15:04:05 -07:00
anand76	575ea26ec9	Don't log a warning if file system doesn't support ReopenWritableFile() (#8414 ) Summary: RocksDB logs a warning if WAL truncation on DB open fails. Its possible that on some file systems, truncation is not required and they would return ```Status::NotSupported()``` for ```ReopenWritableFile```. Don't log a warning in such cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8414 Reviewed By: akankshamahajan15 Differential Revision: D29181738 Pulled By: anand1976 fbshipit-source-id: 6e01e9117e1e4c1d67daa4dcee7fa59d06e057a7	2021-06-17 12:05:40 -07:00
mrambacher	d5bd0039b9	Rename ImmutableOptions variables (#8409 ) Summary: This is the next part of the ImmutableOptions cleanup. After changing the use of ImmutableCFOptions to ImmutableOptions, there were places in the code that had did something like "ImmutableOptions* immutable_cf_options", where "cf" referred to the "old" type. This change simply renames the variables to match the current type. No new functionality is introduced. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8409 Reviewed By: pdillinger Differential Revision: D29166248 Pulled By: mrambacher fbshipit-source-id: 96de97f8e743f5c5160f02246e3ed8269556dc6f	2021-06-16 16:51:38 -07:00
Peter Dillinger	b3dbeadc34	Fix double-dumping CF stats to log (#8380 ) Summary: DBImpl::DumpStats is supposed to do this: Dump DB stats to LOG For each CF, dump CFStatsNoFileHistogram to LOG For each CF, dump CFFileHistogram to LOG Instead, due to a longstanding bug from 2017 (https://github.com/facebook/rocksdb/issues/2126), it would dump CFStats, which includes both CFStatsNoFileHistogram and CFFileHistogram, in both loops, resulting in near-duplicate output. This fixes the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8380 Test Plan: Manual inspection of LOG after db_bench Reviewed By: jay-zhuang Differential Revision: D29017535 Pulled By: pdillinger fbshipit-source-id: 3010604c4a629a80347f129cd746ce9b0d0cbda6	2021-06-11 17:06:09 -07:00
Zhichao Cao	f44e69c64a	Use DbSessionId as cache key prefix when secondary cache is enabled (#8360 ) Summary: Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360 Test Plan: add the test to lru_cache_test Reviewed By: pdillinger Differential Revision: D29006215 Pulled By: zhichao-cao fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814	2021-06-10 11:02:43 -07:00
David Devecsery	80a59a03a7	Cancel compact range (#8351 ) Summary: Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure. Tested via two tests added to db_tests2.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351 Reviewed By: ajkr Differential Revision: D28808894 Pulled By: ddevec fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd	2021-06-07 11:41:31 -07:00
Peter (Stig) Edwards	c75ef03e58	Do not truncate WAL if in read_only mode (#8313 ) Summary: I noticed ```openat``` system call with ```O_WRONLY``` flag and ```sync_file_range``` and ```truncate``` on WAL file when using ```rocksdb::DB::OpenForReadOnly``` by way of ```db_bench --readonly=true --benchmarks=readseq --use_existing_db=1 --num=1 ...``` Noticed in ```strace``` after seeing the last modification time of the WAL file change after each run (with ```--readonly=true```). I think introduced by `7d7f14480e` from https://github.com/facebook/rocksdb/pull/8122 I added a test to catch the WAL file being truncated and the modification time on it changing. I am not sure if a mock filesystem with mock clock could be used to avoid having to sleep 1.1s. The test could also check the set of files is the same and that the sizes are also unchanged. Before: ``` [ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged db/db_basic_test.cc:182: Failure Expected equality of these values: file_mtime_after_readonly_reopen Which is: 1621611136 file_mtime_before_readonly_reopen Which is: 1621611135 file is: 000010.log [ FAILED ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms) ``` After: ``` [ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged [ OK ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8313 Reviewed By: pdillinger Differential Revision: D28656925 Pulled By: jay-zhuang fbshipit-source-id: ea9e215cb53e7c830e76bc5fc75c45e21f12a1d6	2021-05-27 10:27:55 -07:00

1 2 3 4 5 ...

539 commits