Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in, implemented as an extra pass at DB::Open
time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
check the unique ID against the manifest every time an SST file is opened through
the table cache (normal DB operations), replacing the explicit pass over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.
One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.
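For reference, a minimal configuration sketch (the option is a plain `DBOptions` field; shown because, as noted below, tests that corrupt table properties now need to opt out):
```
#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  // On by default as of this change; set to false only to opt out, e.g. in
  // tests that deliberately corrupt table properties.
  options.verify_sst_unique_id_in_manifest = true;
  return options;
}
```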
Implementation details:
* `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
that is now removed.
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) checking being on by default, and
(b) checking on SST open even without DB::Open (e.g., on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`
Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of NewTableReader was recently added, in
the public API no less (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532
Test Plan:
updated unit tests
Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec
Reviewed By: jay-zhuang
Differential Revision: D38765551
Pulled By: pdillinger
fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
Summary:
**Context:**
The crash test below revealed a bug that the directory containing the CURRENT file (abbreviated `dir_contains_current_file` below) did not always get synced after a new CURRENT was created via `RenameFile` as part of the creation.
This bug exposes the risk that such an un-synced directory containing the updated CURRENT can't survive a host crash (e.g., power loss) and hence gets corrupted, which is then followed by an unwanted recovery from a corrupted CURRENT.
The root cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed down to `SetCurrentFile()`, hence in those cases `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` is skipped (which would otherwise internally call `Env/FS::SyncDir()`)
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```
```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.
```
**Summary:**
The PR ensures the non-test path passes down a non-null dir containing CURRENT (which, by current RocksDB assumption, is just the db dir) by doing the following:
- Renamed `directory_to_fsync` to `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and the CURRENT file
- Changed the `SetCurrentFile()` API to require `dir_contains_current_file` to be passed in, instead of defaulting it to nullptr
- Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` through `VersionSet::ProcessManifestWrites()` (i.e., think of this as a chain of 3 functions related to MANIFEST updates), these 2 functions were also refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in a non-nullptr `dir_contains_current_file`, which by the current assumption of RocksDB is the `FSDirectory* db_dir`
- The `db_impl` path obtains `DBImpl::directories_.getDbDir()`, while callers with no access to `directories_` create one on the fly via `FileSystem::NewDirectory(..)` and manage it with a unique pointer to ensure a short lifetime, as sketched below
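A rough sketch of the on-the-fly case (an illustrative helper, not the actual call-site code; see the PR for the exact locations):
```
#include <memory>
#include <string>
#include "rocksdb/file_system.h"

// Sketch: obtain a non-null FSDirectory for the dir containing CURRENT when
// DBImpl::directories_ is not available. `fs` and `dbname` come from context.
rocksdb::IOStatus GetDirContainingCurrent(
    rocksdb::FileSystem* fs, const std::string& dbname,
    std::unique_ptr<rocksdb::FSDirectory>* dir) {
  return fs->NewDirectory(dbname, rocksdb::IOOptions(), dir, /*dbg=*/nullptr);
}
// The unique_ptr keeps the directory's lifetime short; SetCurrentFile() then
// syncs it via FSDirectory::FsyncWithDirOptions().
```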
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573
Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement: since we currently don't assert that the dir containing CURRENT is non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still a chance that future developers mistakenly pass down a nullptr dir containing CURRENT, resulting in a skipped dir sync and reintroducing the bug. Therefore a smarter test (e.g., as ajkr suggested, "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.
Reviewed By: ajkr
Differential Revision: D39005886
Pulled By: hx235
fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
Summary:
In UniversalCompactionBuilder::PickCompactionToReduceSortedRuns, we passed start_level to get the compression type and options. I think that is wrong; we should use output_level instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10515
Reviewed By: hx235
Differential Revision: D38611335
Pulled By: ajkr
fbshipit-source-id: bb860caed4b6c6bbde8f75fc50cf875a9f04723d
Summary:
The info LOG file does not currently give any direct
information about the existence of old, live snapshots, nor how to
estimate wall time from a sequence number within the scope of LOG
history. This change addresses both with:
* Logging smallest and largest seqnos for generated SST files, which can
help associate sequence numbers with write time (based on flushes).
* Logging oldest_snapshot_seqno for each compaction, which (along with
that seqno info) helps us determine how much old data might be kept
around for old (leaked?) snapshots. I thought including the date here
might be excessive.
I wanted to log the date and seqno of the oldest snapshot with periodic
stats, but the current structure of the code doesn't really support that
because `DumpDBStats` doesn't have access to the DB object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10454
Test Plan:
manually inspect LOG from
`KEEP_DB=1 ./db_basic_test --gtest_filter=*CompactBetweenSnapshots*`
Reviewed By: ajkr
Differential Revision: D38326948
Pulled By: pdillinger
fbshipit-source-id: 294918ffc04a419844146cd826045321b4d5c038
Summary:
New BlobDB has a bug in the compaction filter, where `blob_value_` is not reset for the next iterated key. This causes `blob_value_` to be non-empty, so the previous value read from a blob is passed into the filter function for the next key, even if that key's value is not in a blob. Fixed by resetting it regardless of key type.
Test Case:
Add `FilterByValueLength` test case in `DBBlobCompactionTest`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391
Reviewed By: riversand963
Differential Revision: D38629900
Pulled By: ltamasi
fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa
Summary:
The main purpose is to make debugging easier without sacrificing performance.
Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to an `uint8_t`,
using the LSB to denote whether the compaction iterator is valid and 4 additional bits to denote where
the iterator was set valid inside `NextFromInput()`. Therefore, when the control flow reaches
`PrepareOutput()` and hits an assertion there, we can have a better idea of what has gone wrong.
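A rough sketch of the bit packing (member and method names here are made up for illustration, not the actual `CompactionIterator` code):
```
#include <cstdint>

struct ValidityBits {
  uint8_t bits_ = 0;  // LSB: valid flag; bits 1-4: call-site id (0..15)

  void SetValid(uint8_t call_site) {  // called inside NextFromInput()
    bits_ = static_cast<uint8_t>((call_site << 1) | 1u);
  }
  void Invalidate() { bits_ = 0; }
  bool Valid() const { return (bits_ & 1u) != 0; }
  uint8_t CallSite() const { return bits_ >> 1; }  // inspect on assertion failure
};
```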
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505
Test Plan:
make check
```
TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush
```
The above command has a 'flush' benchmark which uses `CompactionIterator`. I haven't observed any CPU regression or drop in throughput or latency increase.
Reviewed By: ltamasi
Differential Revision: D38551615
Pulled By: riversand963
fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0
Summary:
Change the tiered compaction option from `bottommost_temperature` to
`last_level_temperature`. The old option is kept for migration purposes only;
it behaves the same as `last_level_temperature` and will be removed in
the next release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471
Test Plan: CI
Reviewed By: siying
Differential Revision: D38450621
Pulled By: jay-zhuang
fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
Summary:
The current universal compaction picker may cause extra size amplification
compaction if there is more hot data on the penultimate level. Improve the picker
to skip the last level for the size amp calculation if tiered compaction is
enabled, which can
1. avoid extra unnecessary size amp compaction;
2. skip size amp for the cold tier as intended, since the cold tier (the last
level) is typically not size constrained;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467
Test Plan: CI and added unittest
Reviewed By: siying
Differential Revision: D38391350
Pulled By: jay-zhuang
fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2
Summary:
During compaction, blobs are currently read using the default
`ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
this didn't make any difference since we didn't have a blob cache;
however, now we have to explicitly set this flag to false to avoid
polluting the cache during compaction.
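A minimal sketch of the idea (an illustrative helper; the actual change lives in the compaction code paths):
```
#include "rocksdb/options.h"

rocksdb::ReadOptions CompactionBlobReadOptions() {
  rocksdb::ReadOptions read_options;
  // Avoid polluting the blob cache with compaction-time reads.
  read_options.fill_cache = false;
  return read_options;
}
```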
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D38333528
Pulled By: ltamasi
fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2
Summary:
FileMetaData::[min|max]_timestamp are not currently being used or
tracked by RocksDB, even when user-defined timestamp is enabled. Each of
them is a std::string which can occupy 32 bytes. Remove them for now.
They may be added back when we have a pressing need for them. When we do
add them back, consider store them in a more compact way, e.g. one
boolean flag and a byte array of size 16.
Per file min/max timestamp bounds are available as table properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D38292275
Pulled By: riversand963
fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94
Summary:
The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues.
Issue 1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contain timestamps, which results in an assertion failure, as BlockBasedTableIterator expects bounds to be without timestamps. As a result, because of the wrong comparison, the end key is returned as a user_key, triggering the assertion failure.
Issue 2: It might result in two keys that differ only in the user timestamp getting processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correctness issues.
This PR provides support to re-enable subcompactions with user-defined timestamps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344
Test Plan:
Added new unit test
- Without the fix for Issue 1, the unit test MultipleSubCompactions fails with the error:
```
db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Received signal 6 (Aborted)
#0 /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530]
Aborted (core dumped)
```
Ran stress test
`make crash_test_with_ts -j32`
Reviewed By: riversand963
Differential Revision: D38220841
Pulled By: akankshamahajan15
fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb
Summary:
Allow sufficient subcompactions to be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority.
Test Case:
Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422
Reviewed By: ajkr
Differential Revision: D38186545
Pulled By: littlepig2013
fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e
Summary:
The earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, the round-robin compaction policy will expand towards more input files while respecting some additional constraints, which are summarized as follows:
* Constraint 1: We can only pick consecutive files
- Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
- Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
* Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
* Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
* Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`; a simplified sketch of the expansion loop follows.
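A minimal sketch under illustrative names (not the actual RocksDB code), showing Constraints 1a, 1b, and 2; `SimpleFile` is a stand-in for `FileMetaData`:
```
#include <cstddef>
#include <cstdint>
#include <vector>

struct SimpleFile {
  uint64_t size;
  bool being_compacted;
};

// Returns the number of consecutive files picked starting at `start`.
size_t PickConsecutiveFiles(const std::vector<SimpleFile>& files, size_t start,
                            uint64_t max_compaction_bytes) {
  uint64_t picked_bytes = 0;
  size_t end = start;
  while (end < files.size()) {      // Constraint 1b: stop at the last file
    const SimpleFile& f = files[end];
    if (f.being_compacted) break;   // Constraint 1a: stop at a busy file
    if (picked_bytes + f.size > max_compaction_bytes) break;  // Constraint 2
    picked_bytes += f.size;
    ++end;
  }
  return end - start;
}
```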
The above optimization accelerates the process of moving the compaction cursor, further reducing write-amp. While a large compaction may lead to a high write stall, we break such a large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
* Step 1: Initialize the planned number of subcompactions from `max_output_file_limit`, the number of input files in the start level, and the range size limit `ranges.size()`
* Step 2: Call `AcquireSubcompactionResources()` when max subcompactions is not sufficient; we may or may not obtain the desired resources, and the number of additional resources obtained is stored in `extra_num_subcompaction_threads_reserved_`. The subcompaction limit is changed and `num_planned_subcompactions` is updated with `GetSubcompactionLimit()`
* Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the actual number of subcompactions is less than the number of planned subcompactions)
More details can be found in `CompactionJob::AcquireSubcompactionResources()`, `CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
Reviewed By: ajkr, hx235
Differential Revision: D37792644
Pulled By: littlepig2013
fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
Summary:
Unit tests still haven't been fixed, and more tests need to be added. But I ran some simple fillrandom db_bench and the partitioning feels reasonable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393
Test Plan:
1. Make sure existing tests pass. This should cover that some basic subcompaction logic is correct and the partitioning result is reasonable;
2. Add a new unit test for ApproximateKeyAnchors();
3. Run some db_bench with max_subcompaction = 4 and watch that the compaction is indeed partitioned evenly.
Reviewed By: jay-zhuang
Differential Revision: D38043783
fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
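A usage sketch, assuming the feature is exposed via a `prepopulate_blob_cache` column family option mirroring the existing `prepopulate_block_cache` for block-based tables (check the PR for the final API):
```
#include "rocksdb/cache.h"
#include "rocksdb/options.h"

rocksdb::Options MakeBlobOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.blob_cache = rocksdb::NewLRUCache(1 << 30 /* 1 GiB */);
  // Insert blobs into the blob cache as they are written during flush.
  options.prepopulate_blob_cache = rocksdb::PrepopulateBlobCache::kFlushOnly;
  return options;
}
```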
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
Use the sequence-number-to-time mapping to decide whether a key is hot or not during
compaction and place it in the corresponding level.
Note: the feature is not complete; level compaction will run indefinitely until
all penultimate level data is cold and small enough to not trigger compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10370
Test Plan:
CI
* Run basic db_bench for universal compaction manually
Reviewed By: siying
Differential Revision: D37892338
Pulled By: jay-zhuang
fbshipit-source-id: 792bbd91b1ccc2f62b5d14c53118434bcaac4bbe
Summary:
This will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, this adds a seqno-to-time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time mapping at a certain cadence. When a
memtable is flushed, the mapping information is stored in an sstable property.
During compaction, the mapping information is merged to get the
approximate time of a sequence number, which is used to determine whether a key
was recently inserted (within `preclude_last_level_data_seconds`) and, if so,
preclude it from the last level.
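The mapping itself can be pictured as a small sorted table of samples; a conceptual sketch (illustrative only, not the actual seqno-to-time mapping implementation):
```
#include <cstdint>
#include <map>

class SeqnoToTimeSketch {
 public:
  // Recorded periodically by a background task: "seqno was current at time".
  void Record(uint64_t seqno, int64_t unix_time) { samples_[seqno] = unix_time; }

  // Upper-bound estimate of when `seqno` was written, or -1 if unknown.
  // Any sample with seqno' >= seqno was taken at or after the write.
  int64_t EstimateTime(uint64_t seqno) const {
    auto it = samples_.lower_bound(seqno);
    return it == samples_.end() ? -1 : it->second;
  }

 private:
  std::map<uint64_t, int64_t> samples_;  // seqno -> unix time
};
```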
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338
Test Plan: CI
Reviewed By: siying
Differential Revision: D37810187
Pulled By: jay-zhuang
fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
Summary:
InternalKeyComparator is an internal class which is a simple wrapper around Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizable. As a side effect, the internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator will have an std::vector attached to it, which consumes memory and possibly incurs allocation overhead too.
We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342
Test Plan: Run existing CI tests and make sure it doesn't fail
Reviewed By: riversand963
Differential Revision: D37771351
fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
Summary:
Support per_key_placement for last level compaction, which will
be used for tiered compaction.
* compaction iterator reports which level a key should output to;
* compaction gets the output level information and checks if it's safe to
output the data to the penultimate level;
* all compaction output files will be installed.
* extra internal compaction stats added for penultimate level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964
Test Plan:
* Unittest
* db_bench, no significant difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4
* microbench manual compaction, no significant difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd
* run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m)
Reviewed By: ajkr
Differential Revision: D36249494
Pulled By: jay-zhuang
fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3
Summary:
InternalKeyComparator is a thin wrapper around the user comparator. Storing a string for the name is relatively expensive for this small wrapper in both CPU and memory usage. Try to remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10343
Test Plan: Run existing tests
Reviewed By: ajkr
Differential Revision: D37772469
fbshipit-source-id: d2d106a8d022193058fd7f6b220108e3d94aca34
Summary:
The earlier implementation of cutting the output files with a compact cursor under round-robin priority uses `Valid()` to determine whether the `output_split_key` is valid in `ShouldStopBefore`. This contributes to excessive CPU computation, as pointed out by [this issue](https://github.com/facebook/rocksdb/issues/10315). In this PR, we change the type of `output_split_key` to `InternalKey*` and set it to `nullptr` if it is not going to be used in `ShouldStopBefore`, so the `Valid()` condition check can be avoided by testing that pointer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10316
Reviewed By: ajkr
Differential Revision: D37661492
Pulled By: littlepig2013
fbshipit-source-id: 66ff1105f3378e5573d3a126fdaff9bb23b5498f
Summary:
In leveled compaction, try to trivially move more than one file if possible, up to 4 files or max_compaction_bytes. This allows higher write throughput for some use cases where data is loaded in sequential order and applying compaction results is the bottleneck.
When picking a file to compact that doesn't have overlapping files in the next level, try to expand to the next file if there is still no overlap.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190
Test Plan:
Add some unit tests.
For performance, try running
./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark.
Reviewed By: jay-zhuang
Differential Revision: D37230647
fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
Summary:
In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This allows L0 files to be moved down faster when data is loaded in sequential order, making the slowdown or stop condition harder to hit. Also, seek an L0->L1 trivial move when only some files qualify:
1. We always try to find an L0->L1 trivial move starting from the oldest files. Keep including newer files until adding a new file won't trigger a trivial move
2. Modify the trivial move condition so that this compaction is tagged as a trivial move.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188
Test Plan:
See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:
./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
The throughput improved by about 50%. Stalling still happens though.
Reviewed By: jay-zhuang
Differential Revision: D37224743
fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
Summary:
In round-robin compaction priority, when splitting the compaction into subcompactions, the earlier implementation takes the compact cursor into account to make full use of available subcompactions. But this may result in unbalanced subcompactions, so we remove that here. The removal does not affect the cursor-based splitting mechanism within a subcompaction, so the output files are still ensured to be split according to the cursor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289
Reviewed By: ajkr
Differential Revision: D37559091
Pulled By: littlepig2013
fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.
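A write-side usage sketch based on the description above (exact signatures may differ; see `include/rocksdb/wide_columns.h` in the PR for the authoritative definitions):
```
#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

rocksdb::Status WriteEntity(rocksdb::DB* db) {
  // A wide-column entity: a key with multiple named column values.
  rocksdb::WideColumns columns{{"name", "alice"}, {"age", "30"}};
  return db->PutEntity(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                       "user1", columns);
}
```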
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D37369748
Pulled By: ltamasi
fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
Summary:
Currently SortFileByOverlappingRatio() is O(n log n). That is usually OK, but when there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take a non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compaction consists only of trivial moves; this operation then becomes the bottleneck and limits the total throughput. This commit makes SortFileByOverlappingRatio() only find the top 50 files based on score. 50 files are usually enough for the parallel compactions needed at the level, and in case they are not, we fall back to random selection, which should be acceptable.
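The top-50 selection can be done with a partial sort rather than a full sort; a simplified sketch of the idea (illustrative names and types, not the actual code):
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ScoredFile {
  uint64_t score;  // overlapping-ratio-based score; lower sorts first here
};

void SortTopCandidates(std::vector<ScoredFile>& files) {
  constexpr size_t kTopN = 50;
  const size_t n = std::min(kTopN, files.size());
  // O(n_total * log kTopN) instead of O(n_total * log n_total).
  std::partial_sort(files.begin(), files.begin() + n, files.end(),
                    [](const ScoredFile& a, const ScoredFile& b) {
                      return a.score < b.score;
                    });
  // Only files[0..n) are ordered; compaction picks from these.
}
```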
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161
Test Plan:
Run a fillseq that generates a lot of files, and observe throughput improved (although stall is not yet eliminated). The command ran:
TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000
The throughput improved by 11%.
Reviewed By: jay-zhuang
Differential Revision: D37129469
fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
Summary:
The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level’s cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two – one entirely before the cursor and one entirely after the cursor. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move since the file will not be contaminated by older files. In such case, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227
Test Plan:
Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`
Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)
Reviewed By: jay-zhuang
Differential Revision: D37388088
Pulled By: littlepig2013
fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
Summary:
Add `kRoundRobin` as a compaction priority. The implementation is as follows.
- Define a cursor as the smallest internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` to `VersionStorageInfo`, where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In the round-robin compaction policy, we just need to select the first file (assuming files are sorted) that has the smallest InternalKey larger than or equal to the cursor. After a file is chosen, we create a new `Fsize` vector which places the selected file at the first position in `temp`; the next cursor is then updated as the smallest InternalKey in the successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`).
- After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function.
- Cursors are persisted pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to the MANIFEST
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107
Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`
Reviewed By: ajkr
Differential Revision: D37316037
Pulled By: littlepig2013
fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduce the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, the secondary cache, or (remote) storage. Depending on user settings, it always fetches blobs from the multi-tier cache and storage with minimal cost.
Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
Reviewed By: ltamasi
Differential Revision: D37294735
Pulled By: gangliao
fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
Summary:
In the CompactionIterator code, there are multiple places where the process
will abort in dbg mode before logging the error message describing the
cause. This PR changes only the logging behavior for the compaction iterator so
that the error message is written to LOG before the process aborts in debug
mode.
Also updated the triggering condition for an assertion for single delete with
user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183
Test Plan: make check
Reviewed By: akankshamahajan15
Differential Revision: D37190218
Pulled By: riversand963
fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009
Summary:
Add unit tests to verify that the dynamic priority can be passed from compaction to FS. Compaction reads&writes and other DB reads&writes share the same read&write paths to FSRandomAccessFile or FSWritableFile, so a MockTestFileSystem is added to replace the default filesystem from Env to intercept and verify the io_priority. To prepare the compaction input files, use the default filesystem from Env. To test the io priority of the compaction reads and writes, db_options_.fs is set as MockTestFileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10088
Test Plan: Add unit tests.
Reviewed By: anand1976
Differential Revision: D36882528
Pulled By: gitbw95
fbshipit-source-id: 120adc15801966f2b8c9fc45285f590a3fff96d1
Summary:
As pointed out by [https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422](https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422), checks of `manual_compaction_paused` and `manual_compaction_canceled` can be reduced by setting `*canceled` to true in `DisableManualCompaction()` and `*canceled` to false in the last call to `EnableManualCompaction()`.
Changed Tests: The original `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused`, and the original CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` to true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator that the compaction has been canceled).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070
Test Plan: This change does not introduce new features, but some slight difference in compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test).
Reviewed By: ajkr
Differential Revision: D36949133
Pulled By: littlepig2013
fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
In order to achieve this, we would like to do the following (a usage sketch of the new option follows the list):
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.
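A configuration sketch of the new option (default 0, per the description above; field name as given in the PR):
```
#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  // Extract large values to blob files only for compactions whose output
  // level is >= 2; flushes (L0) and L1 outputs keep values inline.
  options.blob_file_starting_level = 2;
  return options;
}
```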
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
Reviewed By: ltamasi
Differential Revision: D36884156
Pulled By: gangliao
fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
Summary:
Garbage collection is generally controlled by the BlobDB configuration options `enable_blob_garbage_collection` and `blob_garbage_collection_age_cutoff`. However, there might be use cases where we would want to temporarily override these options while performing a manual compaction. (One use case would be doing a full key-space manual compaction with a full (100%) garbage collection age cutoff in order to minimize the space occupied by the database.) Our goal here is to make it possible to override the configured GC parameters when using the `CompactRange` API to perform manual compactions (a usage sketch follows the list). This PR would involve:
- Extending the `CompactRangeOptions` structure so clients can both force-enable and force-disable GC, as well as use a different cutoff than what's currently configured
- Storing whether blob GC should actually be enabled during a certain manual compaction and the cutoff to use in the `Compaction` object (considering the above overrides) and passing it to `CompactionIterator` via `CompactionProxy`
- Updating the BlobDB wiki to document the new options.
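A usage sketch of the overrides (field and enum names as described in the PR; check `include/rocksdb/options.h` for the final form):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Status FullCompactWithGC(rocksdb::DB* db) {
  rocksdb::CompactRangeOptions cro;
  // Force-enable blob GC for this manual compaction only, with a 100% age
  // cutoff, to minimize the space occupied by the database.
  cro.blob_garbage_collection_policy =
      rocksdb::BlobGarbageCollectionPolicy::kForce;
  cro.blob_garbage_collection_age_cutoff = 1.0;
  return db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
}
```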
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10073
Test Plan: Adding unit tests and adding the new options to the stress test tool.
Reviewed By: ltamasi
Differential Revision: D36848700
Pulled By: gangliao
fbshipit-source-id: c878ef101d1c612429999f513453c319f75d78e9
Summary:
Info logging while holding the DB mutex could potentially invoke I/O and cause performance issues. Move three of the cases to use the log buffer instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10029
Test Plan: Run existing tests.
Reviewed By: jay-zhuang
Differential Revision: D36561694
fbshipit-source-id: cabb93fea299001a6b4c2802fcba3fde27fa062c
Summary:
Start tracking the SST unique id in MANIFEST, which is used to verify against
the SST properties to make sure the SST file is not overwritten or
misplaced. A DB option `try_verify_sst_unique_id` is introduced to
enable/disable the verification; if enabled, it opens all SST files
during DB-open to read the unique_id from table properties (default is
false), so it's recommended to use it with `max_open_files = -1` to
pre-open the files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990
Test Plan: unittests, format-compatible test, mini-crash
Reviewed By: anand1976
Differential Revision: D36381863
Pulled By: jay-zhuang
fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
### Solution
User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile.
**This PR is for the Read path.** The dynamic **Read** priority for different states is listed as follows:
| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
| Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER |
| Compaction | IO_LOW | IO_USER | IO_USER |
| User | User provided | User provided | User provided |
We will respect the read_options that the user provided and will not set it.
The only SST read for Flush is the verification read in BuildTable(). It claims to be "regarded as a user read".
**Details**
1. Set read_options.rate_limiter_priority dynamically (a sketch of the priority choice follows this list):
- User: Do not update the read_options. Use the read_options that the user provided.
- Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction().
- Flush: Update read_options in BuildTable().
2. Pass the rate limiter priority to FSRandomAccessFile functions:
- After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions, including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock().
- In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch().
- Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator().
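A condensed sketch of the compaction-side priority choice (illustrative only; the actual logic lives in `CompactionJob::ProcessKeyValueCompaction()`, and `write_stalled_or_delayed` here stands in for the WriteController state):
```
#include "rocksdb/env.h"
#include "rocksdb/options.h"

rocksdb::ReadOptions MakeCompactionReadOptions(bool write_stalled_or_delayed) {
  rocksdb::ReadOptions read_options;
  // Per the table above: IO_LOW normally, IO_USER when delayed/stalled.
  read_options.rate_limiter_priority =
      write_stalled_or_delayed ? rocksdb::Env::IO_USER : rocksdb::Env::IO_LOW;
  return read_options;
}
```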
### Test Plans
Add unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996
Reviewed By: anand1976
Differential Revision: D36452483
Pulled By: gitbw95
fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
From the RocksDB perspective, there can be two kinds of rate limiters, the internal (native) one and the external one.
- The internal (native) rate limiter is introduced in [the wiki](https://github.com/facebook/rocksdb/wiki/Rate-Limiter). Currently, only IO_LOW and IO_HIGH are used and they are set statically.
- For the external rate limiter, in FSWritableFile functions, IOOptions is open for end users to set and get rate_limiter_priority for their own rate limiter. Currently, RocksDB doesn’t pass the rate_limiter_priority through IOOptions to the file system.
### Solution
During the User Read, Flush write, Compaction read/write, the WriteController is used to determine whether DB writes are stalled or slowed down. The rate limiter priority (Env::IOPriority) can be determined accordingly. We decided to always pass the priority in IOOptions. What the file system does with it should be a contract between the user and the file system. We would like to set the rate limiter priority at file level, since the Flush/Compaction job level may be too coarse with multiple files and block IO level is too granular.
**This PR is for the Write path.** The dynamic **Write** priority for different states is listed as follows:
| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
| Flush | IO_HIGH | IO_USER | IO_USER |
| Compaction | IO_LOW | IO_USER | IO_USER |
Flush and Compaction writes share the same call path through BlockBaseTableWriter, WritableFileWriter, and FSWritableFile. When a new FSWritableFile object is created, its io_priority_ can be set dynamically based on the state of the WriteController. In WritableFileWriter, before the call sites of FSWritableFile functions, WritableFileWriter::DecideRateLimiterPriority() determines the rate_limiter_priority. The options (IOOptions) argument of FSWritableFile functions will be updated with the rate_limiter_priority.
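A condensed sketch of the decision table above (illustrative only, not the actual `WritableFileWriter::DecideRateLimiterPriority()` code):
```
#include "rocksdb/env.h"

// Maps the table above to a priority: Flush -> IO_HIGH, Compaction -> IO_LOW,
// and either one -> IO_USER when writes are delayed or stalled.
rocksdb::Env::IOPriority DecideWritePriority(bool is_flush,
                                             bool delayed_or_stalled) {
  if (delayed_or_stalled) return rocksdb::Env::IO_USER;
  return is_flush ? rocksdb::Env::IO_HIGH : rocksdb::Env::IO_LOW;
}
```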
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9988
Test Plan: Add unit tests.
Reviewed By: anand1976
Differential Revision: D36395159
Pulled By: gitbw95
fbshipit-source-id: a7c82fc29759139a1a07ec46c37dbf7e753474cf
Summary:
PR https://github.com/facebook/rocksdb/issues/9888 started to enforce the contract of single delete described in https://github.com/facebook/rocksdb/wiki/Single-Delete.
For some existing use cases, it is desirable to have a transition period during which compaction will not fail
if the contract is violated. Therefore, we add a temporary option `enforce_single_del_contracts` to allow
applications to opt out of this new strict behavior. Once the transition completes, the flag can be set to `true` again.
In a future release, the option will be removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9983
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D36333672
Pulled By: riversand963
fbshipit-source-id: dcb703ea0ed08076a1422f1bfb9914afe3c2caa2
Summary:
Add methods to set the various functions (Parse, Serialize, Equals) to the OptionTypeInfo. These methods simplify the number of constructors required for OptionTypeInfo and make the code a little clearer.
Add functions to the OptionTypeInfo for Prepare and Validate. These methods allow types other than Configurable and Customizable to have Prepare and Validate logic. These methods could be used by an option to guarantee that its settings were in a range or that a value was initialized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9411
Reviewed By: pdillinger
Differential Revision: D36174849
Pulled By: mrambacher
fbshipit-source-id: 72517d8c6bab4723788a4c1a9e16590bff870125
Summary:
ToString() was created because some platforms don't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just removes ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. We now only support VS 2017 and up, so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. Since we now require C++17, it's hard to imagine MinGW would still build RocksDB but not support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
PR 9929 adds a new CompactionFilter::Decision, i.e.
kRemoveWithSingleDelete, so that CompactionFilter can indicate to
CompactionIterator that a PUT can only be removed with an SD. However, how
CompactionIterator handles such a key is an implementation detail which
should not be implied in the public API. In fact,
such a PUT can just be dropped. This is an optimization which we will apply in the near future.
Discussion thread: https://github.com/facebook/rocksdb/pull/9929#discussion_r863198964
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9951
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D36156590
Pulled By: riversand963
fbshipit-source-id: 7b7d01f47bba4cad7d9cca6ca52984f27f88b372
Summary:
When the compaction filter determines that a key should be removed, it updates the internal key's type
to `Delete`. If this internal key is preserved in the current compaction but seen by a later compaction
together with a `SingleDelete`, it will cause the compaction iterator to return Corruption.
To fix the issue, compaction filter should return more information in addition to the intention of removing
a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
`kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.
In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.
Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
not quite work with `MultiOpsTxnsStressTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929
Test Plan:
make check
make crash_test
make crash_test_with_txn
Reviewed By: anand1976
Differential Revision: D36069678
Pulled By: riversand963
fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
Summary:
Enforce the contract of SingleDelete so that it is not mixed with
Delete for the same key. Otherwise, it will lead to undefined behavior.
See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes.
Also fix unit tests and write-unprepared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D35837817
Pulled By: riversand963
fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635
Summary:
In `FileMetaData`, we keep track of the lowest-numbered blob file
referenced by the SST file in question for the purposes of BlobDB's
garbage collection in the `oldest_blob_file_number` field, which is
updated in `UpdateBoundaries`. However, with the current code,
`BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
in this method. The patch changes this by propagating these errors
and failing the corresponding flush/compaction. (Note that since blob
references are generated by the BlobDB code and also parsed by
`CompactionIterator`, in reality this can only happen in the case of
memory corruption.)
This change necessitated updating some unit tests that involved
fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
`"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
Some were relying on the earlier behavior to simulate corruption; these
were replaced with `SyncPoint`-based test code that corrupts a valid
blob reference at read time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D35683671
Pulled By: ltamasi
fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**
`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:
```
inline bool KeyCommitted(SequenceNumber seq) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) ==
             SnapshotCheckerResult::kInSnapshot;
}
```
With that being said, we focus on write-prepared/write-unprepared transactions.
A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.
Due to the above reasons, it is possible for `CompactionIterator` to decide to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key may have become
committed because `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit an assertion error if the key is a tombstone.
To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.
```
inline bool KeyCommitted(SequenceNumber sequence) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
             SnapshotCheckerResult::kInSnapshot;
}
```
As a result, whether a key is committed or not will remain constant throughout the compaction, causing no trouble
for `CompactionIterator`'s assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D35561162
Pulled By: riversand963
fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
Summary:
Add the ability to cancel remote compaction on the remote side by
setting `OpenAndCompactOptions.canceled` to true.
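A usage sketch (the `canceled` field is assumed here to point at an atomic flag, per the description above; see the PR for the exact type):
```
#include <atomic>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

std::atomic<bool> canceled{false};

void ConfigureCancellation(rocksdb::OpenAndCompactOptions* opts) {
  opts->canceled = &canceled;  // the remote compaction polls this flag
}

void CancelRemoteCompaction() {
  // Called from another thread to abort the in-flight remote compaction.
  canceled.store(true, std::memory_order_relaxed);
}
```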
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9725
Test Plan: added unittest
Reviewed By: ajkr
Differential Revision: D35018800
Pulled By: jay-zhuang
fbshipit-source-id: be3652f9645e0347df429e42a5614d5a9b3a1ec4
Summary:
This allows the user to set an event listener on the compactor
side.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9821
Test Plan: unittest added
Reviewed By: ajkr
Differential Revision: D35485388
Pulled By: jay-zhuang
fbshipit-source-id: 669d8a3aaee012b75b940470306756c03ffa09b2
Summary:
Options `preserve_deletes` and `iter_start_seqnum` have been removed since 7.0.
This PR removes dead code related to these two removed options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9825
Test Plan: make check
Reviewed By: akankshamahajan15
Differential Revision: D35517950
Pulled By: riversand963
fbshipit-source-id: 86282ce5ec4087acb94a06a42a1b6d55b1715482
Summary:
When subcompactions are decided on for an L0->L1 compaction, in most cases all L0 files will be involved in all subcompactions. However, that is not always the case. When files are generally (but not strictly) inserted in sequential order, only a subset of L0 files may be involved. Yet RocksDB always opens all those L0 files and builds an iterator, reading many of the files' first or last blocks with expensive readahead. We trim some input files to reduce overhead a little bit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9802
Test Plan: Add a unit test to cover this case and manually validate the behavior while running the test.
Reviewed By: ajkr
Differential Revision: D35371031
fbshipit-source-id: 701ed7375b5cbe41672e93b38fe8a1503dad08b6
Summary:
Multiplier here should be 1e6 to get microseconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9695
Reviewed By: ajkr
Differential Revision: D34897086
Pulled By: jay-zhuang
fbshipit-source-id: 9c1d0811ea740ba0a007edc2da199edbd000b88b
Summary:
https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition, which was waiting for the background thread to clean up the compaction.
Make sure we only unschedule the task when it's actually scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659
Reviewed By: ajkr
Differential Revision: D34651820
Pulled By: jay-zhuang
fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
Summary:
Timer could crash when multiple DB instances were doing heavy DB open and close
operations concurrently. This was caused by adding a timer task with a
smaller timestamp than the currently running task. Fix it by moving the
computation of the new task's timestamp within the timer mutex protection.
And other fixes:
- Disallow adding a duplicated function name to the timer
- Fix a minor memory leak in the timer when a running task is cancelled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656
Reviewed By: ajkr
Differential Revision: D34626296
Pulled By: jay-zhuang
fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
Summary:
As discussed in (https://github.com/facebook/rocksdb/issues/9223), here we add a new API named DB::OpenAndTrimHistory. This API will open the DB and trim data to the timestamp specified by **trim_ts** (data with a timestamp newer than the specified trim bound will be removed). This API should only be used for recovery of a timestamp-enabled DB instance.
This PR also implements a new iterator named HistoryTrimmingIterator to support trimming history for the new DB::OpenAndTrimHistory API. HistoryTrimmingIterator wraps the underlying InternalIterator such that keys whose timestamps are newer than **trim_ts** are not returned to the compaction iterator while **trim_ts** is not null.
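A sketch of the call pattern (the parameter list is assumed to mirror the column-family `DB::Open` overload with the trim bound appended; the column family comparator must support user-defined timestamps):
```
#include <string>
#include <vector>
#include "rocksdb/db.h"

rocksdb::Status RecoverTrimmed(const rocksdb::DBOptions& db_options,
                               const rocksdb::ColumnFamilyOptions& cf_options,
                               const std::string& dbname,
                               const std::string& trim_ts,
                               rocksdb::DB** dbptr) {
  std::vector<rocksdb::ColumnFamilyDescriptor> column_families = {
      {rocksdb::kDefaultColumnFamilyName, cf_options}};
  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  // Data with timestamps newer than trim_ts is removed during recovery.
  return rocksdb::DB::OpenAndTrimHistory(db_options, dbname, column_families,
                                         &handles, dbptr, trim_ts);
}
```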
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410
Reviewed By: ltamasi
Differential Revision: D34410207
Pulled By: riversand963
fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236
Summary:
- Make `compression_per_level` dynamically changeable with `SetOptions` (see the sketch below);
- Fix a bug that `compression_per_level` was not used for flush;
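A small sketch of the dynamic update (assuming the usual colon-separated option-string encoding for vector-valued options):
```
#include <unordered_map>
#include "rocksdb/db.h"

rocksdb::Status UpdateCompressionPerLevel(rocksdb::DB* db) {
  // Applies to subsequent flushes/compactions that pick up the
  // mutable column family options.
  return db->SetOptions(
      {{"compression_per_level",
        "kNoCompression:kNoCompression:kLZ4Compression:kZSTD"}});
}
```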
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D34700749
Pulled By: jay-zhuang
fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c
Summary:
[Compaction::IsTrivialMove](a2b9be42b6/db/compaction/compaction.cc (L318)) checks whether allow_trivial_move is set, and if so it returns the value of is_trivial_move_. The allow_trivial_move option is there for universal compaction. So when this option is set and leveled compaction is enabled, useful code that follows this block never gets a chance to run.
A check that [compaction_style == kCompactionStyleUniversal](320d9a8e8a/db/db_impl/db_impl_compaction_flush.cc (L1030)) should be added to avoid doing the wrong thing for leveled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9586
Test Plan:
To reproduce this:
First edit db/compaction/compaction.cc with
```
diff --git a/db/compaction/compaction.cc b/db/compaction/compaction.cc
index 7ae50b91e..52dd489b1 100644
--- a/db/compaction/compaction.cc
+++ b/db/compaction/compaction.cc
@@ -319,6 +319,8 @@ bool Compaction::IsTrivialMove() const {
// input files are non overlapping
if ((mutable_cf_options_.compaction_options_universal.allow_trivial_move) &&
(output_level_ != 0)) {
+ printf("IsTrivialMove:: return %d because universal allow_trivial_move\n", (int) is_trivial_move_);
+ // abort();
return is_trivial_move_;
}
```
And then run
```
./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1641328309 --universal_allow_trivial_move=1
```
Example output with the debug code added
```
IsTrivialMove:: return 0 because universal allow_trivial_move
IsTrivialMove:: return 0 because universal allow_trivial_move
```
After this PR, the bug is fixed.
Reviewed By: ajkr
Differential Revision: D34350451
Pulled By: gitbw95
fbshipit-source-id: 3232005cc47c40a7e75d316cfc7960beb5bdff3a
Summary:
RocksDB tries to provide temperature information in the event
listener callbacks. The information is not guaranteed, as some operations,
like backup, won't have this information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9591
Test Plan: Added unittest
Reviewed By: siying, pdillinger
Differential Revision: D34309339
Pulled By: jay-zhuang
fbshipit-source-id: 4aca4f270f99fa49186d85d300da42594663d6d7
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
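For example, a point lookup can be charged against the rate limiter like this (a minimal sketch; any priority other than `Env::IO_TOTAL` opts the operation's file reads into rate limiting):
```
#include <string>
#include "rocksdb/db.h"

rocksdb::Status RateLimitedGet(rocksdb::DB* db, const std::string& key,
                               std::string* value) {
  rocksdb::ReadOptions read_options;
  // Charge this operation's file reads at user priority.
  read_options.rate_limiter_priority = rocksdb::Env::IO_USER;
  return db->Get(read_options, key, value);
}
```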
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace that reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
Remove deprecated remote compaction APIs
`CompactionService::Start()` and `CompactionService::WaitForComplete()`.
Please use `CompactionService::StartV2()`,
`CompactionService::WaitForCompleteV2()` instead, which provides the
same information plus extra data like priority, db_id, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9570
Test Plan: CI
Reviewed By: riversand963
Differential Revision: D34255969
Pulled By: jay-zhuang
fbshipit-source-id: c6376eccdd1123f1c42ab53771b5f65f8160c325
Summary:
After https://github.com/facebook/rocksdb/issues/9515 added a unique_ptr to Status, we see some
warnings-as-error in some internal builds like this:
```
stderr: rocksdb/src/db/compaction/compaction_job.cc:2839:7: error:
offset of on non-standard-layout type 'struct CompactionServiceResult'
[-Werror,-Winvalid-offsetof]
{offsetof(struct CompactionServiceResult, status),
^ ~~~~~~
```
I see three potential solutions to resolving this:
* Expand our use of an idiom that works around the warning (see offset_of
functions removed in this change, inspired by
https://gist.github.com/graphitemaster/494f21190bb2c63c5516) However,
this construction is invoking undefined behavior that assumes consistent
layout with no compiler-introduced indirection. A compiler incompatible
with our assumptions will likely compile the code and exhibit undefined
behavior.
* Migrate to something in place of offset, like a function mapping
CompactionServiceResult* to Status* (for the `status` field). This might
be required in the long term.
* **Selected:** Use our new C++17 dependency to use offsetof in a well-defined way
when the compiler allows it. From a comment on
https://gist.github.com/graphitemaster/494f21190bb2c63c5516:
> A final note: in C++17, offsetof is conditionally supported, which
> means that you can use it on any type (not just standard layout
> types) and the compiler will error if it can't compile it correctly.
> That appears to be the best option if you can live with C++17 and
> don't need constexpr support.
The C++17 semantics are confirmed on
https://en.cppreference.com/w/cpp/types/offsetof, so we can suppress the
warning as long as we accept that we might run into a compiler that
rejects the code, and at that point we will find a solution, such as
the more intrusive "migrate" solution above.
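For illustration, a stand-alone sketch (not RocksDB code) of what conditionally-supported `offsetof` looks like on a non-standard-layout type:
```
#include <cstddef>
#include <string>

struct Result {
  std::string status;  // non-trivial member => non-standard-layout type
  int value;
};

// In C++17, this is "conditionally supported": the compiler either
// computes a correct offset or rejects the code outright, so suppressing
// -Winvalid-offsetof is safe (unlike the hand-rolled offset_of idiom,
// which silently relied on undefined behavior).
static const std::size_t kStatusOffset = offsetof(Result, status);
```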
Although this is currently only showing in our buck build, it will
surely show up also with make and cmake, so I have updated those
configurations as well.
Also in the buck build, -Wno-expansion-to-defined does not appear to be
needed anymore (both current compiler configurations) so I
removed it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9563
Test Plan: Tried out buck builds with both current compiler configurations
Reviewed By: riversand963
Differential Revision: D34220931
Pulled By: pdillinger
fbshipit-source-id: d39436008259bd1eaaa87c77be69fb2a5b559e1f
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.
In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.
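The lookup is essentially a binary search over the sorted `vector`; a self-contained sketch of the pattern (names modeled loosely on the new helpers, not their exact signatures):
```
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

struct BlobFileMetaData {
  uint64_t blob_file_number;
};

using BlobFiles = std::vector<std::shared_ptr<BlobFileMetaData>>;

// Returns the metadata for blob_file_number, or nullptr if not present.
std::shared_ptr<BlobFileMetaData> GetBlobFileMetaData(
    const BlobFiles& blob_files, uint64_t blob_file_number) {
  auto it = std::lower_bound(
      blob_files.begin(), blob_files.end(), blob_file_number,
      [](const std::shared_ptr<BlobFileMetaData>& lhs, uint64_t rhs) {
        return lhs->blob_file_number < rhs;
      });
  if (it != blob_files.end() && (*it)->blob_file_number == blob_file_number) {
    return *it;
  }
  return nullptr;
}
```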
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526
Test Plan:
Ran `make check` and the crash test script for a while.
Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:
```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```
Final statistics before the patch:
```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```
With the patch:
```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```
Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.
Reviewed By: riversand963
Differential Revision: D34082728
Pulled By: ltamasi
fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
Summary:
The patch does some cleanup in and around `VersionStorageInfo`:
* Renames the method `PrepareApply` to `PrepareAppend` in `Version`
to make it clear that it is to be called before appending the `Version` to
`VersionSet` (via `AppendVersion`), not before applying any `VersionEdit`s.
* Introduces a helper method `VersionStorageInfo::PrepareForVersionAppend`
(called by `Version::PrepareAppend`) that encapsulates the population of the
various derived data structures in `VersionStorageInfo`, and turns the
methods computing the derived structures (`UpdateNumNonEmptyLevels`,
`CalculateBaseBytes` etc.) into private helpers.
* Changes `Version::PrepareAppend` so it only calls `UpdateAccumulatedStats`
if the `update_stats` flag is set. (Earlier, this was checked by the callee.)
Related to this, it also moves the call to `ComputeCompensatedSizes` to
`VersionStorageInfo::PrepareForVersionAppend`.
* Updates and cleans up `version_builder_test`, `version_set_test`, and
`compaction_picker_test` so `PrepareForVersionAppend` is called anytime
a new `VersionStorageInfo` is set up or saved. This cleanup also involves
splitting `VersionStorageInfoTest.MaxBytesForLevelDynamic`
into multiple smaller test cases.
* Fixes up a bunch of comments that were outdated or just plain incorrect.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9494
Test Plan: Ran `make check` and the crash test script for a while.
Reviewed By: riversand963
Differential Revision: D33971666
Pulled By: ltamasi
fbshipit-source-id: fda52faac7783041126e4f8dec0fe01bdcadf65a
Summary:
Crash test recently started showing failures as in https://github.com/facebook/rocksdb/issues/9118 but
for files created by compaction. This change applies a similar fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9480
Test Plan:
Updated / extended unit test. (Some re-arranging to do the
simpler compaction testing before this special case.)
Reviewed By: ltamasi
Differential Revision: D33909835
Pulled By: pdillinger
fbshipit-source-id: 58e4b44e4ecc2d21e4df2c2d8440ec0633aa1f6c
Summary:
Fixes a major performance regression in 6.26, where
extra CPU is spent in SliceTransform::AsString when reads involve
a prefix_extractor (Get, MultiGet, Seek). Common case performance
is now better than 6.25.
This change creates a "fast path" for verifying that the current prefix
extractor is unchanged and compatible with what was used to
generate a table file. This fast path detects the common case by
pointer comparison on the current prefix_extractor and a "known
good" prefix extractor (if applicable) that is saved at the time the
table reader is opened. The "known good" prefix extractor is saved
as another shared_ptr copy (in an existing field, however) to ensure
the pointer is not recycled.
When the prefix_extractor has changed to a different instance but
same compatible configuration (rare, odd), performance is still a
regression compared to 6.25, but this is likely acceptable because
of the oddity of such a case. The performance of incompatible
prefix_extractor is essentially unchanged.
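Conceptually, the fast path boils down to something like this (a simplified, self-contained sketch, not the actual table reader code):
```
#include <memory>
#include <string>

struct SliceTransform {
  virtual ~SliceTransform() = default;
  virtual std::string AsString() const = 0;  // serialized configuration
};

// known_good is an extra shared_ptr copy saved when the table reader was
// opened, so the pointer cannot be recycled by a concurrent SetOptions.
bool PrefixExtractorCompatible(
    const std::shared_ptr<const SliceTransform>& known_good,
    const std::shared_ptr<const SliceTransform>& current) {
  if (current.get() == known_good.get()) {
    return true;  // common case: same instance, no string comparison
  }
  // Rare case: different instance; fall back to comparing configurations.
  return current != nullptr && known_good != nullptr &&
         current->AsString() == known_good->AsString();
}
```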
Also fixed a minor case (ForwardIterator) where a prefix_extractor
could be used via a raw pointer after being freed as a shared_ptr,
if replaced via SetOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407
Test Plan:
## Performance
Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load)
v6.26: 4833 vs. 6698 (<- major regression!)
v6.27: 4737 vs. 6397 (still)
New: 6704 vs. 6461 (better than baseline in common case)
Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible)
Changed prefix size (no usable filter) in new: 787 vs. 5927
Changed prefix size (no usable filter) in new & baseline: 773 vs. 784
Reviewed By: mrambacher
Differential Revision: D33677812
Pulled By: pdillinger
fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76
Summary:
In order to support old-style regex function registration, this change restores the original "Register<T>(string, Factory)" method using regular expressions. The PatternEntry methods were left in place but renamed to AddFactory. The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.
Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362
Reviewed By: pdillinger
Differential Revision: D33432562
Pulled By: mrambacher
fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
Summary:
Fix a bug that caused file temperature information not to be preserved after the DB is restarted, or after options.max_manifest_file_size is hit.
Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242
Test Plan: Add a unit test that would fail without the fix.
Reviewed By: jay-zhuang
Differential Revision: D32818150
fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec
Summary:
https://github.com/facebook/rocksdb/issues/9026 fixed histogram NUM_FILES_IN_SINGLE_COMPACTION for level compaction, but missed fix for universal compaction.
This PR fixed NUM_FILES_IN_SINGLE_COMPACTION for universal compaction.
Quote from https://github.com/facebook/rocksdb/issues/9026:
> currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input.
Thanks to ajkr for pointing out this missed fix!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9168
Reviewed By: akankshamahajan15
Differential Revision: D32434494
Pulled By: ajkr
fbshipit-source-id: 93ea092af4afbd8dce67898ffb350cf26b065ed2
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
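Example configuration (a sketch; 0 is assumed to be the default, which disables prefetching):
```
#include "rocksdb/options.h"

rocksdb::Options MakeBlobOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  // Prefetch 2 MB at a time when compaction reads blobs, which helps on
  // HDDs and remote filesystems.
  options.blob_compaction_readahead_size = 2 * 1024 * 1024;
  return options;
}
```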
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957
RocksDB has supported the user-defined timestamp feature. An application can specify a timestamp
when writing each k-v pair. When data is flushed from memory to disk files called SST files, the file
creation activity is committed to the MANIFEST. This commit is for tracking timestamp info in the
MANIFEST for each file. The changes involved are as follows:
1) Track max/min timestamp in FileMetaData, and fix the involved code.
2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMaxTimestamp in
NewFileCustomTag (in the kNewFile4 part), and support the involved code such as
VersionEdit Encode and Decode etc.
3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix the involved test code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092
Reviewed By: ajkr, akankshamahajan15
Differential Revision: D32252323
Pulled By: riversand963
fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162
Summary:
For multiple versions (ts + seq) of the same user key, if they cross the boundary of `full_history_ts_low_`,
we should retain the version that is visible at `full_history_ts_low_`. Namely, we keep the internal key
with the largest timestamp smaller than `full_history_ts_low`.
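A concrete example of the rule (illustrative numbers):
```
full_history_ts_low = 8
versions of user key "k": ts=12, ts=9, ts=7, ts=3

ts=12, ts=9 : >= full_history_ts_low, retained as history
ts=7        : largest timestamp below 8, i.e. the version visible at
              full_history_ts_low, so it must be retained
ts=3        : older version shadowed by ts=7, may be dropped
```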
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9116
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D32261514
Pulled By: riversand963
fbshipit-source-id: e10f47c254c04c05261440051e4f50cb7d95474e
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
Reviewed By: riversand963
Differential Revision: D32007222
Pulled By: anand1976
fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105
The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, an application can insert
one key `key`, and use a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.
```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()
Rollback txn
Flush()
```
There will be two SSTs resulting from the above. One of them contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.
```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```
The operation sequence 2 can result in an L0 with only the SD.
Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).
```
else if (clear_and_output_next_key_) {
  assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```
To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.
Reviewed By: ltamasi
Differential Revision: D32056848
fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060
RocksDB bottommost level compaction may zero out an internal key's sequence if
the key's sequence is in the earliest_snapshot.
In write-prepared transaction, checking the visibility of a certain sequence in
a specific released snapshot may return a "snapshot released" result.
Therefore, it is possible, after a certain sequence of events, a PUT has its
sequence zeroed out, but a subsequent SingleDelete of the same key will still
be output with its original sequence. This violates the ascending order of
keys and leads to incorrect results.
The solution is to use an extra variable `last_key_seq_zeroed_` to track the
information about visibility in earliest snapshot. With this variable, we can
know for sure that a SingleDelete is in the earliest snapshot even if the said
snapshot is released during compaction before processing the SD.
Reviewed By: ltamasi
Differential Revision: D31813016
fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by setting
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles.
When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in the normal compaction picking process, in the hope that by the time TTL is reached, very few extra compactions are needed.
In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on the original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset to the old timestamp. Without this change, in many cases, most files in one level would keep their old timestamp even when they contain newer data, and we would get stuck in that state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749
Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.
Reviewed By: jay-zhuang
Differential Revision: D30735261
fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061
In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:
```
earliest_snap < earliest_write_conflict_snap
```
If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.
Reviewed By: ltamasi
Differential Revision: D31813017
fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
Summary:
This commit introduces incremental compaction in universal style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implementation simply picks up files totaling about max_compaction_bytes to compact, and executes the compaction if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce the frequency of full compactions, although it can introduce a penalty.
In order to cut files more efficiently so that more files from upper levels can be included, the SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of the target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 The number of files indeed increases, but it is not out of control.
Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655
Test Plan: Add unit tests to the functionality.
Reviewed By: ajkr
Differential Revision: D31787034
fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
Summary:
Right now, when picking a compaction, grandparent files are taken from output_level + 1. This usually works, but if that level doesn't have any overlapping file, it will be more efficient to go further down. This is because the files are likely to be trivially moved further and might create a violation of max_compaction_bytes. This situation can naturally happen, and might happen even more with TTL compactions. There is no harm in fixing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9051
Test Plan: Run existing tests and see it passes. Also briefly run crash test.
Reviewed By: ajkr
Differential Revision: D31748829
fbshipit-source-id: 52b99ab4284dc816d22f34406d528a3c98ff6719
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.
In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)
This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
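Putting the two knobs together, a configuration sketch matching the example above:
```
#include "rocksdb/options.h"

rocksdb::Options MakeBlobGcOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.enable_blob_garbage_collection = true;
  // The oldest 25% of blob files (by age) are eligible for GC.
  options.blob_garbage_collection_age_cutoff = 0.25;
  // Schedule targeted compactions once >= 90% of the bytes in those
  // eligible oldest blob files are garbage. The default of 1.0 (100%)
  // leaves the feature inactive.
  options.blob_garbage_collection_force_threshold = 0.9;
  return options;
}
```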
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
Reviewed By: riversand963
Differential Revision: D31489850
Pulled By: ltamasi
fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
Summary:
- Update a few stale GitHub wiki link references from rocksdb.org
- Update the API comments for ignore_range_deletions
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8983
Reviewed By: ajkr
Differential Revision: D31355965
Pulled By: ramvadiv
fbshipit-source-id: 245ac4a6913976dd82afa308bc4aae6bff3d788c
Summary:
There were three implementations of VectorIterator (util/vector_iterator, test_util/testutil.h and LoggingForwardVectorIterator). Merged them into one class to increase code coverage/testing and reduce duplication.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8901
Reviewed By: pdillinger
Differential Revision: D31022673
Pulled By: mrambacher
fbshipit-source-id: 8e3acbd2dfd60b4df609d02cc72846de2389d531
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
Summary:
Added support for SingleDelete for user-defined timestamps. Users can now Get and Iterate over keys deleted with SingleDelete. It also includes changes in CompactionIterator which preserves the same user key with different timestamps, unless the timestamp is below a certain threshold full_history_ts_low.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8921
Test Plan: Added new unit tests
Reviewed By: riversand963
Differential Revision: D31098191
Pulled By: akankshamahajan15
fbshipit-source-id: 78a59ef4b4884ae324fcd10f56e62a27d5ee2f49
Summary:
Add support for fallback to local compaction: the user can
return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to
run the compaction locally instead of waiting for the remote compaction
result.
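A sketch of the fallback decision, written against the `StartV2`/`WaitForCompleteV2` interface mentioned earlier in this log (parameter types and the header are assumed, so treat this as illustrative only):
```
#include <string>
#include "rocksdb/options.h"

class FallbackCompactionService : public rocksdb::CompactionService {
 public:
  const char* Name() const override { return "FallbackCompactionService"; }

  rocksdb::CompactionServiceJobStatus StartV2(
      const rocksdb::CompactionServiceJobInfo& /*info*/,
      const std::string& /*compaction_service_input*/) override {
    // Remote tier unavailable: tell RocksDB to run this compaction
    // locally instead of waiting for a remote result.
    return rocksdb::CompactionServiceJobStatus::kUseLocal;
  }

  rocksdb::CompactionServiceJobStatus WaitForCompleteV2(
      const rocksdb::CompactionServiceJobInfo& /*info*/,
      std::string* /*compaction_service_result*/) override {
    return rocksdb::CompactionServiceJobStatus::kUseLocal;
  }
};
```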
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8709
Test Plan: unittest
Reviewed By: ajkr
Differential Revision: D30560163
Pulled By: jay-zhuang
fbshipit-source-id: 65d8905a4a1bc185a68daa120997f21d3198dbe1
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR adds two structures, BlobFileInfo and BlobFileGarbageInfo, that contain the required information about blob files.
2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted (see the sketch after this list).
3. Test OnFile*Finish operations notifications with Blob Files.
4. Log the blob file creation/deletion events through EventLogger in Log file.
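A sketch of subscribing to the new notifications (callback names from the list above; the exact info-struct fields referenced in comments are assumptions):
```
#include "rocksdb/listener.h"

class BlobFileEventLogger : public rocksdb::EventListener {
 public:
  void OnBlobFileCreated(const rocksdb::BlobFileCreationInfo& info) override {
    // e.g. log info.file_path after a blob file is created
  }
  void OnBlobFileDeleted(const rocksdb::BlobFileDeletionInfo& info) override {
    // e.g. log info.file_path after a blob file is deleted
  }
};
```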
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675
Test Plan: Add new unit tests in listener_test
Reviewed By: ltamasi
Differential Revision: D30412613
Pulled By: akankshamahajan15
fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07
Summary:
Add compaction priority information in RemoteCompaction, which
can be used to schedule high priority job first.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8707
Test Plan: unittest
Reviewed By: ajkr
Differential Revision: D30548401
Pulled By: jay-zhuang
fbshipit-source-id: b30446511fb31b4583c49edd8565d496cf013a34
Summary:
It's always annoying to find a header that does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.
rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)
Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.
This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893
Test Plan: existing tests and resolving issues detected by new check
Reviewed By: mrambacher
Differential Revision: D30823300
Pulled By: pdillinger
fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
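With the factory being Customizable, it can also be configured from an options string; a sketch (the registered id string for the skip-list factory is an assumption here):
```
#include "rocksdb/convenience.h"
#include "rocksdb/options.h"

rocksdb::Status ConfigureMemtableFactory(
    const rocksdb::ColumnFamilyOptions& base,
    rocksdb::ColumnFamilyOptions* out) {
  rocksdb::ConfigOptions config_options;
  return rocksdb::GetColumnFamilyOptionsFromString(
      config_options, base, "memtable_factory=skip_list", out);
}
```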
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c