Summary:
There was a crash test Bus Error crash in `IndexBlockIter::SeekToFirstImpl()` <- .. <-
`BlockBasedTable::~BlockBasedTable()` with `--mmap_read=1`, which suggests some kind of incompatibility that I haven't diagnosed. Bus Error is uncommon these days as CPUs support unaligned reads, but are associated with mmap problems.
Because mmap reads really only make sense without block cache, it's not a concerning loss to essentially disable the combination.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13051
Test Plan: watch crash test
Reviewed By: jowlyzhang
Differential Revision: D63795069
Pulled By: pdillinger
fbshipit-source-id: 6c823c619840086b5c9cff53dbc7470662b096be
Summary:
This PR makes file ingestion job's flush wait a bit further until the SuperVersion is also updated. This is necessary since follow up operations will use the current SuperVersion to do range overlapping check and level assignment.
In debug mode, file ingestion job's second `NeedsFlush` call could have been invoked when the memtables are flushed but the SuperVersion hasn't been updated yet, triggering the assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13045
Test Plan:
Existing tests
Manually stress tested
Reviewed By: cbi42
Differential Revision: D63671151
Pulled By: jowlyzhang
fbshipit-source-id: 95a169e58a7e59f6dd4125e7296e9060fe4c63a7
Summary:
... to note that memory may not be freed when reusing a transaction. This means reusing a large transaction can cause excessive memory usage and it may be better to destruct the transaction object in some cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13042
Test Plan: no code change.
Reviewed By: jowlyzhang
Differential Revision: D63570612
Pulled By: cbi42
fbshipit-source-id: f19ff556f76d54831fb94715e8808035d07e25fa
Summary:
When `avoid_flush_during_shutdown` is false, DB will flush the memtables if there is some unpersisted data:
79790cf2a8/db/db_impl/db_impl.cc (L505-L510)
`has_unpersisted_data_` is a flag that is only turned on for when WAL is disabled, for example:
79790cf2a8/db/db_impl/db_impl_write.cc (L525-L528)
In other cases, it just has its default false value.
So if disableWAL is false, and avoid_flush_during_shutdown is false, close won't flush memtables. Stress test is also not flush wal/sync wal. There could be missing data, while reopen in stress test doesn't tolerate missing data. To make the test simpler, this changes it to always flush/sync wal during reopen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13039
Reviewed By: hx235
Differential Revision: D63494695
Pulled By: jowlyzhang
fbshipit-source-id: 8f0fd9ed50a482a3955abc0882257ecc2e95926d
Summary:
The following DBOptions were not being propagated through BuildDBOptions, which could at least lead to settings being lost through `GetOptionsFromString()`, possibly elsewhere as well:
* background_close_inactive_wals
* write_dbid_to_manifest
* write_identity_file
* prefix_seek_opt_in_only
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13038
Test Plan:
This problem was not being caught by
OptionsSettableTest.DBOptionsAllFieldsSettable when the option was omitted from both options_helper.cc and options_settable_test.cc. I have added to the test to catch future instances (and the updated test was how I found three of the four missing options).
The same kind of bug seems to be caught by
ColumnFamilyOptionsAllFieldsSettable, and AFAIK analogous code does not exist for BlockBasedTableOptions.
Reviewed By: ltamasi
Differential Revision: D63483779
Pulled By: pdillinger
fbshipit-source-id: a5d5f6e434174bacb8e5d251b767e81e62b7225a
Summary:
When an item is inserted into the compressed secondary cache, this PR calculates the charge using the malloc_usable_size of the allocated memory, as well as the unique pointer allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13032
Test Plan: New unit test
Reviewed By: pdillinger
Differential Revision: D63418493
Pulled By: anand1976
fbshipit-source-id: 1db2835af6867442bb8cf6d9bf412e120ddd3824
Summary:
If the lowest_used_cache_tier DB option is set to kVolatileTier, skip insertion of compressed blocks into the secondary cache. Previously, these were always inserted into the secondary cache via the InsertSaved() method, leading to pollution of the secondary cache with blocks that would never be read.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13030
Test Plan: Add a new unit test
Reviewed By: pdillinger
Differential Revision: D63329841
Pulled By: anand1976
fbshipit-source-id: 14d2fce2ed309401d9ad4d2e7c356218b6673f7b
Summary:
Add the following to the `CompactionServiceJobInfo`
- compaction_reason
- is_full_compaction
- is_manual_compaction
- bottommost_level
Added `is_remote_compaction` to the `CompactionJobStats` and set initial values to avoid UB for uninitialized values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13029
Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionInfo*"
```
Reviewed By: anand1976
Differential Revision: D63322878
Pulled By: jaykorean
fbshipit-source-id: f02a66ca45e660b9d354a43837d8ec6beb7621fb
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.
Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.
Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)
Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026
Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.
Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).
Reviewed By: ltamasi
Differential Revision: D63147378
Pulled By: pdillinger
fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
Summary:
We've been serializing and deserializing DBOptions and CFOptions (and other CF into) as part of `CompactionServiceInput`. These are all readily available in the OPTIONS file and the remote worker can read the OPTIONS file to obtain the same information. This helps reducing the size of payload significantly.
In a very rare scenario if the OPTIONS file is purged due to options change by primary host at the same time while the remote host is loading the latest options, it may fail. In this case, we just retry once.
This also solves the problem where we had to open the default CF with the CFOption from another CF if the remote compaction is for a non-default column family. (TODO comment in /db_impl_secondary.cc)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13025
Test Plan:
Unit Tests
```
./compaction_service_test
```
```
./compaction_job_test
```
Also tested with Meta's internal Offload Infra
Reviewed By: anand1976, cbi42
Differential Revision: D63100109
Pulled By: jaykorean
fbshipit-source-id: b7162695e31e2c5a920daa7f432842163a5b156d
Summary:
This PR allows a Cache object to be created using the object registry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13024
Reviewed By: pdillinger
Differential Revision: D63043233
Pulled By: anand1976
fbshipit-source-id: 5bc3f7c29b35ad62638ff8205451303e2cecea9d
Summary:
Per customer request, we should not merge multiple SST files together during temperature change compaction, since this can cause FIFO TTL compactions to be delayed. This PR changes the compaction picking logic to pick one file at a time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13018
Test Plan: * updated some existing unit tests to test this new behavior.
Reviewed By: jowlyzhang
Differential Revision: D62883292
Pulled By: cbi42
fbshipit-source-id: 6a9fc8c296b5d9b17168ef6645f25153241c8b93
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022
Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.)
Reviewed By: jowlyzhang
Differential Revision: D62977860
fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3
Summary:
* Set write_dbid_to_manifest=true by default
* Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file
* Refactor related DB open code to minimize code duplication
_Recommend hiding whitespace changes for review_
Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019
Test Plan: unit tests and stress test updated for new functionality
Reviewed By: anand1976
Differential Revision: D62898229
Pulled By: pdillinger
fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0
Summary:
Add an option `--only_print_seqno_gaps` for wal dump to help with debugging. This option will check the continuity of sequence numbers in WAL logs, assuming `seq_per_batch` is false. `--walfile` option now also takes a directory, and it will check all WAL logs in the directory in chronological order.
When a gap is found, we can further check if it's related to operations like external file ingestion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13014
Test Plan: Manually tested
Reviewed By: ltamasi
Differential Revision: D62989115
Pulled By: jowlyzhang
fbshipit-source-id: 22e3326344e7969ff9d5091d21fec2935770fbc7
Summary:
There was a subtle design/contract bug in the previous version of range filtering in experimental.h If someone implemented a key segments extractor with "all or nothing" fixed size segments, that could result in unsafe range filtering. For example, with two segments of width 3:
```
x = 0x|12 34 56|78 9A 00|
y = 0x|12 34 56||78 9B
z = 0x|12 34 56|78 9C 00|
```
Segment 1 of y (empty) is out of order with segment 1 of x and z.
I have re-worked the contract to make it clear what does work, and implemented a standard extractor for fixed-size segments, CappedKeySegmentsExtractor. The safe approach for filtering is to consume as much as is available for a segment in the case of a short key.
I have also added support for min-max filtering with reverse byte-wise comparator, which is probably the 2nd most common comparator for RocksDB users (because of MySQL). It might seem that a min-max filter doesn't care about forward or reverse ordering, but it does when trying to determine whether in input range from segment values v1 to v2, where it so happens that v2 is byte-wise less than v1, is an empty forward interval or a non-empty reverse interval. At least in the current setup, we don't have that context.
A new unit test (with some refactoring) tests CappedKeySegmentsExtractor, reverse byte-wise comparator, and the corresponding min-max filter.
I have also (contractually / mathematically) generalized the framework to comparators other than the byte-wise comparator, and made other generalizations to make the extractor limitations more explicitly connected to the particular filters and filtering used--at least in description.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13005
Test Plan: added unit tests as described
Reviewed By: jowlyzhang
Differential Revision: D62769784
Pulled By: pdillinger
fbshipit-source-id: 0d41f0d0273586bdad55e4aa30381ebc861f7044
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015
`Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened.
Reviewed By: jowlyzhang
Differential Revision: D62590773
fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28
Summary:
The internal codebase is updated for the coro directory's graduation from experimental. Updating our build script for a newer version with this change too. Using this hash: 03041f014b
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13017
Reviewed By: nickbrekhus
Differential Revision: D62763932
Pulled By: jowlyzhang
fbshipit-source-id: 1b211707fbc7d974d6d6ceaf577e174424bb44ed
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:
```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CF
2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped: 4b1d595306/file/writable_file_writer.cc (L624)
The error needs to be retryable to start background auto recovery.
3. One CF successfully flushes its memtable during auto recovery.
4. Crash the process.
5. Reopen the DB, one CF will have the update as a result of successful flush. Other CFs will miss all the updates in the write batch since WAL does not have them.
```
This can happen if a users configures manual_wal_flush, uses more than one CF, and can hit retryable error for WAL writes. This PR is a short-term fix that upgrades WAL related errors to fatal and not trigger auto recovery.
A long-term fix may be not drop buffered WAL data by checking how much data is actually written, or require atomically flushing all column families during error recovery from this kind of errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12995
Test Plan:
added unit test to check error severity and if recovery is triggered. A crash test repro command that fails in a few runs before this PR:
```
python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1
```
Reviewed By: hx235
Differential Revision: D62888510
Pulled By: cbi42
fbshipit-source-id: 308bdbbb8d897cc8eba950155cd0e37cf7eb76fe
Summary: I came across this code while buckifying parts of folly and fizz in open source. This is pretty hacky code and cleaning it up doesn't seem that hard, so I did it.
Reviewed By: zertosh, pdillinger
Differential Revision: D62781766
fbshipit-source-id: 43714bce992c53149d1e619063d803297362fb5d
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13010
The OnAddFile cur_compactions_reserved_size_ accounting causes wraparound when re-opening a database with an unowned SstFileManager and during recovery. It was introduced in #4164 which addresses out of space recovery with an unclear purpose. Compaction jobs do this accounting via EnoughRoomForCompaction/OnCompactionCompletion and to my understanding would never reuse a sst file name.
Reviewed By: anand1976
Differential Revision: D62535775
fbshipit-source-id: a7c44d6e0a4b5ff74bc47abfe57c32ca6770243d
Summary:
For SST checksum mismatch corruptions in the read path, RocksDB retries the read if the underlying file system supports verification and reconstruction of data (`FSSupportedOps::kVerifyAndReconstructRead`). There were a couple of places where the retry was missing - reading the SST footer and the properties block. This PR fixes the retry in those cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13007
Test Plan: Add new unit tests
Reviewed By: jaykorean
Differential Revision: D62519186
Pulled By: anand1976
fbshipit-source-id: 50aa38f18f2a53531a9fc8d4ccdf34fbf034ed59
Summary:
in ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not a issue if refit level is successful, since new FileMetaData is created for files at the target level. However, if there's an error during RefitLevel(), e.g., Manifest write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until db reopen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009
Test Plan:
existing test.
- stress test failure in T200339331 should not happen anymore.
Reviewed By: hx235
Differential Revision: D62597169
Pulled By: cbi42
fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e
Summary:
Prepare this internal API to be used by atomic data replacement. The main purpose of this API is to get a `VersionEdit` to mark the entire current `MemTableListVersion` as dropped. Flush needs the similar functionality when installing results, so that logic is refactored into a util function `GetDBRecoveryEditForObsoletingMemTables` to be shared by flush and this internal API.
To test this internal API, flush's result installation is redirected to use this API when it is flushing all the immutable MemTables in debug mode. It should achieve the exact same results, just with a duplicated `VersionEdit::log_number` field that doesn't upsets the recovery logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13001
Test Plan: Existing tests
Reviewed By: pdillinger
Differential Revision: D62309591
Pulled By: jowlyzhang
fbshipit-source-id: e25914d9a2e281c25ab7ee31a66eaf6adfae4b88
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13004
The patch extends the buckifier script so it generates a target for `db_bench` as well.
Reviewed By: cbi42
Differential Revision: D62407071
fbshipit-source-id: 0cb98a324ce0598ad84a8675aa77b7d0f91bf40c
Summary:
Add the `ApplyToHandle` method to the `Cache` interface to allow a caller to request the invocation of a callback on the given cache handle. The goal here is to allow a cache that manages multiple cache instances to use a callback on a handle to determine which instance it belongs to. For example, the callback can hash the key and use that to pick the correct target instance. This is useful to redirect methods like `Ref` and `Release`, which don't know the cache key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12987
Reviewed By: pdillinger
Differential Revision: D62151907
Pulled By: anand1976
fbshipit-source-id: e4ffbbb96eac9061d2ab0e7e1739eea5ebb1cd58
Summary:
`Compaction` is already creating its own ref for the input Version: 4b1d595306/db/compaction/compaction.cc (L73)
And properly Unref it during destruction:
4b1d595306/db/compaction/compaction.cc (L450)
This PR redirects compaction's access of `cfd->current()` to this input `Version`, to prepare for when a column family's data can be replaced all together, and `cfd->current()` is not safe to access for a compaction job. Because a new `Version` with just some other external files could be installed as `cfd->current()`. The compaction job's expectation of the current `Version` and the corresponding storage info to always have its input files will no longer be guaranteed.
My next follow up is to do a similar thing for flush, also to prepare it for when a column family's data can be replaced. I will make it create its own reference of the current `MemTableListVersion` and use it as input, all flush job's access of memtables will be wired to that input `MemTableListVersion`. Similarly this reference will be unreffed during a flush job's destruction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12992
Test Plan: Existing tests
Reviewed By: pdillinger
Differential Revision: D62212625
Pulled By: jowlyzhang
fbshipit-source-id: 9a781213469cf366857a128d50a702af683a046a
Summary:
The `SchedulePending*` API is a bit confusing since it doesn't immediately schedule the work and can be confused with the actual scheduling. So I have changed these to be `EnqueuePending*` and added some documentation for the corresponding state transitions of these background work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12994
Test Plan: existing tests
Reviewed By: cbi42
Differential Revision: D62252746
Pulled By: jowlyzhang
fbshipit-source-id: ee68be6ed33070cad9a5004b7b3e16f5bcb041bf
Summary:
* https://github.com/facebook/rocksdb/issues/12936 was insufficient to fix the std::optional false positives. Making a fix validated in CI this time (see https://github.com/facebook/rocksdb/issues/12991)
* valgrind grinds to a halt on startup on my dev machine apparently because it expects internet access. Disable its attempts to access the internet when git is using a proxy.
* Move PORTABLE=1 from CI job to the Makefile. Without it, valgrind complains about illegal instructions (too new)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12990
Test Plan: manual, watch nightly valgrind job
Reviewed By: ltamasi
Differential Revision: D62203242
Pulled By: pdillinger
fbshipit-source-id: a611b08da7dbd173b0709ed7feb0578729553a17
Summary:
It appears the arm testsuite is failing because it is building without snappy, which is causing the SST files not to be compressed, which somehow causes these tests to fail. Manually setting LZ4 which is already required.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12993
Test Plan: reproduced and verified fix on ARM laptop
Reviewed By: anand1976
Differential Revision: D62216451
Pulled By: pdillinger
fbshipit-source-id: 3f21fcd9be0edaa66c7eca0cb7d56b998171e263
Summary:
`ignore_unknown_options=true` had an undocumented behavior of having no effect (disallow unknown options) if reading from the same or older major.minor version. Presumably this was intended to catch unintentional addition of new options in a patch release, but there is no automated version compatibility testing between patch releases. So this was a bad choice without such testing support, because it just means users would hit the failure in case of adding features to a patch release.
In this diff we respect ignore_unknown_options when reading a file from any newer version, even patch versions, and document this behavior in the API.
I don't think it's practical or necessary to test among patch releases in check_format_compatible.sh. This seems like an exceptional case of applying a *different semantics* to patch version updates than to minor/major versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12989
Test Plan: unit test updated (and refactored)
Reviewed By: jaykorean
Differential Revision: D62168738
Pulled By: pdillinger
fbshipit-source-id: fb3c3ef30f0bbad0d5ffcc4570fb9ef963e7daac
Summary:
`check_format_compatible` script was broken due to extra comma added in 5b8f5cbcf4
e.g. https://github.com/facebook/rocksdb/actions/runs/10505042711/job/29101787220
```
...
2024-08-23T11:44:15.0175202Z == Building 9.5.fb, debug
2024-08-23T11:44:15.0190592Z fatal: ambiguous argument '_tmp_origin/9.5.fb,': unknown revision or path not in the working tree.
...
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12988
Test Plan:
```
tools/check_format_compatible.sh
```
```
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.6.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.6.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.6.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.7.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.7.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.7.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.8.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.8.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.8.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.9.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.9.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.9.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.10.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.10.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.10.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.11.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.11.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.11.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.0.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.0.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.0.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.1.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.1.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.1.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.2.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.2.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.2.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.3.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.3.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.3.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.4.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.4.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.4.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.5.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.5.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.5.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.6.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.6.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.6.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
==== Compatibility Test PASSED ====
```
Reviewed By: pdillinger
Differential Revision: D62162454
Pulled By: jaykorean
fbshipit-source-id: 562225c6cb27e0eb66f241a6f9424dc624d8c837
Summary:
Met the following error while compiling the project.
```
build_tools/check-sources.sh
utilities/fault_injection_fs.cc:509: // If there<E2><80><99>s no injected error, then cb will be called asynchronously when
utilities/fault_injection_fs.cc:510: // target_ actually finishes the read. But if there<E2><80><99>s an injected error, it
utilities/fault_injection_fs.cc:512: // isn<E2><80><99>t invoked at all.
^^^^ Use only ASCII characters in source files
make[1]: *** [Makefile:1291: check-sources] Error 1
make[1]: Leaving directory '/home/janus/Github/symious/rocksdb'
make: *** [Makefile:1084: check] Error 2
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12972
Reviewed By: hx235
Differential Revision: D61923865
Pulled By: cbi42
fbshipit-source-id: 63af0a38fea15e09a860895bdd5ed0a57700e447
Summary:
Add option `IngestExternalFileOptions::link_files` that hard links input files and preserves original file links after ingestion, unlike `move_files` which will unlink input files after ingestion. This can be useful when being used together with `allow_db_generated_files` to ingest files from another DB. Also reverted the change to `move_files` in https://github.com/facebook/rocksdb/issues/12959 to simplify the contract so that it will always unlink input files without exception.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12980
Test Plan: updated unit test `ExternSSTFileLinkFailFallbackTest.LinkFailFallBackExternalSst` to test that input files will not be unlinked.
Reviewed By: pdillinger
Differential Revision: D61925111
Pulled By: cbi42
fbshipit-source-id: eadaca72e1ae5288bdd195d57158466e5656fa62
Summary:
There are several crash test failures due to DB verification failure. Retain some trace history in the expected state directory to make debugging easier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12978
Reviewed By: cbi42
Differential Revision: D61864921
Pulled By: anand1976
fbshipit-source-id: 9f3f37b7e1e958bc89a3cf0373182354c2c1aa3b
Summary:
Followed instruction per https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#defining-access-for-the-github_token-scopes
It turns out that we did not need any of these except `Metadata: read`.
Before
```
GITHUB_TOKEN Permissions
Actions: write
Attestations: write
Checks: write
Contents: write
Deployments: write
Discussions: write
Issues: write
Metadata: read
Packages: write
Pages: write
PullRequests: write
RepositoryProjects: write
SecurityEvents: write
Statuses: write
```
After
```
GITHUB_TOKEN Permissions
Metadata: read
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12973
Test Plan: GitHub Actions triggered by this PR
Reviewed By: cbi42
Differential Revision: D61812651
Pulled By: jaykorean
fbshipit-source-id: 4413756c93f503e8b2fb77eb8b684ef9e6a6c13d
Summary:
so `IngestExternalFileOptions::move_files` and `IngestExternalFileOptions::allow_db_generated_files` are now compatible. The original file links won't be removed if `allow_db_generated_files` is true. This is to prevent deleting files from another DB.
There was a [comment](https://github.com/facebook/rocksdb/pull/12750#discussion_r1684509620) in https://github.com/facebook/rocksdb/issues/12750 about how exactly-once ingestion would work with `move_files`. I've discussed with customer and decided that it can be done by reading the target DB to see if it contains any ingested key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12959
Test Plan: updated unit tests `IngestDBGeneratedFileTest*` to enable `move_files`.
Reviewed By: jowlyzhang
Differential Revision: D61703480
Pulled By: cbi42
fbshipit-source-id: 6b4294369767f989a2f36bbace4ca3c0257aeaf7
Summary:
.. so that appropriate implementations can return temperature information from GetChildrenFileAttributes
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12965
Test Plan: just an API placeholder for now
Reviewed By: anand1976
Differential Revision: D61748199
Pulled By: pdillinger
fbshipit-source-id: b457e324cb451e836611a0bf630c3da0f30a8abf
Summary:
We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options:
* `metadata_write_temperature` for DB "small" files that don't contain much user data
* `wal_write_temperature` for WALs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957
Test Plan: Unit test included, though it's hard to be sure we've covered all the places
Reviewed By: jowlyzhang
Differential Revision: D61664815
Pulled By: pdillinger
fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b
Summary:
Disabling the job temporarily. We will re-enable this when ready again
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12964
Reviewed By: ltamasi
Differential Revision: D61740941
Pulled By: jaykorean
fbshipit-source-id: 167e50c4f5e38d508a8e56633261611467f30690
Summary:
Issue: MultiGet(PinnableSlice) can't read out all timestamps.
Fixed the impl, and added an UT as well. In the original impl, if MultiGet reads multiple column families, a later column family would clean up timestamps of previous column family.
Fix: https://github.com/facebook/rocksdb/issues/12950#issue-2476996580
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12943
Reviewed By: anand1976
Differential Revision: D61729257
Pulled By: pdillinger
fbshipit-source-id: 55267c26076c8a59acedd27e14714711729a40df
Summary:
this helps to avoid scanning input files when ingesting db generated files: ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951
Test Plan:
* `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion
* existing unit tests for other ingestion use cases.
Reviewed By: jowlyzhang
Differential Revision: D61608285
Pulled By: cbi42
fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca