Summary:
**Context/Summary:**
This reverts commit fc74abb436 and related HISTORY record.
The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004
Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D41572604
Pulled By: hx235
fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910
Reviewed By: riversand963
Differential Revision: D40880683
Pulled By: ajkr
fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress
./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```
**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
- [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
- For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
- This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
- For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
- This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
- Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.
Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only
Reviewed By: ajkr
Differential Revision: D40090485
Pulled By: hx235
fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
Summary:
FIFO compaction can theoretically open a DB with any compaction style.
However, the current code only allows FIFO compaction to open a DB with
a single level.
This PR relaxes the limitation of FIFO compaction and allows it to open a
DB with multiple levels. Below is the read / write / compaction behavior:
* The read behavior is untouched, and it works like a regular rocksdb instance.
* The write behavior is untouched as well. When a FIFO compacted DB
is opened with multiple levels, all new files will still be in level 0, and no files
will be moved to a different level.
* Compaction logic is extended. It will first identify the bottom-most non-empty level.
Then, it will delete the oldest file in that level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348
Test Plan:
Added a new test to verify the migration from level to FIFO where the db has multiple levels.
Extended existing test cases in db_test and db_basic_test to also verify
all entries of a key after reopening the DB with FIFO compaction.
Reviewed By: jay-zhuang
Differential Revision: D40233744
fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
Summary:
With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773
Test Plan:
Reproduce: D39993203 has detailed steps.
Run the test with and without the fix.
Reviewed By: cbi42
Differential Revision: D40077955
Pulled By: cbi42
fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240
Summary:
when there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap memtable iterator directly. Currently we create a merging iterator on top of the memtable iterator, and have DBIter wrap around it. This causes iterator regression and this PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705
Test Plan:
- `make check`
- Performance:
- Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000`
- Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts`
```
seek_nexts main op/sec https://github.com/facebook/rocksdb/issues/10705 RocksDB v7.6
0 5746568 5749033 5786180
30 2411690 3006466 2837699
1000 102556 128902 124667
```
Reviewed By: ajkr
Differential Revision: D39644221
Pulled By: cbi42
fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d
Summary:
sanity check value for option `memtable_protection_bytes_per_key` in `ColumnFamilyData::ValidateOptions()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10621
Test Plan: `make check`, added unit test in ColumnFamilyTest.
Reviewed By: ajkr
Differential Revision: D39180133
Pulled By: cbi42
fbshipit-source-id: 009e0da3ccb332d1c9e14d20193304610bd4eb8a
Summary:
- Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable.
```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.
Single thread result:
| Max # tombstones | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op | 99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |6.68 |6.57 |6.72 |4.72 |4.79 |4.54 |
| 1 |6.67 |6.58 |6.62 |5.41 |4.74 |4.72 |
| 10 |6.59 |6.5 |6.56 |7.83 |4.69 |4.59 |
| 100 |6.62 |6.75 |6.58 |29.57 |5.04 |5.09 |
| 1000 |6.54 |6.82 |6.61 |320.33 |5.22 |5.21 |
32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op | 99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |234.52 |260.25 |239.42 |5.06 |5.38 |5.09 |
| 1 |236.46 |262.0 |231.1 |19.57 |22.14 |5.45 |
| 10 |236.95 |263.84 |251.49 |151.73 |21.61 |5.73 |
| 100 |268.16 |296.8 |280.13 |2308.52 |22.27 |6.57 |
Reviewed By: ajkr
Differential Revision: D37916564
Pulled By: cbi42
fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
Summary:
TL;DR: due to a recent change, if you drop a column family,
often that DB will no longer fsync after writing new SST files
to remaining or new column families, which could lead to data
loss on power loss.
More bug detail:
The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
DB::Close time rather than waiting for DB object destruction.
Unfortunately, it also closes shared FSDirectory objects on
DropColumnFamily (& destroy remaining handles), which can lead
to use-after-Close on FSDirectory shared with remaining column
families. Those "uses" are only Fsyncs (or redundant Closes). In
the default Posix filesystem, an Fsync on a closed FSDirectory is a
quiet no-op. Consequently (under most configurations), if you drop
a column family, that DB will no longer fsync after writing new SST
files to column families sharing the same directory (true under most
configurations).
More fix detail:
Basically, this removes unnecessary Close ops on destroying
ColumnFamilyData. We let `shared_ptr` take care of calling the
destructor at the right time. If the intent was to require Close be
called before destroying FSDirectory, that was not made clear by the
author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
timely destruction of FSDirectory to suffice as Close (in
CountedFileSystem). Added a TODO to revisit that.
Also in this PR:
* Added a TODO to share FSDirectory instances between DB and its column
families. (Already shared among column families.)
* Made DB::Close attempt to close all its open FSDirectory objects even
if there is a failure in closing one. Also code clean-up around this
logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
Test Plan:
add an assert to check for use-after-Close. With that
existing tests can detect the misuse. With fix, tests pass (except noted
relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)
Reviewed By: ajkr
Differential Revision: D38357922
Pulled By: pdillinger
fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
Summary:
Minor issue, I just found a few typos on db_test and column_family while reading the code. And I have this PR opened to contribute. :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10139
Reviewed By: ajkr
Differential Revision: D38007098
Pulled By: jay-zhuang
fbshipit-source-id: 511947b32424c34348184691216640f32c410fb1
Summary:
Currently, when installing a new super version, when stalling condition triggers, we compare estimated compaction bytes to previously, and if the new value is larger or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be used. The result is that, although some compactions reduce estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, it causes slowdown rate drops all the way to the minimum, far lower than needed.
Fix the bug by not triggering a re-calculation if a new super version doesn't have Version or a memtable change. With this fix, number of compaction finishes are still undercounted in this algorithm, but it is still better than the current bug where they are negatively counted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270
Test Plan: Run a benchmark where the slowdown rate is dropped to minimal unnessarily and see it is back to a normal value.
Reviewed By: ajkr
Differential Revision: D37497327
fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
Summary:
**Summary**
Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
**Motivation**
RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible.
Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.
**Content of this PR**
This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.
**Benchmarking**
I will add numbers to prove that there is no performance impact within the next 12 hours.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
Reviewed By: pdillinger
Differential Revision: D36462357
Pulled By: bjlemaire
fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.
Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
Reviewed By: ltamasi
Differential Revision: D37294735
Pulled By: gangliao
fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
Summary:
The default implementation of Close() function in Directory/FSDirectory classes returns `NotSupported` status. However, we don't want operations that worked in older versions to begin failing after upgrading when run on FileSystems that have not implemented Directory::Close() yet. So we require the upper level that calls Close() function should properly handle "NotSupported" status instead of treating it as an error status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10127
Reviewed By: ajkr
Differential Revision: D36971112
Pulled By: littlepig2013
fbshipit-source-id: 100f0e6ad1191e1acc1ba6458c566a11724cf466
Summary:
In [close_db_dir](https://github.com/facebook/rocksdb/pull/10049) pull request, some merging conflicts occurred (some comments and one line `s.PermitUncheckedError()` are missing). This pull request aims to put them back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10093
Reviewed By: ajkr
Differential Revision: D36884117
Pulled By: littlepig2013
fbshipit-source-id: 8c0e2a8793fc52804067c511843bd1ff4912c1c3
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
Reviewed By: ajkr, hx235
Differential Revision: D36722135
Pulled By: littlepig2013
fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
Although ColumnFamilySet comments say that DB mutex can be
freed during iteration, as long as you hold a ref while releasing DB
mutex, this is not quite true because UnrefAndTryDelete might delete cfd
right before it is needed to get ->next_ for the next iteration of the
loop.
This change solves the problem by making a wrapper class that makes such
iteration easier while handling the tricky details of UnrefAndTryDelete
on the previous cfd only after getting next_ in operator++.
FreeDeadColumnFamilies should already have been obsolete; this removes
it for good. Similarly, ColumnFamilySet::iterator doesn't need to check
for cfd with 0 refs, because those are immediately deleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9730
Test Plan:
was reported with ASAN on unit tests like
DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching
Reviewed By: ltamasi
Differential Revision: D35038143
Pulled By: pdillinger
fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9
Summary:
https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction.
make sure we only unschedule the task when it's scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659
Reviewed By: ajkr
Differential Revision: D34651820
Pulled By: jay-zhuang
fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
Summary:
Timer crash when multiple DB instances doing heavy DB open and close
operations concurrently. Which is caused by adding a timer task with
smaller timestamp than the current running task. Fix it by moving the
getting new task timestamp part within timer mutex protection.
And other fixes:
- Disallow adding duplicated function name to timer
- Fix a minor memory leak in timer when a running task is cancelled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656
Reviewed By: ajkr
Differential Revision: D34626296
Pulled By: jay-zhuang
fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
Summary:
As disscussed in (https://github.com/facebook/rocksdb/issues/9223), Here added a new API named DB::OpenAndTrimHistory, this API will open DB and trim data to the timestamp specofied by **trim_ts** (The data with newer timestamp than specified trim bound will be removed). This API should only be used at a timestamp-enabled db instance recovery.
And this PR implemented a new iterator named HistoryTrimmingIterator to support trimming history with a new API named DB::OpenAndTrimHistory. HistoryTrimmingIterator wrapped around the underlying InternalITerator such that keys whose timestamps newer than **trim_ts** should not be returned to the compaction iterator while **trim_ts** is not null.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410
Reviewed By: ltamasi
Differential Revision: D34410207
Pulled By: riversand963
fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236
Summary:
Saw error like this:
`Backup failed -- IO error: No such file or directory: While opening a
file for sequentially reading:
/dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or
directory`
Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.)
relies on no file deletions happening while its operating, which
means not only disabling (more) deletions, but ensuring any pending
deletions are completed. Two fixes related to this:
* There was a gap in several places between decrementing
pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where
the db mutex would be released and GetSortedWalFiles (and others) could
get false information that no deletions are pending.
* The fix to https://github.com/facebook/rocksdb/issues/8591 (disabling deletions in GetSortedWalFiles) seems
incomplete because it doesn't prevent pending deletions from occuring
during the operation (if deletions not already disabled, the case that
was to be fixed by the change).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9208
Test Plan:
existing tests (it's hard to write a test for interleavings
that are now excluded - this is what stress test is for)
Reviewed By: ajkr
Differential Revision: D32630675
Pulled By: pdillinger
fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162
Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.
Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.
Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.
Reviewed By: ltamasi
Differential Revision: D31567960
fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.
In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)
This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
Reviewed By: riversand963
Differential Revision: D31489850
Pulled By: ltamasi
fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
Summary:
This change introduces warnings instead of a silent override when trying to use level_compaction_dynamic_level_bytes with multiple cf_paths/db_paths.
I have completed the CLA.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8329
Reviewed By: hx235
Differential Revision: D31399713
Pulled By: ajkr
fbshipit-source-id: 29c6fe5258d1f739b4590ecd44aee44f55415595
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
Summary:
RocksDB exposes certain internal statistics via the DB property interface.
However, there are currently no properties related to BlobDB.
For starters, we would like to add the following BlobDB properties:
`rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
`rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
`rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
`rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
`rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734
Test Plan:
```
➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
[==========] Running 16 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from DBBlobBasicTest
[ RUN ] DBBlobBasicTest.GetBlob
[ OK ] DBBlobBasicTest.GetBlob (12 ms)
[ RUN ] DBBlobBasicTest.MultiGetBlobs
[ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
[ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex
[ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
[ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
[ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
[ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
[ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
[ RUN ] DBBlobBasicTest.GenerateIOTracing
[ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
[ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
[ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
[ RUN ] DBBlobBasicTest.GetMergeBlobWithPut
[ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
[ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut
[ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
[ RUN ] DBBlobBasicTest.BlobDBProperties
[ OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
[----------] 10 tests from DBBlobBasicTest (124 ms total)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)
[----------] Global test environment tear-down
[==========] 16 tests from 2 test cases ran. (2190 ms total)
[ PASSED ] 16 tests.
```
Reviewed By: ltamasi
Differential Revision: D30690849
Pulled By: Zhiyi-Zhang
fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
Test Plan: existing tests
Reviewed By: mrambacher
Differential Revision: D30700039
Pulled By: pdillinger
fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
https://github.com/facebook/rocksdb/issues/8440 ,
https://github.com/facebook/rocksdb/issues/8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.
Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.
The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605
Test Plan: Ran `make check` and stress tests locally.
Reviewed By: pdillinger
Differential Revision: D30051035
Pulled By: ltamasi
fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
Summary:
FileOptions has an implicit conversion from EnvOptions and some
internal APIs take `const FileOptions&` and save the reference, which is
counter to Google C++ guidelines,
> Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements.
This is at least a problem for repair.cc, which passes an EnvOptions to
TableCache(), which would save a reference to the temporary copy as
FileOptions. This was unfortunately only caught as a side effect of
changes in https://github.com/facebook/rocksdb/issues/8544.
This change fixes the repair.cc case and updates the involved internal
APIs that save a reference to use `const FileOptions*` instead.
Unfortunately, I don't know how to get any of our sanitizers to reliably
report bugs like this, so I can't rule out more existing in our
codebase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571
Test Plan:
Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on
AWS EC2)
Reviewed By: ajkr
Differential Revision: D29943890
Pulled By: pdillinger
fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9
Summary:
The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge.
By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558
Test Plan: Already included in `db_flush_test.cc`.
Reviewed By: anand1976
Differential Revision: D29764351
Pulled By: bjlemaire
fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437
Summary:
In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables.
The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables.
This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528
Reviewed By: pdillinger
Differential Revision: D29701097
Pulled By: bjlemaire
fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9
Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
Test Plan: add the test to lru_cache_test
Reviewed By: pdillinger
Differential Revision: D29006215
Pulled By: zhichao-cao
fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
Summary:
This patch does two things:
1) Introduces some aliases in order to eliminate/prevent long-winded type names
w/r/t the internal table property collectors (see e.g.
`std::vector<std::unique_ptr<IntTblPropCollectorFactory>>`).
2) Makes it possible to apply only a subrange of table property collectors during
table building by turning `TableBuilderOptions::int_tbl_prop_collector_factories`
from a pointer to a `vector` into a range (i.e. a pair of iterators).
Rationale: I plan to introduce a BlobDB related table property collector, which
should only be applied during table creation if blob storage is enabled at the moment
(which can be changed dynamically). This change will make it possible to include/
exclude the BlobDB related collector as needed without having to introduce
a second `vector` of collectors in `ColumnFamilyData` with pretty much the same
contents.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8298
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D28430910
Pulled By: ltamasi
fbshipit-source-id: a81d28f2c59495865300f43deb2257d2e6977c8e
Summary:
Larger arena block size does provide the benefit of reducing allocation overhead, however it may cause other troubles. For example, allocator is more likely not to allocate them to physical memory and trigger page fault. Weighing the risk, we cap the arena block size to 1MB. Users can always use a larger value if they want.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7907
Test Plan: Run all existing tests
Reviewed By: pdillinger
Differential Revision: D26135269
fbshipit-source-id: b7f55afd03e6ee1d8715f90fa11b6c33944e9ea8
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
Renaming ImmutableCFOptions::info_log and statistics to logger and stats. This is stage 2 in creating an ImmutableOptions class. It is necessary because the names match those in ImmutableOptions and have different types.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8227
Reviewed By: jay-zhuang
Differential Revision: D28000967
Pulled By: mrambacher
fbshipit-source-id: 3bf2aa04e8f1e8724d825b7deacf41080c14420b
Summary:
Resolves https://github.com/facebook/rocksdb/issues/8014
- Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used.
- This is to align with what the docs mention and to prevent premature data deletion.
- Update tests to work with this assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172
Test Plan:
```bash
$ make check -j$(nproc)
Generated TARGETS Summary:
- 6 libs
- 0 binarys
- 180 tests
```
Reviewed By: ajkr
Differential Revision: D27768792
Pulled By: thejchap
fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
This PR
- adds a class `ManifestTailer` that inherits from `VersionEditHandlerPointInTime`. `ManifestTailer::Iterate()` can be called multiple times to tail the primary instance's MANIFEST and apply the changes to the secondary,
- updates the implementation of `ReactiveVersionSet::ReadAndApply` to use this class,
- removes unused code in version_set.cc,
- updates existing tests, e.g. removing deleted sync points from unit tests,
- adds a new test to address the bug in https://github.com/facebook/rocksdb/issues/7815.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7998
Test Plan:
make check
Existing and newly-added tests in version_set_test.cc and db_secondary_test.cc
Reviewed By: jay-zhuang
Differential Revision: D26926641
Pulled By: riversand963
fbshipit-source-id: 8d4dd15db0ba863c213f743e33b5a207e948c980
Summary:
## 1. Bug description:
When RocksDB Checkpoint, it may be stuck in `WaitUntilFlushWouldNotStallWrites` method.
## 2. Simple analysis of the reasons:
### 2.1 Configuration parameters:
```yaml
Compaction Style : Universal
max_write_buffer_number : 4
min_write_buffer_number_to_merge : 3
```
Checkpoint is usually very fast. When the Checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 Immutable MemTables, which are less than `min_write_buffer_number_to_merge`, they will not be flushed. But will enter this code.
```c++
// method: GetWriteStallConditionAndCause
if (mutable_cf_options.max_write_buffer_number> 3 &&
num_unflushed_memtables >=
mutable_cf_options.max_write_buffer_number-1) {
return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit};
}
```
code link: fbed72f03c/db/column_family.cc (L847)
Checkpoint thought there was a FlushJob, but it didn't. So will always wait.
### 2.2 solution:
Increase the restriction: the `number of Immutable MemTable` >= `min_write_buffer_number_to_merge will wait`.
If there are other better solutions, you can correct me.
### 2.3 Code that can reproduce the problem:
https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java
## 3. Interesting point
This bug will be triggered only when `the number of sorted runs >= level0_file_num_compaction_trigger`.
Because there is a break in WaitUntilFlushWouldNotStallWrites.
```c++
if (cfd->imm()->NumNotFlushed() <
cfd->ioptions()->min_write_buffer_number_to_merge &&
vstorage->l0_delay_trigger_count() <
mutable_cf_options.level0_file_num_compaction_trigger) {
break;
}
```
code link: fbed72f03c/db/db_impl/db_impl_compaction_flush.cc (L1974)
Universal may have `l0_delay_trigger_count() >= level0_file_num_compaction_trigger`, so this bug is triggered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7921
Reviewed By: jay-zhuang
Differential Revision: D26900559
Pulled By: ajkr
fbshipit-source-id: 133c1252dad7393753f04a47590b68c7d8e670df
Summary:
Added a "only_mutable_options" flag to the ConfigOptions. When set, the Configurable methods will only look at/update options that are marked as kMutable.
Fixed DB::SetOptions to allow for the update of any mutable TableFactory options. Fixes https://github.com/facebook/rocksdb/issues/7385.
Added tests for the new flag. Updated HISTORY.md
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936
Reviewed By: akankshamahajan15
Differential Revision: D26389646
Pulled By: mrambacher
fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
Summary:
Add support for IOTracing in blob files
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7958
Test Plan:
Add a new test and checked manually the trace_file for blob
files being recorded during read and write.
Reviewed By: ltamasi
Differential Revision: D26415950
Pulled By: akankshamahajan15
fbshipit-source-id: 49c2859b3a4f8307e7cb69a92704403a4da46d44
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798
Reviewed By: jay-zhuang
Differential Revision: D25680451
Pulled By: pdillinger
fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917