Commit graph

347 commits

Author SHA1 Message Date
Hui Xiao 9502856edd Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.

RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in 0f88160f67, 5c64fb67d2 and 87dfc1d23e.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.

Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](c62f322169/db/db_iter.cc (L342-L343)) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.

Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.

**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them.  File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (0f88160f67, 5c64fb67d2 and 87dfc1d23e.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988

Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: cbi42

Differential Revision: D41535685

Pulled By: hx235

fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 15:05:36 -08:00
Changyu Bi cc6f323705 Include estimated bytes deleted by range tombstones in compensated file size (#10734)
Summary:
compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734

Test Plan:
- Added unit tests.
- benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used:
```
./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000
```

In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size.

Compaction stats from this PR:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      2/0   31.54 MB   0.5      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     63.4    135.56            110.94       544    0.249       0      0       0.0       0.0
  L4      3/0   96.55 MB   0.8     18.5     6.7     11.8      18.4      6.6       0.0   2.7     65.3     64.9    290.08            284.03       108    2.686    284M  1957K       0.0       0.0
  L5     15/0   404.41 MB   1.0     19.1     7.7     11.4      18.8      7.4       0.3   2.5     66.6     65.7    292.93            285.34       220    1.332    293M  3808K       0.0       0.0
  L6    143/0    4.12 GB   0.0     45.0     7.5     37.5      41.6      4.1       0.0   5.5     71.2     65.9    647.00            632.66       251    2.578    739M    47M       0.0       0.0
 Sum    163/0    4.64 GB   0.0     82.6    21.9     60.7      87.2     26.5       0.3  10.4     61.9     65.4   1365.58           1312.97      1123    1.216   1318M    52M       0.0       0.0
```

Compaction stats from main:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     60.5    142.12            115.89       569    0.250       0      0       0.0       0.0
  L4      3/0   85.68 MB   1.0     17.7     6.8     10.9      17.6      6.7       0.0   2.6     62.7     62.3    289.05            281.79       112    2.581    272M  2309K       0.0       0.0
  L5     11/0   293.73 MB   1.0     18.8     7.5     11.2      18.5      7.2       0.5   2.5     64.9     63.9    296.07            288.50       220    1.346    288M  4365K       0.0       0.0
  L6    130/0    3.94 GB   0.0     51.5     7.6     43.9      47.9      3.9       0.0   6.3     67.2     62.4    784.95            765.92       258    3.042    848M    51M       0.0       0.0
 Sum    144/0    4.31 GB   0.0     88.0    21.9     66.0      92.3     26.3       0.5  11.0     59.6     62.5   1512.19           1452.09      1159    1.305   1409M    58M       0.0       0.0```

Reviewed By: ajkr

Differential Revision: D39834713

Pulled By: cbi42

fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b
2022-12-29 13:28:24 -08:00
Changyu Bi 53b703eafe Fix an assertion failure in CompactionOutputs::AddRangeDels() (#11040)
Summary:
the [assertion](c3f720c60d/db/compaction/compaction_outputs.cc (L643)) in `CompactionOutputs::AddRangeDels()` can fail after https://github.com/facebook/rocksdb/pull/10802. The assertion fails when `lower_bound_from_range_tombstone` is true during `AddRangeDels()` for a new compaction output file, while the lower bound range tombstone key has seqno 0 and op_type kTypeRangeDeletion. It can have seqno 0 when it was truncated at a point key whose seqno was zeroed out during compaction, the seqno and op_type could be set [here](c3f720c60d/db/compaction/compaction_outputs.cc (L594)). This PR fixes the assertion excluding the case when `lower_bound_from_range_tombstone` is true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11040

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D42119914

Pulled By: cbi42

fbshipit-source-id: 0897e71b5304cb02aac30f71667b590c37b72baf
2022-12-19 16:36:39 -08:00
Changyu Bi f02c708aa3 Consider range tombstone in compaction output file cutting (#10802)
Summary:
This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802

Test Plan:
- added unit tests in db_range_del_test.
- stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000`

Reviewed By: ajkr, jay-zhuang

Differential Revision: D40308827

Pulled By: cbi42

fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
2022-12-15 09:11:54 -08:00
Hui Xiao 98d5db5c2e Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
-  File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
    - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number`  ordering check. This will result in file ordering s1 <  s2 <  s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file  is assigned with the minimum `epoch_number` among input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or  by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run 36a5686ec0 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 13:29:37 -08:00
Jay Zhuang 1078d860a9 Add an unittest for Periodic compaction conflict with ongoing compaction (#10908)
Summary:
Add a tiered storage migration test which would conflict with
an ongoing penultimate level compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10908

Test Plan: Test only change

Reviewed By: anand1976

Differential Revision: D40864509

Pulled By: ajkr

fbshipit-source-id: e316e849a01a6c71a41be130101f909b6c0498cb
2022-12-12 10:37:55 -08:00
Hui Xiao f1574a20ff Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999)
Summary:
**Context/Summary:**

This reverts commit fc74abb436 and related HISTORY record.

The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF.  This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004

Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D41572604

Pulled By: hx235

fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
2022-11-29 10:56:42 -08:00
Hui Xiao d8c043f7ad Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955)
Summary:
**Context**

https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect.

As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors  `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound.
Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178
Column family: default, op_logs: S 00000000000023A90000000000000003787878`

**Summary**:
- Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`
- Add some helpful log to LOG file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955

Test Plan:
- Errors repro-ed by
```
./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20
```
is gone after this fix
- CI

Reviewed By: ajkr

Differential Revision: D41319441

Pulled By: hx235

fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709
2022-11-28 15:45:03 -08:00
Andrew Kryczka 54c2542df2 Support tiering when file endpoints overlap (#10961)
Summary:
Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic.

Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961

Test Plan: updated unit test

Reviewed By: cbi42

Differential Revision: D41370615

Pulled By: ajkr

fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572
2022-11-23 09:20:58 -08:00
Changyu Bi 6c5ec92070 Set correct temperature for range tombstone only file in penultimate level (#10972)
Summary:
before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972

Test Plan: added unit test for this scenario.

Reviewed By: ajkr

Differential Revision: D41449215

Pulled By: cbi42

fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c
2022-11-21 17:08:50 -08:00
Andrew Kryczka 097f9f4425 Fix CompactionIterator flag for penultimate level output (#10967)
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967

Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
  - Before this fix, Y was always zero
  - After this fix, Y gradually increased throughout the benchmark

Reviewed By: riversand963

Differential Revision: D41417726

Pulled By: ajkr

fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
2022-11-21 16:14:03 -08:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Jay Zhuang b36ec37a4b clang-format for db/compaction (#10882)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10882

Reviewed By: riversand963

Differential Revision: D40724867

Pulled By: jay-zhuang

fbshipit-source-id: 7f387724f8cd07d8d2b90566a515a4e9078d21f1
2022-10-26 12:35:12 -07:00
Hui Xiao fc74abb436 Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777)
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress

./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35

put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```

**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
-  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
   - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
   - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
  - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
  - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
    - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930  in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.

Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777

Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761  and on FIFO compaction only

Reviewed By: ajkr

Differential Revision: D40090485

Pulled By: hx235

fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
2022-10-25 10:39:58 -07:00
Jay Zhuang f726d29a82 Allow penultimate level output for the last level only compaction (#10822)
Summary:
Allow the last level only compaction able to output result to penultimate level if the penultimate level is empty. Which will also block the other compaction output to the penultimate level.
(it includes the PR https://github.com/facebook/rocksdb/issues/10829)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822

Reviewed By: siying

Differential Revision: D40389180

Pulled By: jay-zhuang

fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad
2022-10-22 08:57:38 -07:00
Changyu Bi 333abe9c55 Ignore max_compaction_bytes for compaction input that are within output key-range (#10835)
Summary:
When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835

Test Plan:
- CI, a unit test on max_compaction_bytes fails before turning this flag off.
- Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432`
```
main:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   91.59 MB   0.8     70.9     0.0     70.9     200.8    129.9       0.0   1.5     25.2     71.2   2886.55           2463.45      9725    0.297   1093M   254K       0.0       0.0
  L1      9/0   248.03 MB   1.0    392.0   129.8    262.2     391.7    129.5       0.0   3.0     69.0     68.9   5821.71           5536.90       804    7.241   6029M  5814K       0.0       0.0
  L2     87/0    2.50 GB   1.0    537.0   128.5    408.5     533.8    125.2       0.7   4.2     69.5     69.1   7912.24           7323.70      4417    1.791   8299M    36M       0.0       0.0
  L3    836/0   24.99 GB   1.0    616.9   118.3    498.7     594.5     95.8       5.2   5.0     66.9     64.5   9442.38           8490.28      4204    2.246   9749M   306M       0.0       0.0
  L4   2355/0   62.95 GB   0.3     67.3    37.1     30.2      54.2     24.0      38.9   1.5     72.2     58.2    954.37            821.18       917    1.041   1076M   173M       0.0       0.0
 Sum   3290/0   90.77 GB   0.0   1684.2   413.7   1270.5    1775.0    504.5      44.9  13.7     63.8     67.3  27017.25          24635.52     20067    1.346     26G   522M       0.0       0.0

Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds

This PR:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   45.71 MB   0.8     72.9     0.0     72.9     202.8    129.9       0.0   1.6     25.4     70.7   2938.16           2510.36      9741    0.302   1124M   265K       0.0       0.0
  L1      8/0   234.54 MB   0.9    384.5   129.8    254.7     384.2    129.6       0.0   3.0     69.0     68.9   5708.08           5424.43       791    7.216   5913M  5753K       0.0       0.0
  L2     84/0    2.47 GB   1.0    543.1   128.6    414.5     539.9    125.4       0.7   4.2     69.6     69.2   7989.31           7403.13      4418    1.808   8393M    36M       0.0       0.0
  L3    839/0   24.96 GB   1.0    615.6   118.4    497.2     593.2     96.0       5.1   5.0     66.6     64.1   9471.23           8489.31      4193    2.259   9726M   306M       0.0       0.0
  L4   2360/0   63.04 GB   0.3     67.6    37.3     30.3      54.4     24.1      38.9   1.5     71.5     57.6    967.30            827.99       907    1.066   1080M   173M       0.0       0.0
 Sum   3294/0   90.75 GB   0.0   1683.8   414.2   1269.6    1774.5    504.9      44.8  13.7     63.7     67.1  27074.08          24655.22     20050    1.350     26G   522M       0.0       0.0

Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds
```

Reviewed By: ajkr

Differential Revision: D40518319

Pulled By: cbi42

fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60
2022-10-21 10:22:41 -07:00
Jay Zhuang 1663f77d2a Fix no internal time recorded for small preclude_last_level (#10829)
Summary:
When the `preclude_last_level_data_seconds` or
`preserve_internal_time_seconds` is smaller than 100 (seconds), no seqno->time information was recorded.
Also make sure all data will be compacted to the last level even if there's no write to record the time information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10829

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D40443934

Pulled By: jay-zhuang

fbshipit-source-id: 2ecf1361daf9f3e5c3385aee6dc924fa59e2813a
2022-10-20 17:11:38 -07:00
Yueh-Hsuan Chiang e267909ecf Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348)
Summary:
FIFO compaction can theoretically open a DB with any compaction style.
However, the current code only allows FIFO compaction to open a DB with
a single level.

This PR relaxes the limitation of FIFO compaction and allows it to open a
DB with multiple levels.  Below is the read / write / compaction behavior:

* The read behavior is untouched, and it works like a regular rocksdb instance.
* The write behavior is untouched as well.  When a FIFO compacted DB
is opened with multiple levels, all new files will still be in level 0, and no files
will be moved to a different level.
* Compaction logic is extended.  It will first identify the bottom-most non-empty level.
Then, it will delete the oldest file in that level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348

Test Plan:
Added a new test to verify the migration from level to FIFO where the db has multiple levels.
Extended existing test cases in db_test and db_basic_test to also verify
all entries of a key after reopening the DB with FIFO compaction.

Reviewed By: jay-zhuang

Differential Revision: D40233744

fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
2022-10-18 14:38:13 -07:00
Peter Dillinger e466173d5c Print stack traces on frozen tests in CI (#10828)
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.

For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828

Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI

Reviewed By: hx235

Differential Revision: D40447634

Pulled By: pdillinger

fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
2022-10-18 00:35:35 -07:00
Jay Zhuang 5a5f21c489 Allow the last level data moving up to penultimate level (#10782)
Summary:
Lock the penultimate level for the whole compaction inputs range, so any
key in that compaction is safe to move up from the last level to
penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782

Reviewed By: siying

Differential Revision: D40231540

Pulled By: siying

fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e
2022-10-10 22:50:34 -07:00
Jay Zhuang c401f285c3 Add option preserve_internal_time_seconds to preserve the time info (#10747)
Summary:
Add option `preserve_internal_time_seconds` to preserve the internal
time information.
It's mostly for the migration of the existing data to tiered storage (
`preclude_last_level_data_seconds`). When the tiering feature is just
enabled, the existing data won't have the time information to decide if
it's hot or cold. Enabling this feature will start collect and preserve
the time information for the new data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747

Reviewed By: siying

Differential Revision: D39910141

Pulled By: siying

fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
2022-10-07 18:49:40 -07:00
Yanqin Jin 11943e8b27 Exclude timestamp when checking compaction boundaries (#10787)
Summary:
When checking if a range [start, end) overlaps with a compaction whose range is [start1, end1), always exclude timestamp from start, end, start1 and end1, otherwise some versions of one user key may be compacted to bottommost layer while others remain in the original level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10787

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40187672

Pulled By: ltamasi

fbshipit-source-id: 81226267fd3e33ffa79665c62abadf2ebec45496
2022-10-07 14:11:23 -07:00
Jay Zhuang 23fa5b7789 Use sstableKeyCompare() for compaction output boundary check (#10763)
Summary:
To make it consistent with the compaction picker which uses the `sstableKeyCompare()` to pick the overlap files. For example, without this change, it may cut L1 files like:
```
 L1: [2-21]  [22-30]
 L2: [1-10] [21-30]
```
Because "21" on L1 is smaller than "21" on L2. But for compaction, these 2 files are overlapped.
`sstableKeyCompare()` also take range delete into consideration which may cut file for the same key.
It also makes the `max_compaction_bytes` calculation more accurate for cases like above, the overlapped bytes was under estimated. Also make sure the 2 keys won't be splitted to 2 files because of reaching `max_compaction_bytes`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10763

Reviewed By: cbi42

Differential Revision: D39971904

Pulled By: cbi42

fbshipit-source-id: bcc309e9c3dc61a8f50667a6f633e6132c0154a8
2022-10-06 15:54:58 -07:00
Jay Zhuang f007ad8b4f RoundRobin TTL compaction (#10725)
Summary:
For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725

Reviewed By: ajkr

Differential Revision: D39771069

Pulled By: jay-zhuang

fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd
2022-10-04 14:53:32 -07:00
Changyu Bi 9f2363f4c4 User-defined timestamp support for DeleteRange() (#10661)
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661

Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`.  Did not see consistent regression in no timestamp case.

| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|

Reviewed By: riversand963

Differential Revision: D39441192

Pulled By: cbi42

fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
2022-09-30 16:13:03 -07:00
Changyu Bi fd71a82f4f Use actual file size when checking max_compaction_size (#10728)
Summary:
currently, there are places in compaction_picker where we add up `compensated_file_size` of files being compacted and limit the sum to be under `max_compaction_bytes`. `compensated_file_size` contains booster for point tombstones and should be used only for determining file's compaction priority. This PR replaces `compensated_file_size` with actual file size in such places.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10728

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D39789427

Pulled By: cbi42

fbshipit-source-id: 1f89fb6c0159c53bf01d8dc783f465959f442c81
2022-09-30 10:50:44 -07:00
Jay Zhuang f3cc66632b Align compaction output file boundaries to the next level ones (#10655)
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.

In level compaction, there are "wasted" data at the beginning and end of the
output level files. Align the file boundary can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to its
next level ones. It may cut file when the file size is large enough (at least
50% of target_file_size) and not too large (2x target_file_size).

db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```

The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
          baseline      this_PR
count  1.656000e+03  1.705000e+03
mean   3.116062e+07  3.028076e+07
std    7.145242e+06  8.046139e+06
```

The feature is enabled by default, to revert to the old behavior disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`

Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for
use case like user adding 2 or more non-overlapping data range at the same
time, it can reduce the overlapping of 2 datasets in the lower levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655

Reviewed By: cbi42

Differential Revision: D39552321

Pulled By: jay-zhuang

fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
2022-09-29 19:43:55 -07:00
Jay Zhuang 92df36985d Deflake CompactionServiceTest.BasicCompactions (#10697)
Summary:
The background compaction may still running while the test end, which would cause ASAN stack-use-after-scope error.
Explicitly close the DB before test end.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10697

Test Plan:
able to reproduce with:
```
gtest-parallel ./compaction_service_test --gtest_filter=CompactionServiceTest.BasicCompactions -r 10000 -w 100
```

Reviewed By: gitbw95

Differential Revision: D39590974

Pulled By: jay-zhuang

fbshipit-source-id: da264b2e6a276afbda7d5ff7adb9d7b8d4213d90
2022-09-19 14:10:05 -07:00
Jay Zhuang 849cf1bf68 Refactor Compaction file cut ShouldStopBefore() (#10629)
Summary:
Consolidate compaction output cut logic to `ShouldStopBefore()` and move
it inside of CompactionOutputs class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10629

Reviewed By: cbi42

Differential Revision: D39315536

Pulled By: jay-zhuang

fbshipit-source-id: 7d81037babbd35c276bbaad02dbc2bb555fdac18
2022-09-14 22:09:12 -07:00
Yanqin Jin ce2c11d848 Fix a bug by setting up subcompaction bounds properly (#10658)
Summary:
When user-defined timestamp is enabled, subcompaction bounds should be set up properly. When creating InputIterator for the compaction, the `start` and `end` should have their timestamp portions set to kMaxTimestamp, which is the highest possible timestamp. This is similar to what we do with setting up their sequence numbers to `kMaxSequenceNumber`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10658

Test Plan:
```bash
make check
rm -rf /dev/shm/rocksdb/* && mkdir
/dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress
--allow_data_in_errors=True --clear_column_family_one_in=0
--continuous_verification_interval=0 --data_block_index_type=1
--db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
--delrangepercent=0
--expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
--iterpercent=0 --max_background_compactions=20
--max_bytes_for_level_base=10485760 --max_key=25000000
--max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
--ops_per_thread=300000 --paranoid_file_checks=1 --partition_filters=0
--prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
--snapshot_hold_ops=100000 --subcompactions=4
--target_file_size_base=65536 --target_file_size_multiplier=2
--test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
--user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
--write_buffer_size=65536 --writepercent=60 -disable_wal=1
-column_families=1
```

Reviewed By: akankshamahajan15

Differential Revision: D39393402

Pulled By: riversand963

fbshipit-source-id: f276e35b19fce51a175c368a502fb0718d1f3871
2022-09-14 21:59:56 -07:00
Yanqin Jin 3d67d79154 Fix overlapping check by excluding timestamp (#10615)
Summary:
With user-defined timestamp, checking overlapping should exclude
timestamp part from key. This has already been done for range checking
for files in sstableKeyCompare(), but not yet done when checking with
concurrent compactions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615

Test Plan:
(Will add more tests)

make check
(Repro seems easier with this commit sha: git checkout 78bbdef530)
rm -rf /dev/shm/rocksdb/* &&
mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected &&
./db_stress
--allow_data_in_errors=True --clear_column_family_one_in=0
--continuous_verification_interval=0 --data_block_index_type=1
--db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
--delrangepercent=0
--expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
--iterpercent=0 --max_background_compactions=20
--max_bytes_for_level_base=10485760 --max_key=25000000
--max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
--ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0
--prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
--snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3
--target_file_size_base=65536 --target_file_size_multiplier=2
--test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
--user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
--write_buffer_size=65536 --writepercent=60 -disable_wal=1

Reviewed By: akankshamahajan15

Differential Revision: D39146797

Pulled By: riversand963

fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 13:03:07 -07:00
Peter Dillinger 6de7081cf3 Always verify SST unique IDs on SST file open (#10532)
Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in and with an extra pass at DB::Open
time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
check unique ID against manifest every time an SST file is opened through
table cache (normal DB operations), replacing the explicit pass over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.

One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.

Implementation details:
* `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
that is now removed.
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) default checking on, and
(b) checking on SST open even without DB::Open (e.g. on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`

Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of NewTableReader was recently added, in
the public API nonetheless (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532

Test Plan:
updated unit tests

Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec

Reviewed By: jay-zhuang

Differential Revision: D38765551

Pulled By: pdillinger

fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
2022-09-07 22:52:42 -07:00
Hui Xiao 8a85946f58 Add missing mutex when reading from shared variable bg_bottom_compaction_scheduled_, bg_compaction_scheduled_ (#10610)
Summary:
**Context/Summary:**
According to https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_job.h#L328-L332, any reading in the form of `*bg_compaction_scheduled_` , `*bg_bottom_compaction_scheduled_` should be protected by mutex, which isn't the case for some assert statement. This leads to a data race that can be repro-ed by the following command (command coming soon)

```
db=/dev/shm/rocksdb_crashtest_blackbox
exp=/dev/shm/rocksdb_crashtest_expected
rm -rf $db $exp
mkdir -p $exp

./db_stress --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=1 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --value_size_mult=32 --writepercent=90  --compaction_pri=4 --use_txn=1 --level_compaction_dynamic_level_bytes=True  --compaction_ttl=0  --compact_files_one_in=1000000 --compact_range_one_in=1000000 --value_size_mult=32 --verify_db_one_in=1000  --write_buffer_size=65536 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_key=25000000 --max_key_len=3 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --target_file_size_base=2097152 --target_file_size_multiplier=2
```
```
WARNING: ThreadSanitizer: data race (pid=73424)
  Read of size 4 at 0x7b8c0000151c by thread T13:
    #0 ReleaseSubcompactionResources internal_repo_rocksdb/repo/db/compaction/compaction_job.cc:390 (db_stress+0x630aa3)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::CompactionJob::Run() internal_repo_rocksdb/repo/db/compaction/compaction_job.cc:741 (db_stress+0x630aa3)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:3436 (db_stress+0x60b2cc)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2950 (db_stress+0x606d79)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::BGWorkCompaction(void*) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2693 (db_stress+0x60356a)

  Previous write of size 4 at 0x7b8c0000151c by thread T12 (mutexes: write M438955329917552448):
    #0 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:3018 (db_stress+0x6072a1)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BGWorkCompaction(void*) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2693 (db_stress+0x60356a)

Location is heap block of size 6720 at 0x7b8c00000000 allocated by main thread:
    #0 operator new(unsigned long, std::align_val_t) <null> (db_stress+0xbab5bb)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) internal_repo_rocksdb/repo/db/db_impl/db_impl_open.cc:1811 (db_stress+0x69769a)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::TransactionDB::Open(rocksdb::DBOptions const&, rocksdb::TransactionDBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::TransactionDB**) internal_repo_rocksdb/repo/utilities/transactions/pessimistic_transaction_db.cc:258 (db_stress+0x8ae1f4)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::StressTest::Open(rocksdb::SharedState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2611 (db_stress+0x32b927)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::StressTest::InitDb(rocksdb::SharedState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:290 (db_stress+0x34712c)
```
This PR added all the missing mutex that should've been in place

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10610

Test Plan:
- Past repro command
- Existing CI

Reviewed By: riversand963

Differential Revision: D39143016

Pulled By: hx235

fbshipit-source-id: 51dd4db55ad306f3dbda5d0dd54d6f2513cf70f2
2022-08-30 16:24:01 -07:00
Hui Xiao e484b81eee Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573)
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.

This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.

The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped  (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```

```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```

**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
    -  Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
    - `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573

Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr  "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.

Reviewed By: ajkr

Differential Revision: D39005886

Pulled By: hx235

fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
2022-08-29 17:35:21 -07:00
Chen Lixiang 9593fd1c82 Fix wrong compression type and options in universal compaction picker (#10515)
Summary:
In UniversalCompactionBuilder::PickCompactionToReduceSortedRuns, we passed start_level to get compression type and options. I think that is wrong and we should use output_level instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10515

Reviewed By: hx235

Differential Revision: D38611335

Pulled By: ajkr

fbshipit-source-id: bb860caed4b6c6bbde8f75fc50cf875a9f04723d
2022-08-23 14:58:02 -07:00
Changyu Bi fd165c869d Add memtable per key-value checksum (#10281)
Summary:
Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281

Test Plan:
- Added new unit test cases: `make check`
- Benchmark on memtable insert
```
TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

# avg over 10 runs
Baseline: 1166936 ops/sec
memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%)
memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%)
write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%)
```
-  Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum.
ops/sec for `readrandom` dropped ~6.8%.
```
# Readseq
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

readseq [AVG    20 runs] : 7432840 (± 212005) ops/sec;  822.3 (± 23.5) MB/sec
readseq [MEDIAN 20 runs] : 7573878 ops/sec;  837.9 MB/sec

With -memtable_protection_bytes_per_key=2:

readseq [AVG    20 runs] : 5134607 (± 119596) ops/sec;  568.0 (± 13.2) MB/sec
readseq [MEDIAN 20 runs] : 5232946 ops/sec;  578.9 MB/sec

# Readrandom
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100
readrandom [AVG    10 runs] : 140236 (± 3938) ops/sec;    9.8 (± 0.3) MB/sec
readrandom [MEDIAN 10 runs] : 140545 ops/sec;    9.8 MB/sec

With -memtable_protection_bytes_per_key=2:
readrandom [AVG    10 runs] : 130632 (± 2738) ops/sec;    9.1 (± 0.2) MB/sec
readrandom [MEDIAN 10 runs] : 130341 ops/sec;    9.1 MB/sec
```

- Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800`

Reviewed By: ajkr

Differential Revision: D37607896

Pulled By: cbi42

fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113
2022-08-12 13:51:32 -07:00
Peter Dillinger 9fa5c146d7 LOG more info on oldest snapshot and sequence numbers (#10454)
Summary:
The info LOG file does not currently give any direct
information about the existence of old, live snapshots, nor how to
estimate wall time from a sequence number within the scope of LOG
history. This change addresses both with:
* Logging smallest and largest seqnos for generated SST files, which can
help associate sequence numbers with write time (based on flushes).
* Logging oldest_snapshot_seqno for each compaction, which (along with
that seqno info) helps us to determine how much old data might be kept
around for old (leaked?) snapshots. Including the date here I thought might
be excessive.

I wanted to log the date and seqno of the oldest snapshot with periodic
stats, but the current structure of the code doesn't really support that
because `DumpDBStats` doesn't have access to the DB object.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10454

Test Plan:
manual inspect LOG from
`KEEP_DB=1 ./db_basic_test --gtest_filter=*CompactBetweenSnapshots*`

Reviewed By: ajkr

Differential Revision: D38326948

Pulled By: pdillinger

fbshipit-source-id: 294918ffc04a419844146cd826045321b4d5c038
2022-08-12 13:08:50 -07:00
sherriiiliu 4753e5a2e9 Fix wrong value passed in compaction filter in BlobDB (#10391)
Summary:
New blobdb has a bug in compaction filter, where `blob_value_` is not reset for next iterated key. This will cause blob_value_ not empty and previous value read from blob is passed into the filter function for next key, even if its value is not in blob. Fixed by reseting regardless of key type.

Test Case:
Add `FilterByValueLength` test case in `DBBlobCompactionTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391

Reviewed By: riversand963

Differential Revision: D38629900

Pulled By: ltamasi

fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa
2022-08-11 13:55:28 -07:00
Yanqin Jin fee2c472d0 Include minimal contextual information in CompactionIterator (#10505)
Summary:
The main purpose is to make debugging easier without sacrificing performance.

Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to an `uint8_t`,
using the LSB to denote if the compaction iterator is valid and 4 additional bits to denote where
the iterator is set valid inside `NextFromInput()`. Therefore, when the control flow reaches
`PrepareOutput()` and hits assertion there, we can have a better idea of what has gone wrong.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505

Test Plan:
make check
```
TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush
```
The above command has a 'flush' benchmark which uses `CompactionIterator`.  I haven't observed any CPU regression or drop in throughput or latency increase.

Reviewed By: ltamasi

Differential Revision: D38551615

Pulled By: riversand963

fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0
2022-08-09 17:07:24 -07:00
Jay Zhuang 3f763763aa Change bottommost_temperture to last_level_temperture (#10471)
Summary:
Change tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purpose only,
which is behaving the same as `last_level_temperture` and it will be removed in
the next release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471

Test Plan: CI

Reviewed By: siying

Differential Revision: D38450621

Pulled By: jay-zhuang

fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
2022-08-08 14:36:34 -07:00
Jay Zhuang 375534752a Improve universal compaction picker for tiered compaction (#10467)
Summary:
Current universal compaction picker may cause extra size amplification
compaction if there're more hot data on penultimate level. Improve the picker
to skip the last level for size amp calculation if tiered compaction is
enabled, which can
1. avoid extra unnecessary size amp compaction;
2. typically cold tier (the last level) is not size constrained, so skip size
   amp for cold tier is intended;

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467

Test Plan: CI and added unittest

Reviewed By: siying

Differential Revision: D38391350

Pulled By: jay-zhuang

fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2
2022-08-08 14:34:36 -07:00
Levi Tamasi cc8ded6152 Do not put blobs read during compaction into cache (#10457)
Summary:
During compaction, blobs are currently read using the default
`ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
this didn't make any difference since we didn't have a blob cache;
however, now we have to explicitly set this flag to false to avoid
polluting the cache during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38333528

Pulled By: ltamasi

fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2
2022-08-01 19:49:05 -07:00
Yanqin Jin fbfcf5cbcd Remove unused fields from FileMetaData (temporarily) (#10443)
Summary:
FileMetaData::[min|max]_timestamp are not currently being used or
tracked by RocksDB, even when user-defined timestamp is enabled. Each of
them is a std::string which can occupy 32 bytes. Remove them for now.
They may be added back when we have a pressing need for them. When we do
add them back, consider store them in a more compact way, e.g. one
boolean flag and a byte array of size 16.

Per file min/max timestamp bounds are available as table properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D38292275

Pulled By: riversand963

fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94
2022-08-01 17:56:13 -07:00
Akanksha Mahajan 56463d443d Provide support for subcompactions with user-defined timestamps (#10344)
Summary:
The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues.
Issue1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contains timestamps which results in assertion failure as BlockBasedTableIterator expects bounds to be without timestamps. As result, because of wrong comparison end key is returned as user_key resulting in assertion failure.
Issue2: Since it might result in two keys that only differ by user timestamp getting processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correction issues.

This PR provide support to reenable subcompactions with user-defined timestamps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344

Test Plan:
Added new unit test
- Without fix for Issue1 unit test MultipleSubCompactions fails with error:
```
db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterat│
or::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Received signal 6 (Aborted)                                                                             │
#0   /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Aborted (core dumped)
```
Ran stress test
`make crash_test_with_ts -j32`

Reviewed By: riversand963

Differential Revision: D38220841

Pulled By: akankshamahajan15

fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb
2022-07-31 11:39:16 -07:00
Zichen Zhu c945a9a664 Allow sufficient subcompactions under round-robin compaction priority (#10422)
Summary:
Allow sufficient subcompactions can be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority.

Test Case:
Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422

Reviewed By: ajkr

Differential Revision: D38186545

Pulled By: littlepig2013

fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e
2022-07-26 20:37:34 -07:00
Zichen Zhu 8860fc902a Support subcmpct using reserved resources for round-robin priority (#10341)
Summary:
Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
 * Constraint 1: We can only pick consecutive files
   - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
   - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
 * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
 * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
 * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3

More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.

The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit.  The number of subcompactions for round-robin compaction priority is determined through the following steps:
* Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()`
* Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()`
* Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions)

More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341

Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`

Reviewed By: ajkr, hx235

Differential Revision: D37792644

Pulled By: littlepig2013

fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 11:12:44 -07:00
sdong 252bea405e Improve SubCompaction Partitioning (#10393)
Summary:
Unit tests still haven't been fixed. Also need to add more tests. But I ran some simple fillrandom db_bench and the partitioning feels reasonable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393

Test Plan:
1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable;
2. Add a new unit test to ApproximateKeyAnchors()
3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly.

Reviewed By: jay-zhuang

Differential Revision: D38043783

fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
2022-07-23 17:38:49 -07:00
LIU HU 8885b0537b Fix underflow in FIFOCompactionPicker (#10386)
Summary:
Fix https://github.com/facebook/rocksdb/issues/10133

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10386

Reviewed By: riversand963

Differential Revision: D38067265

Pulled By: ajkr

fbshipit-source-id: 3a99a98ac5d7ac37581b5b636fbfa7901563d834
2022-07-22 09:20:35 -07:00
Gang Liao ec4ebeff30 Support prepopulating/warming the blob cache (#10298)
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298

Reviewed By: ltamasi

Differential Revision: D37908743

Pulled By: gangliao

fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
2022-07-17 07:13:59 -07:00
Jay Zhuang faa0f9723c Tiered compaction: integrate Seqno time mapping with per key placement (#10370)
Summary:
Using the Sequence number to time mapping to decide if a key is hot or not in
compaction and place it in the corresponding level.

Note: the feature is not complete, level compaction will run indefinitely until
all penultimate level data is cold and small enough to not trigger compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10370

Test Plan:
CI
* Run basic db_bench for universal compaction manually

Reviewed By: siying

Differential Revision: D37892338

Pulled By: jay-zhuang

fbshipit-source-id: 792bbd91b1ccc2f62b5d14c53118434bcaac4bbe
2022-07-15 19:01:30 -07:00
Jay Zhuang a3acf2ef87 Add seqno to time mapping (#10338)
Summary:
Which will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, adding seqno to time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time in certain cadence. When
memtable flush, the mapping informaiton is stored in sstable property.
During compaction, the mapping information are merged and get the
approximate time of sequence number, which is used to determine if a key
is recently inserted or not and preclude it from the last level if it's
recently inserted (within the `preclude_last_level_data_seconds`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338

Test Plan: CI

Reviewed By: siying

Differential Revision: D37810187

Pulled By: jay-zhuang

fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
2022-07-14 21:49:34 -07:00
sdong c8b20d469d Make InternalKeyComparator not configurable (#10342)
Summary:
InternalKeyComparator is an internal class which is a simple wrapper of Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizeable. As a side effect, internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator will have an std::vector attached to it, which consumes memory and possible allocation overhead too.
We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342

Test Plan: Run existing CI tests and make sure it doesn't fail

Reviewed By: riversand963

Differential Revision: D37771351

fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
2022-07-14 10:09:31 -07:00
Jay Zhuang 6ce0b2ca34 Tiered Compaction: per key placement support (#9964)
Summary:
Support per_key_placement for last level compaction, which will
be used for tiered compaction.
* compaction iterator reports which level a key should output to;
* compaction get the output level information and check if it's safe to
  output the data to penultimate level;
* all compaction output files will be installed.
* extra internal compaction stats added for penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964

Test Plan:
* Unittest
* db_bench, no significate difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4
* microbench manual compaction no significate difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd
* run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m)

Reviewed By: ajkr

Differential Revision: D36249494

Pulled By: jay-zhuang

fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3
2022-07-13 20:54:49 -07:00
sdong 769b156e65 Remove customized naming from InternalKeyComparator (#10343)
Summary:
InternalKeyComparator is a thin wrapper around user comparator. Storing a string for name is relatively expensive to this small wrapper for both CPU and memory usage. Try to remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10343

Test Plan: Run existing tests

Reviewed By: ajkr

Differential Revision: D37772469

fbshipit-source-id: d2d106a8d022193058fd7f6b220108e3d94aca34
2022-07-12 13:30:35 -07:00
zczhu 8debfe2b21 Replace the output split key with its pointer in subcompaction (#10316)
Summary:
Earlier implementation of cutting the output files with a compact cursor under Round-Robin priority uses `Valid()` to determine if the `output_split_key` is valid in `ShouldStopBefore`. This contributes to excessive CPU computation, as pointed out by [this issue](https://github.com/facebook/rocksdb/issues/10315). In this PR, we change the type of `output_split_key` to be `InternalKey*` and set it as `nullptr` if it is not going to be used in `ShouldStopBefore`, `Valid()` condition checking can be avoided using that pointer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10316

Reviewed By: ajkr

Differential Revision: D37661492

Pulled By: littlepig2013

fbshipit-source-id: 66ff1105f3378e5573d3a126fdaff9bb23b5498f
2022-07-06 16:19:45 -07:00
sdong a9565ccb26 Try to trivial move more than one files (#10190)
Summary:
In leveled compaction, try to trivial move more than one files if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where appying compaction results is the bottleneck.

When pick up a file to compact and it doesn't have overlapping files in the next level, try to expand to the next file if there is still no overlapping.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190

Test Plan:
Add some unit tests.
For performance, Try to run
./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark.

Reviewed By: jay-zhuang

Differential Revision: D37230647

fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
2022-07-05 10:10:37 -07:00
sdong 4428c76181 Multi-File Trivial Move in L0->L1 (#10188)
Summary:
In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.
1. We always try to find L0->L1 trivial move from the oldest files. Keep including newer files, until adding a new file won't trigger a trivial move
2. Modify the trivial move condition so that this compaction would be tagged as trivial move.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188

Test Plan:
See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:

./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes

The throughput improved by about 50%. Stalling still happens though.

Reviewed By: jay-zhuang

Differential Revision: D37224743

fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
2022-06-30 18:04:23 -07:00
zczhu 4f51101d31 Remove compact cursor when split sub-compactions (#10289)
Summary:
In round-robin compaction priority, when splitting the compaction into sub-compactions, the earlier implementation takes into account the compact cursor to have full use of available sub-compactions. But this may result in unbalanced sub-compactions, so we remove this here.  The removal does not affect the cursor-based splitting mechanism within a sub-compaction, and thus the output files are still ensured to be split according to the cursor.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289

Reviewed By: ajkr

Differential Revision: D37559091

Pulled By: littlepig2013

fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc
2022-06-30 15:36:46 -07:00
Levi Tamasi c73d2a9d18 Add API for writing wide-column entities (#10242)
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37369748

Pulled By: ltamasi

fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
2022-06-25 15:30:47 -07:00
sdong 246d469750 Reduce overhead of SortFileByOverlappingRatio() (#10161)
Summary:
Currently SortFileByOverlappingRatio() is O(nlogn). It is usually OK but When there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compaction is only trivial move and this operation becomes the bottleneck and limit the total throughput. This commit makes SortFileByOverlappingRatio() only find the top 50 files based on score. 50 files are usually enough for the parallel compactions needed for the level, and in case it is not enough, we would fall back to random, which should be acceptable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161

Test Plan:
Run a fillseq that generates a lot of files, and observe throughput improved (although stall is not yet eliminated). The command ran:

TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000

The throughput improved by 11%.

Reviewed By: jay-zhuang

Differential Revision: D37129469

fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
2022-06-24 14:01:11 -07:00
zczhu 17a1d65e3a Cut output files at compaction cursors (#10227)
Summary:
The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level’s cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two – one entirely before the cursor and one entirely after the cursor. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move since the file will not be contaminated by older files. In such case, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227

Test Plan:
Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`

Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)

Reviewed By: jay-zhuang

Differential Revision: D37388088

Pulled By: littlepig2013

fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
2022-06-23 14:25:42 -07:00
zczhu 30141461f9 Add basic kRoundRobin compaction policy (#10107)
Summary:
Add `kRoundRobin` as a compaction priority. The implementation is as follows.

- Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`).
- After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function.
- Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107

Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`

Reviewed By: ajkr

Differential Revision: D37316037

Pulled By: littlepig2013

fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2022-06-21 11:56:53 -07:00
Gang Liao deff48bcef Add blob source to retrieve blobs in RocksDB (#10198)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB.  BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.

Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198

Reviewed By: ltamasi

Differential Revision: D37294735

Pulled By: gangliao

fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
2022-06-20 20:58:11 -07:00
Yanqin Jin 4d31d3c2ed Abort in dbg mode after logging (#10183)
Summary:
In CompactionIterator code, there are multiple places where the process
will abort in dbg mode before logging the error message describing the
cause. This PR changes only the logging behavior for compaction iterator so
that error message is written to LOG before the process aborts in debug
mode.

Also updated the triggering condition for an assertion for single delete with
user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183

Test Plan: make check

Reviewed By: akankshamahajan15

Differential Revision: D37190218

Pulled By: riversand963

fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009
2022-06-15 22:00:24 -07:00
gitbw95 5cbee1f609 Add unit test to verify that the dynamic priority can be passed from compaction to FS (#10088)
Summary:
**Summary:**
Add unit tests to verify that the dynamic priority can be passed from compaction to FS. Compaction reads&writes and other DB reads&writes share the same read&write paths to FSRandomAccessFile or FSWritableFile, so a MockTestFileSystem is added to replace the default filesystem from Env to intercept and verify the io_priority. To prepare the compaction input files, use the default filesystem from Env. To test the io priority of the compaction reads and writes, db_options_.fs is set as MockTestFileSystem.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10088

Test Plan: Add unit tests.

Reviewed By: anand1976

Differential Revision: D36882528

Pulled By: gitbw95

fbshipit-source-id: 120adc15801966f2b8c9fc45285f590a3fff96d1
2022-06-07 11:57:12 -07:00
zczhu 3ee6c9baec Consolidate manual_compaction_paused_ check (#10070)
Summary:
As pointed out by [https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422](https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422), check `manual_compaction_paused` and `manual_compaction_canceled` can be reduced by setting `*canceled` to be true in `DisableManualCompaction()` and `*canceled` to be false in the last time calling `EnableManualCompaction()`.

Changed Tests: The origin `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused` and the origin CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` as true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator the compaction has been canceled).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070

Test Plan: This change does not introduce new features, but some slight difference in compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test).

Reviewed By: ajkr

Differential Revision: D36949133

Pulled By: littlepig2013

fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821
2022-06-06 18:32:26 -07:00
Gang Liao e6432dfd4c Make it possible to enable blob files starting from a certain LSM tree level (#10077)
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.

In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077

Reviewed By: ltamasi

Differential Revision: D36884156

Pulled By: gangliao

fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
2022-06-02 20:04:33 -07:00
Gang Liao 3dc6ebaf74 Support specifying blob garbage collection parameters when CompactRange() (#10073)
Summary:
Garbage collection is generally controlled by the BlobDB configuration options `enable_blob_garbage_collection` and `blob_garbage_collection_age_cutoff`. However, there might be use cases where we would want to temporarily override these options while performing a manual compaction. (One use case would be doing a full key-space manual compaction with full=100% garbage collection age cutoff in order to minimize the space occupied by the database.) Our goal here is to make it possible to override the configured GC parameters when using the `CompactRange` API to perform manual compactions. This PR would involve:

- Extending the `CompactRangeOptions` structure so clients can both force-enable and force-disable GC, as well as use a different cutoff than what's currently configured
- Storing whether blob GC should actually be enabled during a certain manual compaction and the cutoff to use in the `Compaction` object (considering the above overrides) and passing it to `CompactionIterator` via `CompactionProxy`
- Updating the BlobDB wiki to document the new options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10073

Test Plan: Adding unit tests and adding the new options to the stress test tool.

Reviewed By: ltamasi

Differential Revision: D36848700

Pulled By: gangliao

fbshipit-source-id: c878ef101d1c612429999f513453c319f75d78e9
2022-06-01 19:40:26 -07:00
Jay Zhuang fd24e4479b Fix failed VerifySstUniqueIds unittests (#10043)
Summary:
which should use UniqueId64x2 instead of string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10043

Test Plan: unittest

Reviewed By: pdillinger

Differential Revision: D36620366

Pulled By: jay-zhuang

fbshipit-source-id: cf937a1da362018472fa4396848225e48893848b
2022-05-24 09:00:06 -07:00
sdong bea5831bff Move three info logging within DB Mutex to use log buffer (#10029)
Summary:
info logging with DB Mutex could potentially invoke I/O and cause performance issues. Move three of the cases to use log buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10029

Test Plan: Run existing tests.

Reviewed By: jay-zhuang

Differential Revision: D36561694

fbshipit-source-id: cabb93fea299001a6b4c2802fcba3fde27fa062c
2022-05-23 10:09:37 -07:00
Jay Zhuang c6d326d3d7 Track SST unique id in MANIFEST and verify (#9990)
Summary:
Start tracking SST unique id in MANIFEST, which is used to verify with
SST properties to make sure the SST file is not overwritten or
misplaced. A DB option `try_verify_sst_unique_id` is introduced to
enable/disable the verification, if enabled, it opens all SST files
during DB-open to read the unique_id from table properties (default is
false), so it's recommended to use it with `max_open_files = -1` to
pre-open the files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990

Test Plan: unittests, format-compatible test, mini-crash

Reviewed By: anand1976

Differential Revision: D36381863

Pulled By: jay-zhuang

fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
2022-05-19 11:04:21 -07:00
gitbw95 4da34b97ee Set Read rate limiter priority dynamically and pass it to FS (#9996)
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.

### Solution
User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile.

**This PR is for the Read path.** The **Read:** dynamic priority for different state are listed as follows:

| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
|  Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER |
|  Compaction | IO_LOW  | IO_USER | IO_USER |
|  User | User provided | User provided | User provided |

We will respect the read_options that the user provided and will not set it.
The only sst read for Flush is the verification read in BuildTable(). It claims to be "regard as user read".

**Details**
1. Set read_options.rate_limiter_priority dynamically:
- User: Do not update the read_options. Use the read_options that the user provided.
- Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction().
- Flush: Update read_options in BuildTable().

2. Pass the rate limiter priority to FSRandomAccessFile functions:
- After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions,  including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock().
- In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch().
- Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator().

### Test Plans
Add unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996

Reviewed By: anand1976

Differential Revision: D36452483

Pulled By: gitbw95

fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf
2022-05-18 19:41:44 -07:00
gitbw95 05c678e135 Set Write rate limiter priority dynamically and pass it to FS (#9988)
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.

From the RocksDB perspective, there can be two kinds of rate limiters, the internal (native) one and the external one.
- The internal (native) rate limiter is introduced in [the wiki](https://github.com/facebook/rocksdb/wiki/Rate-Limiter). Currently, only IO_LOW and IO_HIGH are used and they are set statically.
- For the external rate limiter, in FSWritableFile functions,  IOOptions is open for end users to set and get rate_limiter_priority for their own rate limiter. Currently, RocksDB doesn’t pass the rate_limiter_priority through IOOptions to the file system.

### Solution
During the User Read, Flush write, Compaction read/write, the WriteController is used to determine whether DB writes are stalled or slowed down. The rate limiter priority (Env::IOPriority) can be determined accordingly. We decided to always pass the priority in IOOptions. What the file system does with it should be a contract between the user and the file system. We would like to set the rate limiter priority at file level, since the Flush/Compaction job level may be too coarse with multiple files and block IO level is too granular.

**This PR is for the Write path.** The **Write:** dynamic priority for different state are listed as follows:

| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
|  Flush | IO_HIGH | IO_USER | IO_USER |
|  Compaction | IO_LOW | IO_USER | IO_USER |

Flush and Compaction writes share the same call path through BlockBaseTableWriter, WritableFileWriter, and FSWritableFile. When a new FSWritableFile object is created, its io_priority_ can be set dynamically based on the state of the WriteController. In WritableFileWriter, before the call sites of FSWritableFile functions, WritableFileWriter::DecideRateLimiterPriority() determines the rate_limiter_priority. The options (IOOptions) argument of FSWritableFile functions will be updated with the rate_limiter_priority.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9988

Test Plan: Add unit tests.

Reviewed By: anand1976

Differential Revision: D36395159

Pulled By: gitbw95

fbshipit-source-id: a7c82fc29759139a1a07ec46c37dbf7e753474cf
2022-05-18 00:41:41 -07:00
Jay Zhuang b84e3363f5 Add table_properties_collector_factories override (#9995)
Summary:
Add table_properties_collector_factories override on the remote
side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9995

Test Plan: unittest added

Reviewed By: ajkr

Differential Revision: D36392623

Pulled By: jay-zhuang

fbshipit-source-id: 3ba031294d90247ca063d7de7b43178d38e3f66a
2022-05-17 20:57:51 -07:00
Yanqin Jin 3f263ef536 Add a temporary option for user to opt-out enforcement of SingleDelete contract (#9983)
Summary:
PR https://github.com/facebook/rocksdb/issues/9888 started to enforce the contract of single delete described in https://github.com/facebook/rocksdb/wiki/Single-Delete.

For some of existing use cases, it is desirable to have a transition during which compaction will not fail
if the contract is violated. Therefore, we add a temporary option `enforce_single_del_contracts` to allow
application to opt out from this new strict behavior. Once transition completes, the flag can be set to `true` again.

In a future release, the option will be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9983

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D36333672

Pulled By: riversand963

fbshipit-source-id: dcb703ea0ed08076a1422f1bfb9914afe3c2caa2
2022-05-16 15:44:59 -07:00
mrambacher bfc6a8ee4a Option type info functions (#9411)
Summary:
Add methods to set the various functions (Parse, Serialize, Equals) to the OptionTypeInfo.  These methods simplify the number of constructors required for OptionTypeInfo and make the code a little clearer.

Add functions to the OptionTypeInfo for Prepare and Validate.  These methods allow types other than Configurable and Customizable to have Prepare and Validate logic.  These methods could be used by an option to guarantee that its settings were in a range or that a value was initialized.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9411

Reviewed By: pdillinger

Differential Revision: D36174849

Pulled By: mrambacher

fbshipit-source-id: 72517d8c6bab4723788a4c1a9e16590bff870125
2022-05-13 04:57:08 -07:00
sdong 736a7b5433 Remove own ToString() (#9955)
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955

Test Plan: Watch CI tests

Reviewed By: riversand963

Differential Revision: D36176799

fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
2022-05-06 13:03:58 -07:00
sdong 49628c9a83 Use std::numeric_limits<> (#9954)
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954

Test Plan: See CI Runs.

Reviewed By: riversand963

Differential Revision: D36173954

fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
2022-05-05 13:08:21 -07:00
Yanqin Jin 9d634dd5b6 Rename kRemoveWithSingleDelete to kPurge (#9951)
Summary:
PR 9929 adds a new CompactionFilter::Decision, i.e.
kRemoveWithSingleDelete so that CompactionFilter can indicate to
CompactionIterator that a PUT can only be removed with SD. However, how
CompactionIterator handles such a key is implementation detail which
should not be implied in the public API. In fact,
such a PUT can just be dropped. This is an optimization which we will apply in the near future.

Discussion thread: https://github.com/facebook/rocksdb/pull/9929#discussion_r863198964

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9951

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D36156590

Pulled By: riversand963

fbshipit-source-id: 7b7d01f47bba4cad7d9cca6ca52984f27f88b372
2022-05-05 08:16:20 -07:00
Yanqin Jin 06394ff4e7 Fix a bug of CompactionIterator/CompactionFilter using Delete (#9929)
Summary:
When compaction filter determines that a key should be removed, it updates the internal key's type
to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction
together with `SingleDelete`, it will cause compaction iterator to return Corruption.

To fix the issue, compaction filter should return more information in addition to the intention of removing
a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
`kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.

In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.

Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
not quite work with `MultiOpsTxnsStressTest`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929

Test Plan:
make check
make crash_test
make crash_test_with_txn

Reviewed By: anand1976

Differential Revision: D36069678

Pulled By: riversand963

fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
2022-05-02 13:25:45 -07:00
Yanqin Jin 2b5c29f9f3 Enforce the contract of SingleDelete (#9888)
Summary:
Enforce the contract of SingleDelete so that they are not mixed with
Delete for the same key. Otherwise, it will lead to undefined behavior.
See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes.

Also fix unit tests and write-unprepared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D35837817

Pulled By: riversand963

fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635
2022-04-28 14:48:27 -07:00
Levi Tamasi db536ee045 Propagate errors from UpdateBoundaries (#9851)
Summary:
In `FileMetaData`, we keep track of the lowest-numbered blob file
referenced by the SST file in question for the purposes of BlobDB's
garbage collection in the `oldest_blob_file_number` field, which is
updated in `UpdateBoundaries`. However, with the current code,
`BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
in this method. The patch changes this by propagating these errors
and failing the corresponding flush/compaction. (Note that since blob
references are generated by the BlobDB code and also parsed by
`CompactionIterator`, in reality this can only happen in the case of
memory corruption.)

This change necessitated updating some unit tests that involved
fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
`"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
Some were relying on the earlier behavior to simulate corruption; these
were replaced with `SyncPoint`-based test code that corrupts a valid
blob reference at read time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D35683671

Pulled By: ltamasi

fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
2022-04-15 20:25:48 -07:00
Yanqin Jin 0bd4dcde6b CompactionIterator sees consistent view of which keys are committed (#9830)
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**

`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:

```
inline bool KeyCommitted(SequenceNumber seq) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
}
```

With that being said, we focus on write-prepared/write-unprepared transactions.

A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.

Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes
committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone.

To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.

```
inline bool KeyCommitted(SequenceNumber sequence) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
             SnapshotCheckerResult::kInSnapshot;
}
```

As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble
for `CompactionIterator`s assertions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D35561162

Pulled By: riversand963

fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
2022-04-14 11:11:04 -07:00
Jay Zhuang dc1c90c4e3 Support canceling running RemoteCompaction on remote side (#9725)
Summary:
Add the ability to cancel remote compaction on the remote side by
setting `OpenAndCompactOptions.canceled` to true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9725

Test Plan: added unittest

Reviewed By: ajkr

Differential Revision: D35018800

Pulled By: jay-zhuang

fbshipit-source-id: be3652f9645e0347df429e42a5614d5a9b3a1ec4
2022-04-13 13:28:09 -07:00
Jay Zhuang f934a0af46 Add event listener support on remote compactor side (#9821)
Summary:
So the user is able to set event listener on the compactor
side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9821

Test Plan: unittest added

Reviewed By: ajkr

Differential Revision: D35485388

Pulled By: jay-zhuang

fbshipit-source-id: 669d8a3aaee012b75b940470306756c03ffa09b2
2022-04-12 17:25:36 -07:00
Yanqin Jin 0ad9ee30ce Remove dead code (#9825)
Summary:
Options `preserve_deletes` and `iter_start_seqnum` have been removed since 7.0.

This PR removes dead code related to these two removed options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9825

Test Plan: make check

Reviewed By: akankshamahajan15

Differential Revision: D35517950

Pulled By: riversand963

fbshipit-source-id: 86282ce5ec4087acb94a06a42a1b6d55b1715482
2022-04-11 10:26:55 -07:00
sdong e03f8a0c12 L0 Subcompaction to trim input files (#9802)
Summary:
When sub compaction is decided for L0->L1 compaction, most of the cases, all L0 files will be involved in all sub compactions. However, it is not always the case. When files are generally (but not strictly) inserted in sequential order, there can be a subset of L0 files invovled. Yet RocksDB always open all those L0 files, and build an iterator, read many of the files' first of last block with expensive readahead. We trim some input files to reduce overhead a little bit.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9802

Test Plan: Add a unit test to cover this case and manually validate the behavior while running the test.

Reviewed By: ajkr

Differential Revision: D35371031

fbshipit-source-id: 701ed7375b5cbe41672e93b38fe8a1503dad08b6
2022-04-06 18:19:19 -07:00
KNOEEE cb4d188a34 Fix a bug in PosixClock (#9695)
Summary:
Multiplier here should be 1e6 to get microseconds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9695

Reviewed By: ajkr

Differential Revision: D34897086

Pulled By: jay-zhuang

fbshipit-source-id: 9c1d0811ea740ba0a007edc2da199edbd000b88b
2022-03-21 16:11:02 -07:00
Jay Zhuang 4dff279b19 DisableManualCompaction may fail to cancel an unscheduled task (#9659)
Summary:
https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction.
make sure we only unschedule the task when it's scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659

Reviewed By: ajkr

Differential Revision: D34651820

Pulled By: jay-zhuang

fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
2022-03-12 20:07:04 -08:00
Jay Zhuang 09b0e8f2c7 Fix a timer crash caused by invalid memory management (#9656)
Summary:
Timer crash when multiple DB instances doing heavy DB open and close
operations concurrently. Which is caused by adding a timer task with
smaller timestamp than the current running task. Fix it by moving the
getting new task timestamp part within timer mutex protection.
And other fixes:
- Disallow adding duplicated function name to timer
- Fix a minor memory leak in timer when a running task is cancelled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656

Reviewed By: ajkr

Differential Revision: D34626296

Pulled By: jay-zhuang

fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
2022-03-12 11:45:56 -08:00
slk 95305c44a1 Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410)
Summary:
As disscussed in (https://github.com/facebook/rocksdb/issues/9223), Here added a new API  named DB::OpenAndTrimHistory, this API will open DB and trim data to the timestamp specofied by **trim_ts** (The data with newer timestamp than specified trim bound will be removed). This API should only be used at a timestamp-enabled db instance recovery.

And this PR implemented a new iterator named HistoryTrimmingIterator to support trimming history with a new API named DB::OpenAndTrimHistory. HistoryTrimmingIterator wrapped around the underlying InternalITerator such that keys whose timestamps newer than **trim_ts** should not be returned to the compaction iterator while **trim_ts** is not null.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410

Reviewed By: ltamasi

Differential Revision: D34410207

Pulled By: riversand963

fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236
2022-03-11 16:13:23 -08:00
Jay Zhuang 36aec94d85 compression_per_level should be used for flush and changeable (#9658)
Summary:
- Make `compression_per_level` dynamical changeable with `SetOptions`;
- Fix a bug that `compression_per_level` is not used for flush;

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D34700749

Pulled By: jay-zhuang

fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c
2022-03-07 18:06:19 -08:00
Bo Wang 67f071fade Fixes #9565 (#9586)
Summary:
[Compaction::IsTrivialMove](a2b9be42b6/db/compaction/compaction.cc (L318)) checks whether allow_trivial_move is set, and if so it returns the value of is_trivial_move_. The allow_trivial_move option is there for universal compaction. So when this is set and leveled compaction is enabled, then useful code that follows this block never gets a chance to run.

A check that [compaction_style == kCompactionStyleUniversal](320d9a8e8a/db/db_impl/db_impl_compaction_flush.cc (L1030)) should be added to avoid doing the wrong thing for leveled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9586

Test Plan:
To reproduce this:
First edit db/compaction/compaction.cc with
```
 diff --git a/db/compaction/compaction.cc b/db/compaction/compaction.cc
index 7ae50b91e..52dd489b1 100644
 --- a/db/compaction/compaction.cc
+++ b/db/compaction/compaction.cc
@@ -319,6 +319,8 @@ bool Compaction::IsTrivialMove() const {
   // input files are non overlapping
   if ((mutable_cf_options_.compaction_options_universal.allow_trivial_move) &&
       (output_level_ != 0)) {
+    printf("IsTrivialMove:: return %d because universal allow_trivial_move\n", (int) is_trivial_move_);
+    // abort();
     return is_trivial_move_;
   }
```

And then run
```
./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1641328309 --universal_allow_trivial_move=1
```
Example output with the debug code added

```
IsTrivialMove:: return 0 because universal allow_trivial_move
IsTrivialMove:: return 0 because universal allow_trivial_move
```

After this PR, the bug is fixed.

Reviewed By: ajkr

Differential Revision: D34350451

Pulled By: gitbw95

fbshipit-source-id: 3232005cc47c40a7e75d316cfc7960beb5bdff3a
2022-02-18 14:23:07 -08:00
Jay Zhuang 2fbc672732 Add temperature information to the event listener callbacks (#9591)
Summary:
RocksDB try to provide temperature information in the event
listener callbacks. The information is not guaranteed, as some operation
like backup won't have these information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9591

Test Plan: Added unittest

Reviewed By: siying, pdillinger

Differential Revision: D34309339

Pulled By: jay-zhuang

fbshipit-source-id: 4aca4f270f99fa49186d85d300da42594663d6d7
2022-02-18 11:23:18 -08:00
Jay Zhuang f092f0fa5d Add subcompaction event API (#9311)
Summary:
Add event callback for subcompaction and adds a sub_job_id to identify it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9311

Reviewed By: ajkr

Differential Revision: D33892707

Pulled By: jay-zhuang

fbshipit-source-id: 57b5e5e594d61b2112d480c18a79a36751f65a4e
2022-02-17 15:47:10 -08:00
Andrew Kryczka babe56ddba Add rate limiter priority to ReadOptions (#9424)
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.

There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424

Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
  - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
  - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`

Reviewed By: hx235

Differential Revision: D33747386

Pulled By: ajkr

fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
2022-02-16 23:18:14 -08:00
Jay Zhuang 31031c0210 Remove deprecated RemoteCompaction API (#9570)
Summary:
Remove deprecated remote compaction APIs
`CompactionService::Start()` and `CompactionService::WaitForComplete()`.
Please use `CompactionService::StartV2()`,
`CompactionService::WaitForCompleteV2()` instead, which provides the
same information plus extra data like priority, db_id, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9570

Test Plan: CI

Reviewed By: riversand963

Differential Revision: D34255969

Pulled By: jay-zhuang

fbshipit-source-id: c6376eccdd1123f1c42ab53771b5f65f8160c325
2022-02-16 13:25:28 -08:00
Peter Dillinger e24734f843 Use -Wno-invalid-offsetof instead of dangerous offset_of hack (#9563)
Summary:
After https://github.com/facebook/rocksdb/issues/9515 added a unique_ptr to Status, we see some
warnings-as-error in some internal builds like this:

```
stderr: rocksdb/src/db/compaction/compaction_job.cc:2839:7: error:
offset of on non-standard-layout type 'struct CompactionServiceResult'
[-Werror,-Winvalid-offsetof]
     {offsetof(struct CompactionServiceResult, status),
      ^                                        ~~~~~~
```

I see three potential solutions to resolving this:

* Expand our use of an idiom that works around the warning (see offset_of
functions removed in this change, inspired by
https://gist.github.com/graphitemaster/494f21190bb2c63c5516)  However,
this construction is invoking undefined behavior that assumes consistent
layout with no compiler-introduced indirection. A compiler incompatible
with our assumptions will likely compile the code and exhibit undefined
behavior.
* Migrate to something in place of offset, like a function mapping
CompactionServiceResult* to Status* (for the `status` field). This might
be required in the long term.
* **Selected:** Use our new C++17 dependency to use offsetof in a well-defined way
when the compiler allows it. From a comment on
https://gist.github.com/graphitemaster/494f21190bb2c63c5516:

> A final note: in C++17, offsetof is conditionally supported, which
> means that you can use it on any type (not just standard layout
> types) and the compiler will error if it can't compile it correctly.
> That appears to be the best option if you can live with C++17 and
> don't need constexpr support.

The C++17 semantics are confirmed on
https://en.cppreference.com/w/cpp/types/offsetof, so we can suppress the
warning as long as we accept that we might run into a compiler that
rejects the code, and at that point we will find a solution, such as
the more intrusive "migrate" solution above.

Although this is currently only showing in our buck build, it will
surely show up also with make and cmake, so I have updated those
configurations as well.

Also in the buck build, -Wno-expansion-to-defined does not appear to be
needed anymore (both current compiler configurations) so I
removed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9563

Test Plan: Tried out buck builds with both current compiler configurations

Reviewed By: riversand963

Differential Revision: D34220931

Pulled By: pdillinger

fbshipit-source-id: d39436008259bd1eaaa87c77be69fb2a5b559e1f
2022-02-15 09:19:19 -08:00
Levi Tamasi 320d9a8e8a Use a sorted vector instead of a map to store blob file metadata (#9526)
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.

In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526

Test Plan:
Ran `make check` and the crash test script for a while.

Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:

```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```

Final statistics before the patch:

```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```

With the patch:

```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```

Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.

Reviewed By: riversand963

Differential Revision: D34082728

Pulled By: ltamasi

fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 12:36:43 -08:00
satyajanga 036bbab6f7 Use the comparator from the sst file table properties in sst_dump_tool (#9491)
Summary:
We introduced a new Comparator for timestamp in user keys. In the sst_dump_tool by default we use BytewiseComparator to read sst files. This change allows us to read comparator_name from table properties in meta data block and use it to read.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9491

Test Plan:
added unittests for new functionality.
make check
![image](https://user-images.githubusercontent.com/4923556/152915444-28b88a1f-7b4e-47d0-815f-7011552bd9a2.png)
![image](https://user-images.githubusercontent.com/4923556/152916196-bea3d2a1-a3d5-4362-b911-036131b83e8d.png)

Reviewed By: riversand963

Differential Revision: D33993614

Pulled By: satyajanga

fbshipit-source-id: 4b5cf938e6d2cb3931d763bef5baccc900b8c536
2022-02-08 12:15:35 -08:00