Commit Graph

46 Commits

Author SHA1 Message Date
Jay Huh 2dab137182 Mark more files for periodic compaction during offpeak (#12031)
Summary:
- The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice.
- It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization.
- Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start.
- We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031

Test Plan:
Unit Tests added
- `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled
- `DBTestUniversalCompaction2::PeriodicCompaction` for Universal

Reviewed By: cbi42

Differential Revision: D50900292

Pulled By: jaykorean

fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d
2023-11-06 11:43:59 -08:00
Changyu Bi 520c64fd2e Add missing status check in ExternalSstFileIngestionJob and ImportColumnFamilyJob (#12042)
Summary:
.. and update some unit tests that failed with this change. See comment in ExternalSSTFileBasicTest.IngestFileWithCorruptedDataBlock for more explanation.

The missing status check is not caught by `ASSERT_STATUS_CHECKED=1` due to this line: 8505b26db1/table/block_based/block.h (L394). Will explore if we can remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12042

Test Plan: existing unit tests.

Reviewed By: ajkr

Differential Revision: D50994769

Pulled By: cbi42

fbshipit-source-id: c91615bccd6094a91634c50b98401d456cbb927b
2023-11-06 07:41:36 -08:00
914022466 2648e0a747 Fix a bug when ingest plaintable sst file (#11969)
Summary:
Plaintable doesn't support SeekToLast. And GetIngestedFileInfo is using SeekToLast without checking the validity.

We are using IngestExternalFile or CreateColumnFamilyWithImport with some sst file in PlainTable format . But after running for a while, compaction error often happens. Such as
![image](https://github.com/facebook/rocksdb/assets/13954644/b4fa49fc-73fc-49ce-96c6-f198a30800b8)

I simply add some std::cerr log to find why.
![image](https://github.com/facebook/rocksdb/assets/13954644/2cf1d5ff-48cc-4125-b917-87090f764fcd)
It shows that the smallest key is always equal to largest key.
![image](https://github.com/facebook/rocksdb/assets/13954644/6d43e978-0be0-4306-aae3-f9e4ae366395)
Then I found the root cause is that PlainTable do not support SeekToLast, so the smallest key is always the same with the largest

I try to write an unit test. But it's not easy to reproduce this error.
(This PR is similar to https://github.com/facebook/rocksdb/pull/11266. Sorry for open another PR)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11969

Reviewed By: ajkr

Differential Revision: D50933854

Pulled By: cbi42

fbshipit-source-id: 6c6af53c1388922cbabbe64ed3be1cdc58df5431
2023-11-02 13:45:37 -07:00
Jay Huh e230e4d248 Make OffpeakTimeInfo available in VersionSet (#12018)
Summary:
As mentioned in  https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR.

There were two ways to achieve what we want.
1.  Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`.
2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`.

We chose the latter as it involves smaller changes.

This change includes the following
- Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions`
- `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()`
- During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018

Test Plan:
- `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`.
- `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes`

Reviewed By: pdillinger

Differential Revision: D50723881

Pulled By: jaykorean

fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6
2023-10-27 15:56:48 -07:00
Changyu Bi d1ff401472 Delay bottommost level single file compactions (#11701)
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions.

* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701

Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.

Reviewed By: jaykorean, ajkr

Differential Revision: D48331564

Pulled By: cbi42

fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
2023-08-16 17:45:44 -07:00
Yu Zhang 7521478b43 Record the `persist_user_defined_timestamps` flag in manifest (#11515)
Summary:
Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for a SST file when it is created. And use the recorded flag when creating a table reader for the SST file. This flag's default value is true, it is only explicitly recorded if it's false.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515

Test Plan:
```
make all check
./version_edit_test
```

Reviewed By: ltamasi

Differential Revision: D46920386

Pulled By: jowlyzhang

fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26
2023-06-21 21:49:01 -07:00
mayue.fight fa878a0107 Support to create a CF by importing multiple non-overlapping CFs (#11378)
Summary:
The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317).
Flink uses rocksdb as the state backend,  all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances.
In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one.

This PR is mainly used to quickly merge multiple DBs into one.

We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys.

The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378

Reviewed By: ajkr

Differential Revision: D46413709

Pulled By: cbi42

fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb
2023-06-15 12:25:04 -07:00
Hui Xiao 8f763bdeab Record and use the tail size to prefetch table tail (#11406)
Summary:
**Context:**
We prefetch the tail part of a SST file (i.e, the blocks after data blocks till the end of the file) during each SST file open in hope to prefetch all the stuff at once ahead of time for later read e.g, footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky case (e.g,  small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug.  Therefore we decide to record the exact tail size and use it directly  to prefetch tail of the SST instead of relying heuristics.

**Summary:**
- Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()`
   - For backward compatibility, we fall back to TailPrefetchStats and last to simple heuristics that the tail size is a linear portion of the file size - see PR conversation for more.
- Make`tail_start_offset` part of the table properties and deduct tail size to record in manifest for external files (e.g, file ingestion, import CF) and db repair (with no access to manifest).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406

Test Plan:
1. New UT
2. db bench
Note: db bench on /tmp/ where direct read is supported is too slow to finish and the default pinning setting in db bench is not helpful to profile # sst read of Get. Therefore I hacked the following to obtain the following comparison.
```
 diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
 --- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                            &tail_prefetch_size);

   // Try file system prefetch
-  if (!file->use_direct_io() && !force_direct_prefetch) {
+  if (false && !file->use_direct_io() && !force_direct_prefetch) {
     if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
              .IsNotSupported()) {
       prefetch_buffer->reset(new FilePrefetchBuffer(
 diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
 --- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
           std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
     } else {
       BlockBasedTableOptions block_based_options;
+      block_based_options.metadata_cache_options.partition_pinning =
+      PinningTier::kAll;
       block_based_options.checksum =
           static_cast<ChecksumType>(FLAGS_checksum_type);
       if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (Use TailPrefetchStats for tail size + use seperate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```

(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```

As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR

3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.

Reviewed By: pdillinger

Differential Revision: D45413346

Pulled By: hx235

fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
2023-05-08 13:14:28 -07:00
Changyu Bi 62fc15f009 Block per key-value checksum (#11287)
Summary:
add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates

Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`

Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.

Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):

```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none

BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE

The readrandom ops/sec looks like the following:
Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
Main              240805     223604         198176           161653       139040
PR prot_bytes=0   238691     226693         200127           161082       141153
PR prot_bytes=1   214983     193199         178532           137013       108211
prot_bytes=1 vs    -10%        -15%          -10.8%          -15%        -23%
prot_bytes=0
```

The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287

Reviewed By: ajkr

Differential Revision: D43970708

Pulled By: cbi42

fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
2023-04-25 12:08:23 -07:00
Hui Xiao 151242ce46 Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.

**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
   - Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288

Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
     - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689

// PlainTable
Does not apply
```
- **Db bench 2: performance**

**Read**

SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`

**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none

rocksdb.sst.read.micros  COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`

Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`

Reviewed By: ajkr

Differential Revision: D44007011

Pulled By: hx235

fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 09:07:18 -07:00
mayue.fight 4d72f48e57 Fix the wrong calculation of largest_key in import_column_family_job (#11381)
Summary:
When calculating the largest_key in ImportColumnFamilyJob::GetIngestedFileInfo, only the first element of range_del_iter is calculated. If range_del_iter has multiple elements, the largest_key will be wrong

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11381

Reviewed By: cbi42

Differential Revision: D44981450

Pulled By: ajkr

fbshipit-source-id: 584bc7da86295568a96984d2951644f289e578c7
2023-04-15 10:33:23 -07:00
mayue.fight 9500d90d1b Fix serval bugs in ImportColumnFamilyTest (#11372)
Summary:
**Context/Summary:**
ASSERT_EQ will only verify the code of Status, but will not check the state message of Status.

- Assert by checking Status state in `ImportColumnFamilyTest`
- Forgot to set db_comparator_name when creating ExportImportFilesMetaData in `ImportColumnFamilyNegativeTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11372

Reviewed By: ajkr

Differential Revision: D45004343

Pulled By: cbi42

fbshipit-source-id: a13d45521df17ead3d6d4c1c1fe1e4c95397ce8b
2023-04-14 10:44:42 -07:00
Changyu Bi 9aa3b6f9ae Support range deletion tombstones in `CreateColumnFamilyWithImport` (#11252)
Summary:
CreateColumnFamilyWithImport() did not support range tombstones for two reasons:
1. it uses point keys of a input file to determine its boundary (smallest and largest internal key), which means range tombstones outside of the point key range will be effectively dropped.
2. it does not handle files with no point keys.

Also included a fix in external_sst_file_ingestion_job.cc where the blocks read in `GetIngestedFileInfo()` can be added to block cache now (issue fixed in https://github.com/facebook/rocksdb/pull/6429).

This PR adds support for exporting and importing column family with range tombstones. The main change is to add smallest internal key and largest internal key to `SstFileMetaData` that will be part of the output of `ExportColumnFamily()`. Then during `CreateColumnFamilyWithImport(...,const ExportImportFilesMetaData& metadata,...)`, file boundaries can be set from `metadata` directly. This is needed since when file boundaries are extended by range tombstones, sometimes they cannot be deduced from a file's content alone.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11252

Test Plan:
- added unit tests that fails before this change

Closes https://github.com/facebook/rocksdb/issues/11245

Reviewed By: ajkr

Differential Revision: D43577443

Pulled By: cbi42

fbshipit-source-id: 6bff78e583cc50c44854994dea0a8dd519398f2f
2023-03-13 11:06:59 -07:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Changyu Bi cc6f323705 Include estimated bytes deleted by range tombstones in compensated file size (#10734)
Summary:
compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734

Test Plan:
- Added unit tests.
- benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used:
```
./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000
```

In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size.

Compaction stats from this PR:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      2/0   31.54 MB   0.5      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     63.4    135.56            110.94       544    0.249       0      0       0.0       0.0
  L4      3/0   96.55 MB   0.8     18.5     6.7     11.8      18.4      6.6       0.0   2.7     65.3     64.9    290.08            284.03       108    2.686    284M  1957K       0.0       0.0
  L5     15/0   404.41 MB   1.0     19.1     7.7     11.4      18.8      7.4       0.3   2.5     66.6     65.7    292.93            285.34       220    1.332    293M  3808K       0.0       0.0
  L6    143/0    4.12 GB   0.0     45.0     7.5     37.5      41.6      4.1       0.0   5.5     71.2     65.9    647.00            632.66       251    2.578    739M    47M       0.0       0.0
 Sum    163/0    4.64 GB   0.0     82.6    21.9     60.7      87.2     26.5       0.3  10.4     61.9     65.4   1365.58           1312.97      1123    1.216   1318M    52M       0.0       0.0
```

Compaction stats from main:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     60.5    142.12            115.89       569    0.250       0      0       0.0       0.0
  L4      3/0   85.68 MB   1.0     17.7     6.8     10.9      17.6      6.7       0.0   2.6     62.7     62.3    289.05            281.79       112    2.581    272M  2309K       0.0       0.0
  L5     11/0   293.73 MB   1.0     18.8     7.5     11.2      18.5      7.2       0.5   2.5     64.9     63.9    296.07            288.50       220    1.346    288M  4365K       0.0       0.0
  L6    130/0    3.94 GB   0.0     51.5     7.6     43.9      47.9      3.9       0.0   6.3     67.2     62.4    784.95            765.92       258    3.042    848M    51M       0.0       0.0
 Sum    144/0    4.31 GB   0.0     88.0    21.9     66.0      92.3     26.3       0.5  11.0     59.6     62.5   1512.19           1452.09      1159    1.305   1409M    58M       0.0       0.0```

Reviewed By: ajkr

Differential Revision: D39834713

Pulled By: cbi42

fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b
2022-12-29 13:28:24 -08:00
Hui Xiao 98d5db5c2e Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
-  File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
    - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number`  ordering check. This will result in file ordering s1 <  s2 <  s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file  is assigned with the minimum `epoch_number` among input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or  by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run 36a5686ec0 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 13:29:37 -08:00
Andrew Kryczka 4d60cbc629 Use VersionBuilder for CF import ordering/validation (#11028)
Summary:
Besides the existing ordering and validation, more is coming to VersionBuilder/VersionStorageInfo, like migration of epoch_numbers from older RocksDB versions. We should start using those common classes for importing CFs, instead of duplicating their ordering, validation, and migration logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11028

Test Plan: rely on existing tests

Reviewed By: hx235

Differential Revision: D41865427

Pulled By: ajkr

fbshipit-source-id: 873f5cd87b8902a2380c3b71373ce0b0db3a0c50
2022-12-10 15:07:42 -08:00
Andrew Kryczka 5cf6ab6f31 Ran clang-format on db/ directory (#10910)
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910

Reviewed By: riversand963

Differential Revision: D40880683

Pulled By: ajkr

fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
2022-11-02 14:34:24 -07:00
sdong 9277569ba3 Add some missing headers (#10519)
Summary:
Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519

Reviewed By: jay-zhuang

Differential Revision: D38603291

fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
2022-08-11 12:45:50 -07:00
Yanqin Jin fbfcf5cbcd Remove unused fields from FileMetaData (temporarily) (#10443)
Summary:
FileMetaData::[min|max]_timestamp are not currently being used or
tracked by RocksDB, even when user-defined timestamp is enabled. Each of
them is a std::string which can occupy 32 bytes. Remove them for now.
They may be added back when we have a pressing need for them. When we do
add them back, consider store them in a more compact way, e.g. one
boolean flag and a byte array of size 16.

Per file min/max timestamp bounds are available as table properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D38292275

Pulled By: riversand963

fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94
2022-08-01 17:56:13 -07:00
Jay Zhuang c6d326d3d7 Track SST unique id in MANIFEST and verify (#9990)
Summary:
Start tracking SST unique id in MANIFEST, which is used to verify with
SST properties to make sure the SST file is not overwritten or
misplaced. A DB option `try_verify_sst_unique_id` is introduced to
enable/disable the verification, if enabled, it opens all SST files
during DB-open to read the unique_id from table properties (default is
false), so it's recommended to use it with `max_open_files = -1` to
pre-open the files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990

Test Plan: unittests, format-compatible test, mini-crash

Reviewed By: anand1976

Differential Revision: D36381863

Pulled By: jay-zhuang

fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
2022-05-19 11:04:21 -07:00
sdong a74f14b550 Log error message when LinkFile() is not supported when ingesting files (#10010)
Summary:
Right now, whether moving file is skipped due to LinkFile() is not supported is opaque to users. Add a log message to help users debug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10010

Test Plan: Run existing test. Manual test verify the log message printed out.

Reviewed By: riversand963

Differential Revision: D36463237

fbshipit-source-id: b00bd5041bd5c11afa4e326819c8461ee2c98a91
2022-05-18 11:23:12 -07:00
Jay Zhuang d3a2f284d9 Add Temperature info in `NewSequentialFile()` (#9499)
Summary:
Add Temperature hints information from RocksDB in API
`NewSequentialFile()`. backup and checkpoint operations need to open the
source files with `NewSequentialFile()`, which will have the temperature
hints. Other operations are not covered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499

Test Plan: Added unittest

Reviewed By: pdillinger

Differential Revision: D34006115

Pulled By: jay-zhuang

fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f
2022-02-18 18:23:07 -08:00
Peter Dillinger fc9d4071f0 Fast path for detecting unchanged prefix_extractor (#9407)
Summary:
Fixes a major performance regression in 6.26, where
extra CPU is spent in SliceTransform::AsString when reads involve
a prefix_extractor (Get, MultiGet, Seek). Common case performance
is now better than 6.25.

This change creates a "fast path" for verifying that the current prefix
extractor is unchanged and compatible with what was used to
generate a table file. This fast path detects the common case by
pointer comparison on the current prefix_extractor and a "known
good" prefix extractor (if applicable) that is saved at the time the
table reader is opened. The "known good" prefix extractor is saved
as another shared_ptr copy (in an existing field, however) to ensure
the pointer is not recycled.

When the prefix_extractor has changed to a different instance but
same compatible configuration (rare, odd), performance is still a
regression compared to 6.25, but this is likely acceptable because
of the oddity of such a case. The performance of incompatible
prefix_extractor is essentially unchanged.

Also fixed a minor case (ForwardIterator) where a prefix_extractor
could be used via a raw pointer after being freed as a shared_ptr,
if replaced via SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407

Test Plan:
## Performance
Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`

Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`

Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load)

v6.26: 4833 vs. 6698 (<- major regression!)
v6.27: 4737 vs. 6397 (still)
New: 6704 vs. 6461 (better than baseline in common case)
Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible)
Changed prefix size (no usable filter) in new: 787 vs. 5927
Changed prefix size (no usable filter) in new & baseline: 773 vs. 784

Reviewed By: mrambacher

Differential Revision: D33677812

Pulled By: pdillinger

fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76
2022-01-21 11:37:46 -08:00
Peter Dillinger 0050a73a4f New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).

The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.

This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)

The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.

Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.

Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126

Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.

### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)

### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime

We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:

```
Total cache or DBs size: 32TiB  Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```

These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key.  With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.

More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day

After enough data, we get a result at the end:

```
(keep 40 bits)  17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```

If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:

```
(keep 41 bits)  16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits)  19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```

The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:

```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```

I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.

Reviewed By: zhichao-cao

Differential Revision: D33171746

Pulled By: pdillinger

fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:15:13 -08:00
sdong 88875df821 File temperature information should be preserved when restart the DB (#9242)
Summary:
Fix a bug that causes file temperature not preserved after DB is restarted, or options.max_manifest_file_size is hit.
Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242

Test Plan: Add a unit test that would fail without the fix.

Reviewed By: jay-zhuang

Differential Revision: D32818150

fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec
2021-12-03 14:43:14 -08:00
slk 937fbcbddc Track per-SST user-defined timestamp information in MANIFEST (#9092)
Summary:
Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957

Rockdb has supported user-defined timestamp feature. Application can specify a timestamp
when writing each k-v pair. When data flush from memory to disk file called SST files, file
creation activity will commit to MANIFEST. This commit is for tracking timestamp info in the
MANIFEST for each file. The changes involved are as follows:
1) Track max/min timestamp in FileMetaData, and fix invoved codes.
2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMinTimestamp in
    NewFileCustomTag ( in the kNewFile4 part ), and support invoved codes such as
    VersionEdit Encode and Decode etc.
3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix invoved test codes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092

Reviewed By: ajkr, akankshamahajan15

Differential Revision: D32252323

Pulled By: riversand963

fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162
2021-11-10 10:49:04 -08:00
mrambacher 13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
mrambacher 12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Ramkumar Vadivelu 9a690a74e1 In ParseInternalKey(), include corrupt key info in Status (#7515)
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497

When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`

Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.

Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test

Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515

Reviewed By: ajkr

Differential Revision: D24240264

Pulled By: ramvadiv

fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
2020-10-28 10:12:58 -07:00
Ramkumar Vadivelu e04a50923d Change ParseInternalKey() to return Status instead of bool (#7457)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7430

Change ParseInternalKey() to return Status instead of bool.

db_bench (seekrandom) based before/after results with value size of 100 bytes and 16 bytes can be found at (tests ran on an udb server):
https://www.dropbox.com/s/47bwamdy5ozngph/PIK_ret_Status_results.xlsx?dl=0

![db_bench_results](https://user-images.githubusercontent.com/62277872/94642825-2a21a800-029a-11eb-88f2-124136c83fd3.png)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7457

Reviewed By: ajkr

Differential Revision: D24002433

Pulled By: ramvadiv

fbshipit-source-id: ac253ecf577a29044c47c3fe254a01e71404c44c
2020-09-30 19:16:47 -07:00
Akanksha Mahajan 8e0df9050c Store FSRandomAccessPtr object in RandomAccessFileReader (#7192)
Summary:
Replace FSRandomAccessFile pointer with FSRandomAccessFilePtr
    object in RandomAccessFileReader.
    This new object wraps FSRandomAccessFile pointer.

    Objective: If tracing is enabled, FSRandomAccessFile Ptr returns
    FSRandomAccessFileTracingWrapper pointer that includes all necessary
    information in IORecord and calls underlying FileSystem and invokes
    IOTracer to dump that record in a binary file. If tracing is disabled
    then, underlying FileSystem pointer is returned directly.
    FSRandomAccessFilePtr wrapper class is added to bypass the FSRandomAccessFileWrapper when
    tracing is disabled.

    Test Plan: make check -j64

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7192

Reviewed By: anand1976

Differential Revision: D23356867

Pulled By: akankshamahajan15

fbshipit-source-id: 48f31168166a17a7444b40be44a9a9d4a5c7182c
2020-08-27 11:21:52 -07:00
Akanksha Mahajan cc24ac14eb Store FSSequentialFilePtr object in SequenceFileReader (#7190)
Summary:
This diff contains following changes:
    1. Replace `FSSequentialFile` pointer with `FSSequentialFilePtr` object that wraps `FSSequentialFile` pointer in `SequenceFileReader`.

Objective: If tracing is enabled, `FSSequentialFilePtr` returns `FSSequentialFileTracingWrapper` pointer that includes all necessary information in `IORecord` and calls underlying FileSystem and invokes `IOTracer` to dump that record in a binary file. If tracing is disabled then, underlying `FileSystem` pointer is returned directly. `FSSequentialFilePtr` wrapper class is added to bypass the `FSSequentialFileTracingWrapper` when tracing is disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7190

Test Plan:
make check -j64
          COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D23059616

Pulled By: akankshamahajan15

fbshipit-source-id: 1564b94dd1297cd0fbfe2ed5c9cc3e20f7395301
2020-08-18 16:20:54 -07:00
Akanksha Mahajan 1f9f630b27 Store FileSystemPtr object that contains FileSystem ptr (#7180)
Summary:
As part of the IOTracing project, this PR
    1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
    2. FileSystemPtr object is created using FileSystem pointer and IOTracer
    pointer.
    3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
    4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
    returns FileSystemTracingWrapper pointer for tracing purpose and when
    it is disabled underlying FileSystem pointer is returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180

Test Plan:
make check -j64
                COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D22987117

Pulled By: akankshamahajan15

fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
2020-08-12 17:31:23 -07:00
Zhichao Cao 8c694025e9 Fix potential size_t overflow in import_column_family (#6762)
Summary:
The issue is reported in https://github.com/facebook/rocksdb/issues/6753 . size_t is unsigned and if sorted_file.size() is 0, the end condition of i will be extremely large, cause segment fault in sorted_files[i] and sorted_files[i+1]. Added condition to fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6762

Test Plan: make asan_check

Reviewed By: pdillinger

Differential Revision: D21323063

Pulled By: zhichao-cao

fbshipit-source-id: 56ce59201949ed319448228553202b8642c2cc3a
2020-04-30 08:40:42 -07:00
Andrew Kryczka 6717ada899 Fix CF import with overlapping SST files (#6663)
Summary:
Invariant checking should use internal key comparator rather than
`sstableKeyCompare()`. The latter was intended for checking whether a
compaction input file's neighboring files need to be included in the
same compaction. Using it for invariant checking was leading to false
positives for files with overlapping endpoints.

Fixes https://github.com/facebook/rocksdb/issues/6647.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6663

Test Plan: regression test

Reviewed By: zhichao-cao

Differential Revision: D20910466

Pulled By: ajkr

fbshipit-source-id: f0b70dad7c4096fce635cab7a36f16e14f74ae3f
2020-04-16 13:16:06 -07:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Zhichao Cao 4369f2c7bb Checksum for each SST file and stores in MANIFEST (#6216)
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.

Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string).  A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216

Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.

Differential Revision: D19171461

Pulled By: zhichao-cao

fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
2020-02-10 15:52:52 -08:00
anand76 afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
sdong aa1857e2df Support options.max_open_files = -1 with periodic_compaction_seconds (#6090)
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090

Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.

Differential Revision: D18702268

fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
2019-11-26 21:39:56 -08:00
sdong d8c28e692a Support options.ttl with options.max_open_files = -1 (#6060)
Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.

Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060

Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.

Differential Revision: D18631623

fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
2019-11-22 21:23:00 -08:00
Levi Tamasi 5f025ea832 BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903)
Summary:
This is groundwork for adding garbage collection support to BlobDB. The
patch adds logic that keeps track of the oldest blob file referred to by
each SST file. The oldest blob file is identified during flush/
compaction (similarly to how the range of keys covered by the SST is
identified), and persisted in the manifest as a custom field of the new
file edit record. Blob indexes with TTL are ignored for the purposes of
identifying the oldest blob file (since such blob files are cleaned up by the
TTL logic in BlobDB).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903

Test Plan:
Added new unit tests; also ran db_bench in BlobDB mode, inspected the
manifest using ldb, and confirmed (by scanning the SST files using
sst_dump) that the value of the oldest blob file number field matches
the contents of the file for each SST.

Differential Revision: D17859997

Pulled By: ltamasi

fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90
2019-10-14 15:21:01 -07:00
sdong e8263dbdaa Apply formatter to recent 200+ commits. (#5830)
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830

Test Plan: Run all existing tests.

Differential Revision: D17488031

fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
2019-09-20 12:04:26 -07:00
sdong b931f84e56 Divide file_reader_writer.h and .cc (#5803)
Summary:
file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803

Test Plan: Build whole project using make and cmake.

Differential Revision: D17374550

fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
2019-09-16 10:33:51 -07:00
Jeffrey Xiao d61d4507c0 Fix IngestExternalFile overlapping check (#5649)
Summary:
Previously, the end key of a range deletion tombstone was considered exclusive for the purposes of deletion, but considered inclusive when checking if two SSTables overlap. For example, an SSTable with a range deletion tombstone [a, b) would be considered overlapping with an SSTable with a range deletion tombstone [b, c). This commit fixes this check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5649

Differential Revision: D16808765

Pulled By: anand1976

fbshipit-source-id: 5c7ad1c027e4f778d35070e5dae1b8e6037e0d68
2019-08-14 21:02:28 -07:00
Venki Pallipadi 22ce462450 Export Import sst files (#5495)
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135

This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
//   is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                  const std::string& export_dir,
                                  ExportImportFilesMetaData** metadata);

Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
//     an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
    const ColumnFamilyOptions& options, const std::string& column_family_name,
    const ImportColumnFamilyOptions& import_options,
    const ExportImportFilesMetaData& metadata,
    ColumnFamilyHandle** handle);

Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495

Differential Revision: D16018881

fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
2019-07-17 12:27:14 -07:00