Summary:
Pull in HISTORY for 9.9.0, update version.h for next version, update check_format_compatible.sh, update git hash for folly
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13146
Test Plan: CI
Reviewed By: ltamasi
Differential Revision: D66142259
Pulled By: jowlyzhang
fbshipit-source-id: 90216b2d7cff2e0befb4f56567e3bd074f97c484
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13111
As a follow-up to https://github.com/facebook/rocksdb/pull/13105, the patch changes `BaseDeltaIterator` so that it honors the read option `allow_unprepared_value`. When the option is set and the `BaseDeltaIterator` lands on the base iterator, it defers calling `PrepareValue` on the base iterator and setting `value()` and `columns()` until `PrepareValue` is called on the `BaseDeltaIterator` itself.
Reviewed By: jowlyzhang
Differential Revision: D65344764
fbshipit-source-id: d79c77b5de7c690bf2deeff435e9b0a9065f6c5c
Summary:
Pull in HISTORY for 9.8.0, update version.h for next version, update check_format_compatible.sh
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13093
Test Plan: CI
Reviewed By: jowlyzhang
Differential Revision: D64987257
Pulled By: pdillinger
fbshipit-source-id: a7cec329e3d245e63767760aa0298c08c3281695
Summary:
In https://github.com/facebook/rocksdb/issues/13025 , we made a change to load the latest options file in the remote worker instead of serializing the entire set of options.
That was done under assumption that OPTIONS file do not get purged often. While testing, we learned that this happens more often than we want it to be, so we want to prevent the OPTIONS file from getting purged anytime between when the remote compaction is scheduled and the option is loaded in the remote worker.
Like how we are protecting new SST files from getting purged using `min_pending_output`, we are doing the same by keeping track of `min_options_file_number`. Any OPTIONS file with number greater than `min_options_file_number` will be protected from getting purged. Just like `min_pending_output`, `min_options_file_number` gets bumped when the compaction is done. This is only applicable when `options.compaction_service` is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13074
Test Plan:
```
./compaction_service_test --gtest_filter="*PreservedOptionsLocalCompaction*"
./compaction_service_test --gtest_filter="*PreservedOptionsRemoteCompaction*"
```
Reviewed By: anand1976
Differential Revision: D64433795
Pulled By: jaykorean
fbshipit-source-id: 0d902773f0909d9481dec40abf0b4c54ce5e86b2
Summary:
This PR assigns levels to files in separate batches if they overlap. This approach can potentially assign external files to lower levels.
In the prepare stage, if the input files' key range overlaps themselves, we divide them up in the user specified order into multiple batches. Where the files in the same batch do not overlap with each other, but key range could overlap between batches. If the input files' key range don't overlap, they always just make one default batch.
During the level assignment stage, we assign levels to files one batch after another. It's guaranteed that files within one batch are not overlapping, we assign level to each file one after another. If the previous batch's uppermost level is specified, all files in this batch will be assigned to levels that are higher than that level. The uppermost level used by this batch of files is also tracked, so that it can be used by the next batch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13064
Test Plan:
Updated test and added new test
Manually stress tested
Reviewed By: cbi42
Differential Revision: D64428373
Pulled By: jowlyzhang
fbshipit-source-id: 5aeff125c14094c87cc50088505010dfd2da3d6e
Summary:
When the input files are not overlapping, a.k.a `files_overlap_=false`, it's best to assign them to non L0 levels so that they are not one sorted run each. This can be done regardless of compaction style being leveled or universal without any side effects.
Just my guessing, this special handling may be there because universal compaction used to have an invariant that sequence number on higher levels should not be smaller than sequence number in lower levels. File ingestion used to try to keep up to that promise by doing "sequence number stealing" from the to be assigned level. However, that invariant is no longer true after deletion triggered compaction is added for universal compaction, and we also removed the sequence stealing logic from file ingestion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13059
Test Plan: Updated existing tests
Reviewed By: cbi42
Differential Revision: D64220100
Pulled By: jowlyzhang
fbshipit-source-id: 70a83afba7f4c52d502c393844e6b3273d5cf628
Summary:
a small CF can trigger parallel compaction that applies to the entire DB. This is because the bottommost file size of a small CF can be too small compared to l0 files when a l0->lbase compaction happens. We prevent this by requiring some minimum on the compaction debt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13054
Test Plan: updated unit test.
Reviewed By: hx235
Differential Revision: D63861042
Pulled By: cbi42
fbshipit-source-id: 43bbf327988ef0ef912cd2fc700e3d096a8d2c18
Summary:
Per customer request, we should not merge multiple SST files together during temperature change compaction, since this can cause FIFO TTL compactions to be delayed. This PR changes the compaction picking logic to pick one file at a time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13018
Test Plan: * updated some existing unit tests to test this new behavior.
Reviewed By: jowlyzhang
Differential Revision: D62883292
Pulled By: cbi42
fbshipit-source-id: 6a9fc8c296b5d9b17168ef6645f25153241c8b93
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022
Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.)
Reviewed By: jowlyzhang
Differential Revision: D62977860
fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3
Summary:
* Set write_dbid_to_manifest=true by default
* Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file
* Refactor related DB open code to minimize code duplication
_Recommend hiding whitespace changes for review_
Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019
Test Plan: unit tests and stress test updated for new functionality
Reviewed By: anand1976
Differential Revision: D62898229
Pulled By: pdillinger
fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015
`Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened.
Reviewed By: jowlyzhang
Differential Revision: D62590773
fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28
Summary:
Add option `IngestExternalFileOptions::link_files` that hard links input files and preserves original file links after ingestion, unlike `move_files` which will unlink input files after ingestion. This can be useful when being used together with `allow_db_generated_files` to ingest files from another DB. Also reverted the change to `move_files` in https://github.com/facebook/rocksdb/issues/12959 to simplify the contract so that it will always unlink input files without exception.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12980
Test Plan: updated unit test `ExternSSTFileLinkFailFallbackTest.LinkFailFallBackExternalSst` to test that input files will not be unlinked.
Reviewed By: pdillinger
Differential Revision: D61925111
Pulled By: cbi42
fbshipit-source-id: eadaca72e1ae5288bdd195d57158466e5656fa62
Summary:
so `IngestExternalFileOptions::move_files` and `IngestExternalFileOptions::allow_db_generated_files` are now compatible. The original file links won't be removed if `allow_db_generated_files` is true. This is to prevent deleting files from another DB.
There was a [comment](https://github.com/facebook/rocksdb/pull/12750#discussion_r1684509620) in https://github.com/facebook/rocksdb/issues/12750 about how exactly-once ingestion would work with `move_files`. I've discussed with customer and decided that it can be done by reading the target DB to see if it contains any ingested key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12959
Test Plan: updated unit tests `IngestDBGeneratedFileTest*` to enable `move_files`.
Reviewed By: jowlyzhang
Differential Revision: D61703480
Pulled By: cbi42
fbshipit-source-id: 6b4294369767f989a2f36bbace4ca3c0257aeaf7
Summary:
Main branch cut at defd97bc9.
Updated HISTORY.md, version and format compatibility test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12945
Test Plan: CI
Reviewed By: jowlyzhang
Differential Revision: D61482149
Pulled By: jaykorean
fbshipit-source-id: 4edf7c0a8c6e4df8fcc938bc778dfd02981d0c55
Summary:
In leveled compaction, we pick intra-L0 compaction instead of L0->Lbase whenever L0 size is small. When L0 files contain many deletions, it makes more sense to compact then down instead of accumulating tombstones in L0. This PR uses compensated_file_size when computing L0 size for determining intra-L0 compaction. Also scale down the limit on total L0 size further to be more cautious about accumulating data in L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12878
Test Plan: updated unit test.
Reviewed By: hx235
Differential Revision: D59932421
Pulled By: cbi42
fbshipit-source-id: 9de973ac51eb7df81b38b8c68110072b1aa06321
Summary:
**Context/Summary:**
It seems unreasonable to take the archived log size into account when calculating log size **for flush** in method CreateCheckpoint. If the user sets WAL_ttl_seconds or WAL_size_limit_MB, the argument _log_size_for_flush_ can easily be reached due to the size of the archived dir. As a result, the flush may always be triggered.
**Test**
corverd by ./checkpoint_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12680
Reviewed By: jaykorean
Differential Revision: D59097904
Pulled By: ajkr
fbshipit-source-id: 0ed29c1b078d8f40b85288541b008e00dbc517d3
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.
Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
`background_close_inactive_wals`.
Stacked on https://github.com/facebook/rocksdb/issues/12731
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734
Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.
Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)
Reviewed By: cbi42
Differential Revision: D58295284
Pulled By: pdillinger
fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
Summary:
**Context/Summary:** a better API design is decided lately so we decided to revert these two changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12738
Test Plan: - CI
Reviewed By: ajkr
Differential Revision: D58162165
Pulled By: hx235
fbshipit-source-id: 9bbe4d2fe9fbe39213f4cf137a2d419e6ffb8e16
Summary:
**Context/Summary:** https://github.com/facebook/rocksdb/pull/12556 `avoid_sync_during_shutdown=false` missed an edge case where `manual_wal_flush == true` so WAL sync will still miss unflushed WAL. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12684
Test Plan: modified UT to include this case `manual_wal_flush==true`
Reviewed By: cbi42
Differential Revision: D57655861
Pulled By: hx235
fbshipit-source-id: c9f49fe260e8b38b3ea387558432dcd9a3dbec19
Summary:
For manual compaction, FIFO compaction will always skip key range overlapping checking with SST files. If CompactRange() is called with CompactionRangeOptions::change_level=true, a CF with FIFO compaction will now return Status::NotSupported.
For file ingestion, we will always ingest into L0. Previously, it's possible to ingest files into non-L0 levels with FIFO compaction.
These changes also help to fix [this](a178d15baf/db/db_impl/db_impl_compaction_flush.cc (L1269)) assertion failure in crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12618
Test Plan: added unit tests to verify the new behavior.
Reviewed By: hx235
Differential Revision: D56962401
Pulled By: cbi42
fbshipit-source-id: 19812a1509650b4162b379ca5bee02f2e9d9569d
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.
Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377
Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.
Reviewed By: jowlyzhang
Differential Revision: D54129517
Pulled By: pdillinger
fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
Summary:
Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592
Reviewed By: hx235
Differential Revision: D56650862
Pulled By: ajkr
fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141
Summary:
I had a TODO to complete `CompactionOptions`'s compression API but never did it: d610e14f93/db/compaction/compaction_picker.cc (L371-L373)
Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`.
In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587
Reviewed By: hx235
Differential Revision: D56619273
Pulled By: ajkr
fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e
Summary:
This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427 . On file systems that support storage level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but its a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518
Reviewed By: ajkr
Differential Revision: D55932155
Pulled By: anand1976
fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9
Summary:
**Context/Summary:**
We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than 1 files to L0. This refitting can cause read performance regression as we need to go through every file in L0, corruption in some edge case and false positive corruption caught by force consistency check. We decided to explicitly disallow such behavior.
A related change to OptionChangeMigration():
- When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrate data into a couple L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data upon migration finishes when the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple non-L0 files and refitting those files from non-L0 to L0 if needed. In that way, only some data instead of all data will be dropped immediately after migration to FIFO with a max_table_files_size.
- Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and explicitly state such risk in API comment.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481
Test Plan:
- New UT
- Modified UT
Reviewed By: cbi42
Differential Revision: D55351178
Pulled By: hx235
fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80
Summary:
When the rate limiter does not have any waiting requests, the first request to arrive may consume all of the available bandwidth, despite potentially having lower priority than requests that arrive later in the same refill interval. Then, those higher priority requests must wait for a refill. So even in scenarios in which we have an overall bandwidth surplus, the highest priority requests can be sporadically delayed up to a whole refill period.
Alone, this isn't necessarily problematic as the refill period is configurable via `refill_period_us` and can be tuned down as needed until the max sporadic delay is tolerable. However, tuning down `refill_period_us` had a side effect of reducing burst size. Some users require a certain burst size to issue optimal I/O sizes to the underlying storage system.
To satisfy those users, this PR decouples the refill period from the burst size. That way, the max sporadic delay can be limited without impacting I/O sizes issued to the underlying storage system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12379
Test Plan:
The goal is to show we can now limit the max sporadic delay without impacting compaction's I/O size.
The benchmark runs compaction with a large I/O size, while user reads simultaneously run at a low rate that does not consume all of the available bandwidth. The max sporadic delay is measured using the P100 of rocksdb.file.read.get.micros. I just used strace to verify the compaction reads follow `rate_limiter_single_burst_bytes`
Setup: `./db_bench -benchmarks=fillrandom,flush -write_buffer_size=67108864 -disable_auto_compactions=true -value_size=256 -num=1048576`
Benchmark: `./db_bench -benchmarks=readrandom -use_existing_db=true -num=1048576 -duration=10 -benchmark_read_rate_limit=4096 -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -rate_limit_user_ops=true -statistics=true -cache_size=0 -stats_level=5 -compaction_readahead_size=16777216 -use_direct_reads=true`
Results:
refill_micros | rocksdb.file.read.get.micros (P100)
-- | --
10000 | 10802
100000 | 100240
1000000 | 922061
For verifying compaction read sizes: `strace -fye pread64 ./db_bench -benchmarks=compact -use_existing_db=true -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -compaction_readahead_size=16777216 -use_direct_reads=true`
Reviewed By: hx235
Differential Revision: D54165675
Pulled By: ajkr
fbshipit-source-id: c5968486316cbfb7ff8e5b7d75d3589883dd1105
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365
Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.
```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```
Reviewed By: hx235
Differential Revision: D53982520
Pulled By: ajkr
fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
Summary:
with release notes for 9.0.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D53879416
Pulled By: jowlyzhang
fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352
Test Plan:
tests updated, including format capatibility test
table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes
to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only
could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id.
I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.
Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match
the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.
Reviewed By: anand1976
Differential Revision: D53786884
Pulled By: pdillinger
fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306
Reviewed By: cbi42
Differential Revision: D53200559
Pulled By: ajkr
fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e
Summary:
introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214
Test Plan:
- new unit test
- benchmark:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1
main:
fillrandom : 234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations; 0.5 MB/s
rocksdb.compact.read.bytes COUNT : 1490756235
rocksdb.compact.write.bytes COUNT : 1469056734
rocksdb.flush.write.bytes COUNT : 71099011
branch:
fillrandom : 128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations; 0.9 MB/s
rocksdb.compact.read.bytes COUNT : 807474156
rocksdb.compact.write.bytes COUNT : 781977610
rocksdb.flush.write.bytes COUNT : 71098785
```
Reviewed By: ajkr
Differential Revision: D52637771
Pulled By: cbi42
fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3
Summary:
with release notes for 8.11.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D52926051
Pulled By: pdillinger
fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236
Reviewed By: cbi42
Differential Revision: D52765685
Pulled By: ajkr
fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users
who have disabled their ttl or ttl > periodic_compaction_seconds.
Relate issue: https://github.com/facebook/rocksdb/issues/12165
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175
Reviewed By: ajkr
Differential Revision: D52446722
Pulled By: cbi42
fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
Summary:
When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187
Test Plan: * New unit tests
Reviewed By: ajkr
Differential Revision: D52437194
Pulled By: cbi42
fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
Summary:
HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list. All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem.
However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system.
To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141
Test Plan:
unit test included
## Performance test
We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior:
```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```
```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident
Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```
Now let us sample a range of values in the solution space:
```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident
After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident
After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident
After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident
After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```
I looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so is chosen as the default.
```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident
After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```
The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress.
```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident
After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```
No measurable difference under normal circumstances.
Some tests repeated with FixedHCC, with similar results.
Reviewed By: anand1976
Differential Revision: D52174755
Pulled By: pdillinger
fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.
This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130
Reviewed By: hx235
Differential Revision: D52000746
Pulled By: ajkr
fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7