Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration, point lookup breakdown etc. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e a RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay or stalls etc. It does not include time spent waiting to acquire mutexes, or waiting for IO.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368
Test Plan: Add a unit test for write_thread_wait_nanos
Reviewed By: ajkr
Differential Revision: D54021583
Pulled By: anand1976
fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
Summary:
This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files to a column family that enables user-defined timestamps in Memtable only feature.
Two fields in the table properties are used for aformentioned check: 1) the comparator name, it records what comparator is used to create this external sst file, 2) the flag `user_defined_timestamps_persisted`. We compare these two fields with the column family's settings. The details are in util function `ValidateUserDefinedTimestampsOptions`.
To optimize for the majority of the cases where sanity check should pass and the table properties read should not affect how `TableReader` is constructed, instead of read the table properties block separately and use it for sanity check before creating a `TableReader`. We continue using the current flow to first create a `TableReader`, use it for reading table properties and do sanity checks, and reset the`TableReader` for the case where the column family enables UDTs in memtable only feature, and the external file does not contain user-defined timestamps.
This PR also groups other table properties related sanity check in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356
Test Plan:
added unit test
existing unit test
Reviewed By: cbi42
Differential Revision: D54025116
Pulled By: jowlyzhang
fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365
Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.
```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```
Reviewed By: hx235
Differential Revision: D53982520
Pulled By: ajkr
fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
Summary:
with release notes for 9.0.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D53879416
Pulled By: jowlyzhang
fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327
Reviewed By: pdillinger
Differential Revision: D53828151
Pulled By: anand1976
fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352
Test Plan:
tests updated, including format capatibility test
table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes
to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only
could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id.
I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.
Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match
the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.
Reviewed By: anand1976
Differential Revision: D53786884
Pulled By: pdillinger
fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
Summary:
.. for public api change related to sst_dump.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12353
Reviewed By: jaykorean
Differential Revision: D53791123
Pulled By: cbi42
fbshipit-source-id: 3fbe9c7a3eb0a30dc1a00d39bc8a46028baa3779
Summary:
This PR adds support in `SstFileWriter` to create SST files without persisting timestamps when the column family has enabled UDTs in Memtable only feature. The sst files created from flush and compaction do not contain timestamps, we want to make the sst files created by `SstFileWriter` to follow the same pattern and not persist timestamps. This is to prepare for ingesting external SST files for this type of column family.
There are timestamp-aware APIs and non timestamp-aware APIs in `SstFileWriter`. The former are exclusively used for when the column family's comparator is timestamp-aware, a.k.a `Comparator::timestamp_size() > 0`, while the latter are exclusively used for the column family's comparator is non timestamp-aware, a.k.a `Comparator::timestamp_size() == 0`. There are sanity checks to make sure these APIs are correctly used.
In this PR, the APIs usage continue with above enforcement, where even though timestamps are not eventually persisted, users are still asked to use only the timestamp-aware APIs. But because data points will logically all have minimum timestamps, we don't allow multiple versions of the same user key (without timestamp) to be added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12348
Test Plan:
Added unit tests
Manual inspection of generated sst files with `sst_dump`
Reviewed By: ltamasi
Differential Revision: D53732667
Pulled By: jowlyzhang
fbshipit-source-id: e43beba0d3a1736b94ee5c617163a6280efd65b7
Summary:
There is no strong reason for user to need this mode while on the other hand, its behavior is destructive.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337
Reviewed By: hx235
Differential Revision: D53630393
Pulled By: jowlyzhang
fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347
`DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels).
Reviewed By: jowlyzhang
Differential Revision: D53675153
fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481
Summary:
(as title)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12336
Test Plan: in use at Meta for a large service; in crash test
Reviewed By: hx235
Differential Revision: D53537628
Pulled By: pdillinger
fbshipit-source-id: 69e7ac9ab7b59b928d1144105667a7fde8a55a5a
Summary:
Introduce some different range classes `UserKeyRange` and `UserKeyRangePtr` to be used by internal implementation. The `Range` class is used in both public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc and internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables`, `VersionSet::GetPropertiesOfTablesInRange`.
These APIs have different expectations of what keys this range class contain. Public API users are supposed to populate the range with the user keys without timestamp, in the same way that point lookup and range scan APIs' key input only expect the user key without timestamp. The internal APIs implementation expect a user key whose format is compatible with the user comparator, a.k.a a user key with the timestamp.
This PR contains:
1) introducing counterpart range class `UserKeyRange` `UserKeyRangePtr` for internal implementation while leave the existing `Range` and `RangePtr` class only for public APIs. Internal implementations are updated to use this new class instead.
2) add user-defined timestamp support for `DB::GetPropertiesOfTablesInRange` API and `DeleteFilesInRanges` API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071
Test Plan:
existing tests
Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled.
The change in external_file_ingestion_job doesn't have a user-defined timestamp enabled test case coverage, will add one in a follow up PR that adds file ingestion support for UDT.
Reviewed By: ltamasi
Differential Revision: D53292608
Pulled By: jowlyzhang
fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10835 to allow disabling the new compaction behavior if it's not safe. The option is enabled by default and there has not been a need to disable it. So it should be safe to remove now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12323
Reviewed By: ajkr
Differential Revision: D53330336
Pulled By: cbi42
fbshipit-source-id: 36eef4664ac96b3a7ed627c48bd6610b0a7eafc5
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10655 to allow reverting to old behavior. The option is enabled by default and there has not been a need to disable it. Remove it for 9.0 release. Also fixed and improved a few unit tests that depended on setting this option to false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12325
Test Plan: existing tests.
Reviewed By: hx235
Differential Revision: D53369430
Pulled By: cbi42
fbshipit-source-id: 0ec2440ca8d88db7f7211c581542c7581bd4d3de
Summary:
The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met -
1. Duplicate keys in a MultiGet batch
2. Key matches the last key in a non-zero, non-bottommost level file
3. Final value is not in the file (merge operand, not snapshot visible etc)
4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block
5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates
6. Value or merge operand for the key is present in the very next level
The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under the conditions https://github.com/facebook/rocksdb/issues/1 and https://github.com/facebook/rocksdb/issues/2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295
Test Plan: Add unit tests that fail an assertion or return wrong result without the fix
Reviewed By: hx235
Differential Revision: D53187634
Pulled By: anand1976
fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311
Reviewed By: cbi42
Differential Revision: D53233379
Pulled By: ajkr
fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
Summary:
As titled. This changes public API behavior, and subclasses of `WritableFile` and `FSWritableFile` need to explicitly provide an implementation for the `GetFileSize` method after this change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12303
Reviewed By: ajkr
Differential Revision: D53205769
Pulled By: jowlyzhang
fbshipit-source-id: 2e613ca3650302913821b33159b742bdf1d24bc7
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306
Reviewed By: cbi42
Differential Revision: D53200559
Pulled By: ajkr
fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e
Summary:
As titled, the replacement tickers have been introduced in https://github.com/facebook/rocksdb/issues/11509 and in use since release 8.4. This PR completely removes the misspelled ones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12302
Test Plan: CI tests
Reviewed By: jaykorean
Differential Revision: D53196935
Pulled By: jowlyzhang
fbshipit-source-id: 9c9d0d321247690db5edfdc52b4fecb2f1218979
Summary:
Provide support for FSBuffer for point lookups
It also add support for compaction and scan reads that goes through BlockFetcher when readahead/prefetching is not enabled.
Some of the compaction/Scan reads goes through FilePrefetchBuffer and some through BlockFetcher. This PR add support to use underlying file system scratch buffer for reads that go through BlockFetcher as for FilePrefetch reads, design is complicated to support this feature.
Design - In order to use underlying FileSystem provided scratch for Reads, it uses MultiRead with 1 request instead of Read API which required API change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12266
Test Plan: Stress test using underlying file system scratch buffer internally.
Reviewed By: anand1976
Differential Revision: D53019089
Pulled By: akankshamahajan15
fbshipit-source-id: 4fe3d090d77363320e4b67186fd4d51c005c0961
Summary:
introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214
Test Plan:
- new unit test
- benchmark:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1
main:
fillrandom : 234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations; 0.5 MB/s
rocksdb.compact.read.bytes COUNT : 1490756235
rocksdb.compact.write.bytes COUNT : 1469056734
rocksdb.flush.write.bytes COUNT : 71099011
branch:
fillrandom : 128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations; 0.9 MB/s
rocksdb.compact.read.bytes COUNT : 807474156
rocksdb.compact.write.bytes COUNT : 781977610
rocksdb.flush.write.bytes COUNT : 71098785
```
Reviewed By: ajkr
Differential Revision: D52637771
Pulled By: cbi42
fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3
Summary:
with release notes for 8.11.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D52926051
Pulled By: pdillinger
fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4
Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.
* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
* Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
* Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
* Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.
Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253
Test Plan: unit tests updated
Reviewed By: jowlyzhang
Differential Revision: D52913733
Pulled By: pdillinger
fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
Summary:
These options were added for users to roll back a behavior change without downgrading. To our knowledge they were not needed so can now be removed.
- `level_compaction_dynamic_file_size`
- `ignore_max_compaction_bytes_for_input`
These options were added for users to disable an online validation in case it is expensive or has false positives. Those validations have shown to be cheap, correct, and are enabled by default, so these options can be removed.
- `check_flush_compaction_key_order`
- `flush_verify_memtable_count`
- `compaction_verify_record_count`
- `fail_if_options_file_error`
This option was added for users to violate API contracts or run old databases that used to violate API contracts. It appears to be set by MyRocks so it is unclear whether we can remove it. In any case we should discourage it until it can be removed.
- `enforce_single_del_contracts`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12249
Reviewed By: cbi42
Differential Revision: D52886651
Pulled By: ajkr
fbshipit-source-id: e0d5a35144ce048505899efb1ca68c3948050aa4
Summary:
Add ```CompressionOptions``` to ```CompressedSecondaryCacheOptions``` to allow users to set options such as compression level. It allows performance to be fine tuned.
Tests -
Run db_bench and verify compression options in the LOG file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234
Reviewed By: ajkr
Differential Revision: D52758133
Pulled By: anand1976
fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236
Reviewed By: cbi42
Differential Revision: D52765685
Pulled By: ajkr
fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
Summary:
We often need to read the table properties of an SST file when taking a backup. However, we currently do not check checksums for this step, and even with that enabled, we ignore failures. This change ensures we fail creating a backup if corruption is detected in that step of reading table properties.
To get this working properly (with existing unit tests), we also add some temperature handling logic like already exists in
BackupEngineImpl::ReadFileAndComputeChecksum and elsewhere in BackupEngine. Also, SstFileDumper needed a fix to its error handling logic.
This was originally intended to help diagnose some mysterious failures (apparent corruptions) seen in taking backups in the crash test, though that is now fixed in https://github.com/facebook/rocksdb/pull/12206
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12200
Test Plan: unit test added that corrupts table properties, along with existing tests
Reviewed By: ajkr
Differential Revision: D52520674
Pulled By: pdillinger
fbshipit-source-id: 032cfc0791428f3b8147d34c7d424ab128e28f42
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.
In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
* Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.
Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206
Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200
Reviewed By: akankshamahajan15
Differential Revision: D52551701
Pulled By: pdillinger
fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users
who have disabled their ttl or ttl > periodic_compaction_seconds.
Relate issue: https://github.com/facebook/rocksdb/issues/12165
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175
Reviewed By: ajkr
Differential Revision: D52446722
Pulled By: cbi42
fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
Summary:
When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187
Test Plan: * New unit tests
Reviewed By: ajkr
Differential Revision: D52437194
Pulled By: cbi42
fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test:
```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176
Test Plan: unit test added
Reviewed By: cbi42
Differential Revision: D52417071
Pulled By: pdillinger
fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5
Summary:
HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list. All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem.
However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system.
To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141
Test Plan:
unit test included
## Performance test
We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior:
```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```
```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident
Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```
Now let us sample a range of values in the solution space:
```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident
After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident
After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident
After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident
After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```
I looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so is chosen as the default.
```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident
After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```
The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress.
```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident
After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```
No measurable difference under normal circumstances.
Some tests repeated with FixedHCC, with similar results.
Reviewed By: anand1976
Differential Revision: D52174755
Pulled By: pdillinger
fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147
Test Plan: Existing CircleCI
Reviewed By: cbi42
Differential Revision: D52173481
Pulled By: akankshamahajan15
fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.
This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130
Reviewed By: hx235
Differential Revision: D52000746
Pulled By: ajkr
fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7
Summary:
There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134
Test Plan: Add a new unit test in tiered_secondary_cache_test
Reviewed By: akankshamahajan15
Differential Revision: D52023309
Pulled By: anand1976
fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b
Summary:
As part of building another feature, I wanted this:
* Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases.
* Polished, clarified some related API comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129
Test Plan: unit test added
Reviewed By: ltamasi
Differential Revision: D51966667
Pulled By: pdillinger
fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770
Summary:
### Implement new Java API get()/put()/merge() methods, and transactional variants.
The Java API methods are very inconsistent in terms of how they pass parameters (byte[], ByteBuffer), and what variants and defaulted parameters they support. We try to bring some consistency to this.
* All APIs should support calls with ByteBuffer parameters.
* Similar methods (RocksDB.get() vs Transaction.get()) should support as similar as possible sets of parameters for predictability.
* get()-like methods should provide variants where the caller supplies the target buffer, for the sake of efficiency. Allocation costs in Java can be significant when large buffers are repeatedly allocated and freed.
### API Additions
1. RockDB.get implement indirect ByteBuffers. Added indirect ByteBuffers and supporting native methods for get().
2. RocksDB.Iterator implement missing (byte[], offset, length) variants for key() and value() parameters.
3. Transaction.get() implement missing methods, based on RocksDB.get. Added ByteBuffer.get with and without column family. Added byte[]-as-target get.
4. Transaction.iterator() implement a getIterator() which defaults ReadOptions; as per RocksDB.iterator(). Rationalize support API for this and RocksDB.iterator()
5. RocksDB.merge implement ByteBuffer methods; both direct and indirect buffers. Shadow the methods of RocksDB.put; RocksDB.put only offers ByteBuffer API with explicit WriteOptions. Duplicated this with RocksDB.merge
6. Transaction.merge implement methods as per RocksDB.merge methods. Transaction is already constructed with WriteOptions, so no explicit WriteOptions methods required.
7. Transaction.mergeUntracked implement the same API methods as Transaction.merge except the ones that use assumeTracked, because that’s not a feature of merge untracked.
### Support Changes (C++)
The current JNI code in C++ supports multiple variants of methods through a number of helper functions. There are numerous TODO suggestions in the code proposing that the helpers be re-factored/shared.
We have taken a different approach for the new methods; we have created wrapper classes `JDirectBufferSlice`, `JDirectBufferPinnableSlice`, `JByteArraySlice` and `JByteArrayPinnableSlice` RAII classes which construct slices from JNI parameters and can then be passed directly to RocksDB methods. For instance, the `Java_org_rocksdb_Transaction_getDirect` method is implemented like this:
```
try {
ROCKSDB_NAMESPACE::JDirectBufferSlice key(env, jkey_bb, jkey_off,
jkey_part_len);
ROCKSDB_NAMESPACE::JDirectBufferPinnableSlice value(env, jval_bb, jval_off,
jval_part_len);
ROCKSDB_NAMESPACE::KVException::ThrowOnError(
env, txn->Get(*read_options, column_family_handle, key.slice(),
&value.pinnable_slice()));
return value.Fetch();
} catch (const ROCKSDB_NAMESPACE::KVException& e) {
return e.Code();
}
```
Notice the try/catch mechanism with the `KVException` class, which combined with RAII and the wrapper classes means that there is no ad-hoc cleanup necessary in the JNI methods.
We propose to extend this mechanism to existing JNI methods as further work.
### Support Changes (Java)
Where there are multiple parameter-variant versions of the same method, we use fewer or just one supporting native method for all of them. This makes maintenance a bit easier and reduces the opportunity for coding errors mixing up (untyped) object handles.
In order to support this efficiently, some classes need to have default values for column families and read options added and cached so that they are not re-constructed on every method call.
This PR closes https://github.com/facebook/rocksdb/issues/9776
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11019
Reviewed By: ajkr
Differential Revision: D52039446
Pulled By: jowlyzhang
fbshipit-source-id: 45d0140a4887e42134d2e56520e9b8efbd349660
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12061.
We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable.
The cause of this bug is previously during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so it is still added to during the `Prepare()` phase even though we're not writing to the Memtable, which was why we saw the value to be double of what's written to WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111
Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes.
Reviewed By: jaykorean
Differential Revision: D51954327
Pulled By: jowlyzhang
fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c