Summary:
The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met -
1. Duplicate keys in a MultiGet batch
2. Key matches the last key in a non-zero, non-bottommost level file
3. Final value is not in the file (merge operand, not snapshot visible etc)
4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block
5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates
6. Value or merge operand for the key is present in the very next level
The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under the conditions https://github.com/facebook/rocksdb/issues/1 and https://github.com/facebook/rocksdb/issues/2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295
Test Plan: Add unit tests that fail an assertion or return wrong result without the fix
Reviewed By: hx235
Differential Revision: D53187634
Pulled By: anand1976
fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311
Reviewed By: cbi42
Differential Revision: D53233379
Pulled By: ajkr
fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
Summary:
We should be consistent in how we check key range overlap in memtables and in sst files. While all the sst file key range overlap check compares the user key without timestamp, for example:
377eee77f8/db/version_set.cc (L129-L130)
This key range overlap check for memtable is comparing the whole user key. Currently it happen to achieve the same effect because this function is only called by `ExternalSstFileIngestionJob` and `DBImpl::CompactRange`, which takes a user key without timestamp as the range end, pad a max or min timestamp to it depending on whether the end is exclusive. So use `Compartor::Compare` here is working too, but we should update it to `Comparator::CompareWithoutTimestamp` to be consistent with all the other file key range overlapping check functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12315
Test Plan: existing tests
Reviewed By: ltamasi
Differential Revision: D53273456
Pulled By: jowlyzhang
fbshipit-source-id: c094ae1f0c195d52542124c4fb03fdca14241e85
Summary:
To stop spamming our warning logs with normal behavior.
Also fix comment on `DisableFileDeletions()`.
In response to https://github.com/facebook/rocksdb/issues/12001 I've indicated my objection to granting legitimacy to force=true, but I'm not addressing that here and now. In short, the user shouldn't be asked to think about whether they want to use the *wrong* behavior. ;)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12310
Test Plan: existing tests
Reviewed By: jowlyzhang
Differential Revision: D53233117
Pulled By: pdillinger
fbshipit-source-id: 5d2aedb76b02b30f8a5fa5b436fc57fde5d40d6e
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306
Reviewed By: cbi42
Differential Revision: D53200559
Pulled By: ajkr
fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e
Summary:
and also fix comment/label on some MacOS CI jobs. Motivated by a crash test failure missing a definitive indicator of the genesis of the status:
```
file ingestion error: Operation failed. Try again.:
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12307
Test Plan: just cosmetic changes. These statuses should not arise frequently enough to be a performance issue (copying messages).
Reviewed By: jaykorean
Differential Revision: D53199529
Pulled By: pdillinger
fbshipit-source-id: ad83daaa5d80f75c9f81158e90fb6d9ecca33fe3
Summary:
As titled, the replacement tickers have been introduced in https://github.com/facebook/rocksdb/issues/11509 and in use since release 8.4. This PR completely removes the misspelled ones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12302
Test Plan: CI tests
Reviewed By: jaykorean
Differential Revision: D53196935
Pulled By: jowlyzhang
fbshipit-source-id: 9c9d0d321247690db5edfdc52b4fecb2f1218979
Summary:
For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes:
1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block.
2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`.
3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. d69628e6ce/db/dbformat.h (L923-L935)
This max timestamp is removed when in memory `FileMetaData.largest` is persisted into Manifest, we pad it back when it's read from Manifest while handling related `VersionEdit` in `VersionEditHandler`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254
Test Plan: Added unit test and enabled this feature combination's stress test.
Reviewed By: cbi42
Differential Revision: D52965527
Pulled By: jowlyzhang
fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985
Summary:
In C++, `extern` is redundant in a number of cases:
* "Global" function declarations and definitions
* "Global" variable definitions when already declared `extern`
For consistency and simplicity, I've removed these in code that *we own*. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h)
as standard best practice would prescribe.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300
Test Plan: no functional changes, CI
Reviewed By: ajkr
Differential Revision: D53148629
Pulled By: pdillinger
fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886
Summary:
... to include the actual numbers of processed and expected records, and the file number for input files. The purpose is to be able to find the offending files even when the relevant LOG file is gone.
Another change is to check the record count even when `compaction_verify_record_count` is false, and log a warning message without setting corruption status if there is a mismatch. This is consistent with how we check the record count for flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12297
Test Plan:
print the status message in `DBCompactionTest.VerifyRecordCount`
```
before
Corruption: Compaction number of input keys does not match number of keys processed.
after
Compaction number of input keys does not match number of keys processed. Expected 20 but processed 10. Compaction summary: Base version 4 Base level 0, inputs: [11(2156B) 9(2156B)]
```
Reviewed By: ajkr
Differential Revision: D53110130
Pulled By: cbi42
fbshipit-source-id: 6325cbfb8f71f25ce37f23f8277ebe9264863c3b
Summary:
https://github.com/facebook/rocksdb/issues/12267 apparently introduced a data race in test code where a background read of estimated_compaction_needed_bytes while holding the DB mutex could race with forground write for testing purposes. This change adds the DB mutex to those writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12294
Test Plan: 1000 TSAN runs of test (massively fails before change, passes after)
Reviewed By: ajkr
Differential Revision: D53095483
Pulled By: pdillinger
fbshipit-source-id: 13fcb383ebad313dabe39eb8f9085c34d370b54a
Summary:
**Context/Summary:**
We recently found out some code paths in flush and compaction aren't rate-limited when they should.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12290
Test Plan: existing UT**
Reviewed By: anand1976
Differential Revision: D53066103
Pulled By: hx235
fbshipit-source-id: 9dc4cab5f841230d18e5504dc480ac523e9d3950
Summary:
After https://github.com/facebook/rocksdb/issues/12253 this function has crashed in the crash test, in its call to `std::copy`. I haven't reproduced the crash directly, but `std::copy` probably has undefined behavior if the starting iterator is after the ending iterator, which was possible. I've fixed the logic to deal with that case and to add an assertion to check that precondition of `std::copy` (which appears can be unchecked by `std::copy` itself even with UBSAN+ASAN).
Also added some unit tests etc. that were unfinished for https://github.com/facebook/rocksdb/issues/12253, and slightly tweak SeqnoToTimeMapping::EnforceMaxTimeSpan handling of zero time span case.
This is intended for patching 8.11.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12293
Test Plan: tests added. Will trigger ~20 runs of the crash test job that saw the crash. https://fburl.com/ci/5iiizvfa
Reviewed By: jowlyzhang
Differential Revision: D53090422
Pulled By: pdillinger
fbshipit-source-id: 69d60b1847d9c7e4ae62b153011c2040405db461
Summary:
The test has been failing with
```
[ RUN ] DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit
db/db_compaction_test.cc:9661: Failure
Expected equality of these values:
0u
Which is: 0
env_->GetThreadPoolQueueLen(Env::Priority::LOW)
Which is: 1
```
This can happen when thread pool queue len is checked before `test::SleepingBackgroundTask::DoSleepTask` is scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12289
Reviewed By: ajkr
Differential Revision: D53064300
Pulled By: cbi42
fbshipit-source-id: 9ed1b714243880f82bd1cc1584b402ac9cf57507
Summary:
While ingesting multiple external files with key range overlap, current flow go through the lsm tree to do a search for a target level and later discard that result by defaulting back to L0. This PR improves this by just skip the search altogether.
The other change is to remove default to L0 for the combination of universal compaction + force global sequence number, which was initially added to meet a pre https://github.com/facebook/rocksdb/issues/7421 invariant.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12284
Test Plan:
Added unit test:
./external_sst_file_test --gtest_filter="*IngestFileWithGlobalSeqnoAssignedUniversal*"
Reviewed By: ajkr
Differential Revision: D53072238
Pulled By: jowlyzhang
fbshipit-source-id: 30943e2e284a7f23b495c0ea4c80cb166a34a8ac
Summary:
Seen in build-macos-cmake:
```
Received signal 11 (Segmentation fault: 11)
https://github.com/facebook/rocksdb/issues/1 rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0::operator()(void*) const (in seqno_time_test) (mock_time_env.cc:29)
https://github.com/facebook/rocksdb/issues/2 decltype(std::declval<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&>()(std::declval<void*>())) std::__1::__invoke[abi:v15006]<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&, void*>(rocksdb::MockSystemClock::InstallTimedWait ixCallback()::$_0&, void*&&) (in seqno_time_test) (invoke.h:394)
...
```
This is presumably because the std::function from the lambda only saves a copy of the SeqnoTimeTest* this pointer, which doesn't prevent it from being reclaimed on parallel shutdown. If we instead save a copy of the `std::shared_ptr<MockSystemClock>` in the std::function, this should prevent the crash. (Note that in `SyncPoint::Data::Process()` copies the std::function before releasing the mutex for calling the callback.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12282
Test Plan: watch CI
Reviewed By: cbi42
Differential Revision: D53027136
Pulled By: pdillinger
fbshipit-source-id: 26cd9c0352541d806d42bb061dd349d3b47171a5
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12278
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: jaykorean
Differential Revision: D52969116
fbshipit-source-id: 8cb28dafdbede54e8cb59c2b8d461b1eddb3de68
Summary:
The test is [flaky](https://github.com/facebook/rocksdb/actions/runs/7616272304/job/20742657041?pr=12257&fbclid=IwAR1vNI1rSRVKnOsXs0WCPklqTkBXxlwS1GMJgWWe7D8dtAvh6e6wxk067FY) but I could not reproduce the test failure. Add some debug print to make the next failure more helpful
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12268
Test Plan:
```
check print works when test fails:
[ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 6134067200, thread status:
thread id: 6133493760, thread status: Compaction
db/db_test.cc:4680: Failure
Expected equality of these values:
op_count
Which is: 1
expected_count
Which is: 0
```
Reviewed By: hx235
Differential Revision: D52987503
Pulled By: cbi42
fbshipit-source-id: 33b369796f9b97155578b45167e722ddcde93594
Summary:
This PR adds estimated pending compaction bytes in two places:
- The "Level summary", which is printed to the info LOG after every flush or compaction
- The "rocksdb.cfstats" property, which is printed to the info LOG periodically according to `stats_dump_period_sec`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12267
Test Plan:
Ran `./db_bench -benchmarks=filluniquerandom -stats_dump_period_sec=1 -statistics=true -write_buffer_size=524288` and looked at the LOG.
```
** Compaction Stats [default] **
...
Estimated pending compaction bytes: 12117691
...
2024/01/22-13:15:12.283563 1572872 (Original Log Time 2024/01/22-13:15:12.283540) [/db_impl/db_impl_compaction_flush.cc:371] [default] Level summary: files[10 1 0 0 0 0 0] max score 0.50, estimated pending compaction bytes 12359137
```
Reviewed By: cbi42
Differential Revision: D52973337
Pulled By: ajkr
fbshipit-source-id: c4e546bd9bdac387eebeeba303d04125212037b8
Summary:
This is a non functional refactor, mostly for deduplicating the stats recording logic in error handler. Plus some documentation update and simple code dedupe.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11992
Test Plan: existing tests
Reviewed By: hx235
Differential Revision: D52967713
Pulled By: jowlyzhang
fbshipit-source-id: d584eae1a06410438f5a4c59c2cb67666ea7de1a
Summary:
introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214
Test Plan:
- new unit test
- benchmark:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1
main:
fillrandom : 234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations; 0.5 MB/s
rocksdb.compact.read.bytes COUNT : 1490756235
rocksdb.compact.write.bytes COUNT : 1469056734
rocksdb.flush.write.bytes COUNT : 71099011
branch:
fillrandom : 128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations; 0.9 MB/s
rocksdb.compact.read.bytes COUNT : 807474156
rocksdb.compact.write.bytes COUNT : 781977610
rocksdb.flush.write.bytes COUNT : 71098785
```
Reviewed By: ajkr
Differential Revision: D52637771
Pulled By: cbi42
fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3
Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.
* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
* Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
* Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
* Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.
Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253
Test Plan: unit tests updated
Reviewed By: jowlyzhang
Differential Revision: D52913733
Pulled By: pdillinger
fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
Summary:
We saw failures like
```
db/perf_context_test.cc:952: Failure
Expected: (next_count) > (count), actual: 26699 vs 26699
```
I can repro by running the test repeatedly and the test fails with different seek keys. So
the cause is likely not with Seek() implementation. I found that
`clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);` can return the same time when
called repeatedly. However, I don't know if Seek() is fast enough that this happened during
continuous test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12252
Test Plan: `gtest_parallel.py --repeat=10000 --workers=1 ./perf_context_test --gtest_filter="PerfContextTest.CPUTimer"`
Reviewed By: ajkr
Differential Revision: D52912751
Pulled By: cbi42
fbshipit-source-id: 8985ae93baa99cdf4b9136ea38addd2e41f4b202
Summary:
Add asserts to help debug a crash test failure. The test fails as wollows -
```rocksdb::FilePickerMultiGet::PrepareNextLevel(): Assertion `fp_ctx.search_right_bound == -1 || fp_ctx.search_right_bound == FileIndexer::kLevelMaxIndex' failed```
Also add a unit test to verify an edge case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12241
Reviewed By: cbi42
Differential Revision: D52819029
Pulled By: anand1976
fbshipit-source-id: 33316985c8ace1aed9ecc2400da8b777aec488ff
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12208.
After all the SSTs have been deleted, all the blob files will become unreferenced.
These files should be considered obsolete and thus, should not be saved to the vstorage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12235
Reviewed By: jowlyzhang
Differential Revision: D52806441
Pulled By: ltamasi
fbshipit-source-id: 62f94d4f2544ed2822c764d8ace5bf7f57efe42d
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236
Reviewed By: cbi42
Differential Revision: D52765685
Pulled By: ajkr
fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
Summary:
- **Context**:
In ClipColumnFamily, the DeleteRange API will be used to delete data, and then CompactRange will be called for physical deletion. But now However, the ColumnFamilyHandle is not passed , so by default only the DefaultColumnFamily will be CompactRanged. Therefore, it may cause that the data in some sst files of CompactionRange cannot be physically deleted.
- **In this change**
Pass the ColumnFamilyHandle when call CompactRange
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12219
Reviewed By: ajkr
Differential Revision: D52665162
Pulled By: cbi42
fbshipit-source-id: e8e997aa25ec4ca40e347be89edc7e84a7a0edce
Summary:
Summary - Refactor FilePrefetchBuffer code
- Implementation:
FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_.
If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data.
If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer.
If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed.
If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to
the caller.
- Merged Sync and Async code flow into one in FilePrefetchBuffer.
Test Plan -
- Crash test passed
- Unit tests
- Pending - Benchmarks
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097
Reviewed By: ajkr
Differential Revision: D51759552
Pulled By: akankshamahajan15
fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174
Test Plan: Unit test, and some manual verification
Reviewed By: ajkr
Differential Revision: D52451184
Pulled By: anand1976
fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83
Summary:
Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users
who have disabled their ttl or ttl > periodic_compaction_seconds.
Relate issue: https://github.com/facebook/rocksdb/issues/12165
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175
Reviewed By: ajkr
Differential Revision: D52446722
Pulled By: cbi42
fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
Summary:
When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187
Test Plan: * New unit tests
Reviewed By: ajkr
Differential Revision: D52437194
Pulled By: cbi42
fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
Summary:
Now that `level_compaction_dynamic_level_bytes`'s default value is true, users who do not touch that setting and use non-leveled compaction will also see this log message. It can be info level rather than warning since, in the case mentioned, there is nothing the user needs to be warned about.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12186
Reviewed By: cbi42
Differential Revision: D52422499
Pulled By: ajkr
fbshipit-source-id: 8dbfcd102aab671b881ba047fb4a0a555b3e0a78
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test:
```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176
Test Plan: unit test added
Reviewed By: cbi42
Differential Revision: D52417071
Pulled By: pdillinger
fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147
Test Plan: Existing CircleCI
Reviewed By: cbi42
Differential Revision: D52173481
Pulled By: akankshamahajan15
fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.
This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130
Reviewed By: hx235
Differential Revision: D52000746
Pulled By: ajkr
fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7