Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.
* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
* Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
* Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
* Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.
Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253
Test Plan: unit tests updated
Reviewed By: jowlyzhang
Differential Revision: D52913733
Pulled By: pdillinger
fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
Summary:
Currently, we treat the long-running whitebox_crash_test as passing. However, we were not cleaning up after ourselves when we killed the running test for running too long, which often caused out-of-space errors in subsequent tests (e.g., blackbox_crash_test after whitebox_crash_test).
Unless we want to start treating these timeouts as failures and need the DB output for investigation now, we should properly clean up the tmp dir.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12248
Test Plan:
```
$> make crash_test -j
```
Reviewed By: ajkr
Differential Revision: D52885342
Pulled By: jaykorean
fbshipit-source-id: 7c1f2ca7cf03d0705bb14155ee44d5d7a411c132
Summary:
These options were added for users to roll back a behavior change without downgrading. To our knowledge they were not needed so can now be removed.
- `level_compaction_dynamic_file_size`
- `ignore_max_compaction_bytes_for_input`
These options were added for users to disable an online validation in case it is expensive or has false positives. Those validations have shown to be cheap, correct, and are enabled by default, so these options can be removed.
- `check_flush_compaction_key_order`
- `flush_verify_memtable_count`
- `compaction_verify_record_count`
- `fail_if_options_file_error`
This option was added for users to violate API contracts or run old databases that used to violate API contracts. It appears to be set by MyRocks so it is unclear whether we can remove it. In any case we should discourage it until it can be removed.
- `enforce_single_del_contracts`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12249
Reviewed By: cbi42
Differential Revision: D52886651
Pulled By: ajkr
fbshipit-source-id: e0d5a35144ce048505899efb1ca68c3948050aa4
Summary:
We saw failures like
```
db/perf_context_test.cc:952: Failure
Expected: (next_count) > (count), actual: 26699 vs 26699
```
I can repro by running the test repeatedly and the test fails with different seek keys. So
the cause is likely not with Seek() implementation. I found that
`clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);` can return the same time when
called repeatedly. However, I don't know if Seek() is fast enough that this happened during
continuous test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12252
Test Plan: `gtest_parallel.py --repeat=10000 --workers=1 ./perf_context_test --gtest_filter="PerfContextTest.CPUTimer"`
Reviewed By: ajkr
Differential Revision: D52912751
Pulled By: cbi42
fbshipit-source-id: 8985ae93baa99cdf4b9136ea38addd2e41f4b202
Summary:
After refactoring of FilePrefetchBuffer, PREFETCH_BYTES_USEFUL was miscalculated. Instead of calculating how many requested bytes are already in the buffer, it took into account alignment as well because aligned_useful_len takes into consideration alignment too.
Also refactored the naming of chunk_offset_in_buffer to make it similar to aligned_useful_len
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12251
Test Plan:
1. Validated internally through release validation benchmarks.
2. Updated unit test that fails without the fix.
Reviewed By: ajkr
Differential Revision: D52891112
Pulled By: akankshamahajan15
fbshipit-source-id: 2526a0b0572d473beaf8b841f2f9c2f6275d9779
Summary:
## Overview
In this PR, we introduce support for setting the RocksDB native logger through Java. As mentioned in the discussion on the [Google Group discussion](https://groups.google.com/g/rocksdb/c/xYmbEs4sqRM/m/e73E4whJAQAJ), this work is primarily motivated by the JDK 17 [performance regression in JNI thread attach/detach calls](https://bugs.openjdk.org/browse/JDK-8314859): the only existing RocksJava logging configuration call, `setLogger`, invokes the provided logger over the JNI.
## Changes
Specifically, these changes add support for the `devnull` and `stderr` native loggers. For the `stderr` logger, we add the ability to prefix every log with a `logPrefix`, so that it becomes possible know which database a particular log is coming from (if multiple databases are in use). The API looks like the following:
```java
Options opts = new Options();
NativeLogger stderrNativeLogger = NativeLogger.newStderrLogger(
InfoLogLevel.DEBUG_LEVEL, "[my prefix here]");
options.setLogger(stderrNativeLogger);
try (final RocksDB db = RocksDB.open(options, ...)) {...}
// Cleanup
stderrNativeLogger.close()
opts.close();
```
Note that the API to set the logger is the same, via `Options::setLogger` (or `DBOptions::setLogger`). However, it will set the RocksDB logger to be native when the provided logger is an instance of `NativeLogger`.
## Testing
Two tests have been added in `NativeLoggerTest.java`. The first test creates both the `devnull` and `stderr` loggers, and sets them on the associated `Options`. However, to avoid polluting the testing output with logs from `stderr`, only the `devnull` logger is actually used in the test. The second test does the same logic, but for `DBOptions`.
It is possible to manually verify the `stderr` logger by modifying the tests slightly, and observing that the console indeed gets cluttered with logs from `stderr`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12213
Reviewed By: cbi42
Differential Revision: D52772306
Pulled By: ajkr
fbshipit-source-id: 4026895f78f9cc250daf6bfa57427957e2d8b053
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12243
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Reviewed By: jowlyzhang
Differential Revision: D52847993
fbshipit-source-id: 221da13c6ca9967e3b934f98f318a832a144df39
Summary:
Add asserts to help debug a crash test failure. The test fails as wollows -
```rocksdb::FilePickerMultiGet::PrepareNextLevel(): Assertion `fp_ctx.search_right_bound == -1 || fp_ctx.search_right_bound == FileIndexer::kLevelMaxIndex' failed```
Also add a unit test to verify an edge case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12241
Reviewed By: cbi42
Differential Revision: D52819029
Pulled By: anand1976
fbshipit-source-id: 33316985c8ace1aed9ecc2400da8b777aec488ff
Summary:
This PR fixes this type of stress test failure that could happen in either checkpoint or backup. Example failure messages are like this:
`Failure in a backup/restore operation with: Corruption: 0x00000000000001D5000000000000012B00000000000000FD exists in original db but not in restore`
`A checkpoint operation failed with: Corruption: 0x0000000000000365000000000000012B0000000000000067 exists in original db but not in checkpoint /...`
The internal task has an example test command to quickly reproduce this type of error.
The common symptom of these test failures are these expected keys do not exist in the original db either. The root cause is `TestCheckpoint` and `TestBackupRestore` both use the expected state as a proxy for the state of the original db when it comes to check a key's existence. 0758271d51/db_stress_tool/db_stress_test_base.cc (L1838)
This `ExpectedState::Exists` API returns true if a key has a pending write, such as a pending put. In usual case, this pending put should either soon materialize to an actual write when `PendingExpectedValue::Commit` is called to reflect a successful write to the DB, or test should be safely terminated if write to DB fails. All of which happens while a key is locked. So checkpoint and backup usually won't see the discrepancy between db and expected state caused by pending writes. However, the external file ingestion test currently has a path that will proceed the test after a failed ingestion caused by injected errors, leaving the pending put in the expected state. 0758271d51/db_stress_tool/no_batched_ops_stress.cc (L1577-L1589)
I think a proper and future proof fix for this is to explicitly rollback a pending state when a db write operation failed so that expected state do not diverge from db in the first place. I added a `PendingExpectedValue::Rollback` API so that we don't implicitly depend on thread termination to prevent test failures. Another place that could cause same divergence as external file ingestion is `PreloadDbAndReopenAsReadOnly`.
0758271d51/db_stress_tool/db_stress_test_base.cc (L616-L619)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12227
Reviewed By: hx235
Differential Revision: D52705470
Pulled By: jowlyzhang
fbshipit-source-id: b21586b037caeeba29a2cff8c2fdc6f1d0bda9cf
Summary:
Add ```CompressionOptions``` to ```CompressedSecondaryCacheOptions``` to allow users to set options such as compression level. It allows performance to be fine tuned.
Tests -
Run db_bench and verify compression options in the LOG file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234
Reviewed By: ajkr
Differential Revision: D52758133
Pulled By: anand1976
fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12208.
After all the SSTs have been deleted, all the blob files will become unreferenced.
These files should be considered obsolete and thus, should not be saved to the vstorage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12235
Reviewed By: jowlyzhang
Differential Revision: D52806441
Pulled By: ltamasi
fbshipit-source-id: 62f94d4f2544ed2822c764d8ace5bf7f57efe42d
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236
Reviewed By: cbi42
Differential Revision: D52765685
Pulled By: ajkr
fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
Summary:
### Summary: The sst_dump tool occur IO Error when reading data in PlainTable, as shown in the follow
```bash
❯ ./sst_dump --file=/tmp/write_example --command=scan --show_properties --verify_checksum
options.env is 0x60000282dc00
Process /tmp/write_example/001630.sst
Sst file format: plain table
/tmp/filepicker_example/001630.sst: IO error: While pread offset 0 len 758: /tmp/filepicker_example/001630.sst: Bad address
Process /tmp/filepicker_example/001624.sst
```
#### Reason
The root cause is that `fopts.use_mmap_reads` is false, `NewRandomAccessFile` will produce an `PosixRandomAccessFile` file. but `soptions_.use_mmap_reads` is true, This will result in unexpected calls in the `MmapDataIfNeeded` function.
```c++
Status SstFileDumper::GetTableReader(const std::string& file_path) {
...
if (s.ok()) {
if (magic_number == kPlainTableMagicNumber ||
magic_number == kLegacyPlainTableMagicNumber ||
magic_number == kCuckooTableMagicNumber) {
soptions_.use_mmap_reads = true;
...
// WARN: fopts.use_mmap_reads is false
fs->NewRandomAccessFile(file_path, fopts, &file, nullptr);
file_.reset(new RandomAccessFileReader(std::move(file), file_path));
}
...
}
if (s.ok()) {
// soptions_.use_mmap_reads is true
s = NewTableReader(ioptions_, soptions_, internal_comparator_, file_size,
&table_reader_);
}
return s;
}
```
The following read logic was executed on a `PosixRandomAccessFile` file, Eventually, `PosixRandomAccessFile::Read` will be called with a `nullptr` `scratch`
```c++
Status PlainTableReader::MmapDataIfNeeded() {
if (file_info_.is_mmap_mode) {
// Get mmapped memory.
// Executing the following logic on the PosixRandomAccessFile file is incorrect
return file_info_.file->Read(
IOOptions(), 0, static_cast<size_t>(file_size_), &file_info_.file_data,
nullptr, nullptr, Env::IO_TOTAL /* rate_limiter_priority */);
}
return Status::OK();
}
```
#### Fix:
When parsing PlainTable, set the variable `fopts.use_mmap_reads` equal `soptions_.use_mmap_reads`, When the `soptions_.use_mmap_reads` is true, `NewRandomAccessFile` will produce an `PosixMmapReadableFile` file. This will work correctly in the `MmapDataIfNeeded` function
```
❯ ./sst_dump --file=/tmp/write_example --command=scan --show_properties --verify_checksum
options.env is 0x6000009323e0
Process /tmp/write_example/001630.sst
Sst file format: plain table
from [] to []
'keys496' seq:0, type:1 => values1496
'keys497' seq:0, type:1 => values1497
'keys498' seq:0, type:1 => values1498
Table Properties:
------------------------------
# data blocks: 1
# entries: 3
# deletions: 0
# merge operands: 0
# range deletions: 0
raw key size: 45
raw average key size: 15.000000
raw value size: 42
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12223
Reviewed By: cbi42
Differential Revision: D52706238
Pulled By: ajkr
fbshipit-source-id: 2f9f518ec81d1cbde00bd65ab6bd304796836c0a
Summary:
In the current flow, the verification will pass and continue the test when db return non Ok(NotFound) status while expected state has pending writes.
fdfd044bb2/db_stress_tool/no_batched_ops_stress.cc (L2054-L2065)
We can just abort when such a db status is ever encountered. This can prevent follow up tests like `TestCheckpoint` and `TestBackupRestore` to consider such a key as existing in the db via the `ExpectedState::Exists` API. This could be a reason for some recent test failures in this path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12232
Reviewed By: cbi42
Differential Revision: D52737393
Pulled By: jowlyzhang
fbshipit-source-id: f2658c5332ccd42f6190783960e2dc6fcd81ccc5
Summary:
We test LockWAL() and UnlockWAL() by checking that latest sequence number is not changed: 1a1f9f1660/db_stress_tool/db_stress_test_base.cc (L920-L937). With writeprepared transaction, sequence number can be advanced in SwitchMemtable::WriteRecoverableState() when writing recoverable state: 1a1f9f1660/db/db_impl/db_impl_write.cc (L1560)
This PR disables LockWAL() tests for writeprepared transaction for now. We probably need to change how we test LockWAL() for writeprepared before re-enabling this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12221
Reviewed By: ajkr
Differential Revision: D52677076
Pulled By: cbi42
fbshipit-source-id: 27ee694878edf63e8f4ad52f769d4db401f511bc
Summary:
Similar to https://github.com/facebook/rocksdb/issues/11249 , we started to get failures from `TestGetEntity` when the User-defined-timestamp was enabled. Applying the same fix as the `TestGet`
_Scenario copied from #11249_
<table>
<tr>
<th>TestGet thread</th>
<th> A writing thread</th>
</tr>
<tr>
<td>read_opts.timestamp = GetNow()</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Lock key, do write</td>
</tr>
<tr>
<td>Lock key, read(read_opts) return NotFound</td>
<td></td>
</tr>
</table>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12222
Reviewed By: jowlyzhang
Differential Revision: D52678830
Pulled By: jaykorean
fbshipit-source-id: 6e154f67bb32968add8fea0b7ae7c4858ea64ee7
Summary:
- **Context**:
In ClipColumnFamily, the DeleteRange API will be used to delete data, and then CompactRange will be called for physical deletion. But now However, the ColumnFamilyHandle is not passed , so by default only the DefaultColumnFamily will be CompactRanged. Therefore, it may cause that the data in some sst files of CompactionRange cannot be physically deleted.
- **In this change**
Pass the ColumnFamilyHandle when call CompactRange
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12219
Reviewed By: ajkr
Differential Revision: D52665162
Pulled By: cbi42
fbshipit-source-id: e8e997aa25ec4ca40e347be89edc7e84a7a0edce
Summary:
Currently, when `block_cache_trace_analyzer` analyzes the cache miss ratio, it only analyzes the total miss ratio.
But it seems also important to analyze the cache miss ratio of each caller. To achieve this, we can calculate and print the miss ratio of each caller in the analyzer.
## Before modification
```
Running for 1 seconds: Processed 85732 records/second. Trace duration 58 seconds. Observed miss ratio 7.97
```
## After modification
```
Running for 1 seconds: Processed 85732 records/second. Trace duration 58 seconds. Observed miss ratio 7.97
Caller Get: Observed miss ratio 6.31
Caller Iterator: Observed miss ratio 11.86
***************************************************************
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10823
Reviewed By: ajkr
Differential Revision: D52632764
Pulled By: hx235
fbshipit-source-id: 40994d6039b73dc38fe78ea1b4adce187bb98909
Summary:
This feature combination is not fully working yet. Disable them so the stress tests have less noise.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12218
Reviewed By: cbi42
Differential Revision: D52643957
Pulled By: jowlyzhang
fbshipit-source-id: 8815a18a3b5814cad4f7ec41f3fb94869302081e
Summary:
This should print more helpful message when a non-ok status like Corruption is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12217
Test Plan: CI passes.
Reviewed By: jaykorean
Differential Revision: D52637595
Pulled By: cbi42
fbshipit-source-id: e810eeb4cba633d4d4c5d198da4468995e4ed427
Summary:
Fix heap use after free error in FilePrefetchBuffer
Fix heap use after free error in FilePrefetchBuffer
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12211
Test Plan:
Ran db_stress in ASAN mode
```
==652957==ERROR: AddressSanitizer: heap-use-after-free on address 0x6150006d8578 at pc 0x7f91f74ae85b bp 0x7f91c25f90c0 sp 0x7f91c25f90b8
READ of size 8 at 0x6150006d8578 thread T48
#0 0x7f91f74ae85a in void __gnu_cxx::new_allocator<rocksdb::BufferInfo*>::construct<rocksdb::BufferInfo*, rocksdb::BufferInfo*&>(rocksdb::BufferInfo**, rocksdb::BufferInfo*&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/ext/new_allocator.h:163
https://github.com/facebook/rocksdb/issues/1 0x7f91f74ae85a in void std::allocator_traits<std::allocator<rocksdb::BufferInfo*> >::construct<rocksdb::BufferInfo*, rocksdb::BufferInfo*&>(std::allocator<rocksdb::BufferInfo*>&, rocksdb::BufferInfo**, rocksdb::BufferInfo*&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/bits/alloc_traits.h:512
https://github.com/facebook/rocksdb/issues/2 0x7f91f74ae85a in rocksdb::BufferInfo*& std::deque<rocksdb::BufferInfo*, std::allocator<rocksdb::BufferInfo*> >::emplace_back<rocksdb::BufferInfo*&>(rocksdb::BufferInfo*&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/bits/deque.tcc:170
https://github.com/facebook/rocksdb/issues/3 0x7f91f74b93d8 in rocksdb::FilePrefetchBuffer::FreeAllBuffers() file/file_prefetch_buffer.h:557
```
Reviewed By: ajkr
Differential Revision: D52575217
Pulled By: akankshamahajan15
fbshipit-source-id: 6811ec10a393f5a62fedaff0fab5fd6e823c2687
Summary:
We often need to read the table properties of an SST file when taking a backup. However, we currently do not check checksums for this step, and even with that enabled, we ignore failures. This change ensures we fail creating a backup if corruption is detected in that step of reading table properties.
To get this working properly (with existing unit tests), we also add some temperature handling logic like already exists in
BackupEngineImpl::ReadFileAndComputeChecksum and elsewhere in BackupEngine. Also, SstFileDumper needed a fix to its error handling logic.
This was originally intended to help diagnose some mysterious failures (apparent corruptions) seen in taking backups in the crash test, though that is now fixed in https://github.com/facebook/rocksdb/pull/12206
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12200
Test Plan: unit test added that corrupts table properties, along with existing tests
Reviewed By: ajkr
Differential Revision: D52520674
Pulled By: pdillinger
fbshipit-source-id: 032cfc0791428f3b8147d34c7d424ab128e28f42
Summary:
Summary - Refactor FilePrefetchBuffer code
- Implementation:
FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_.
If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data.
If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer.
If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed.
If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to
the caller.
- Merged Sync and Async code flow into one in FilePrefetchBuffer.
Test Plan -
- Crash test passed
- Unit tests
- Pending - Benchmarks
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097
Reviewed By: ajkr
Differential Revision: D51759552
Pulled By: akankshamahajan15
fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.
In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
* Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.
Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206
Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200
Reviewed By: akankshamahajan15
Differential Revision: D52551701
Pulled By: pdillinger
fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174
Test Plan: Unit test, and some manual verification
Reviewed By: ajkr
Differential Revision: D52451184
Pulled By: anand1976
fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83
Summary:
Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users
who have disabled their ttl or ttl > periodic_compaction_seconds.
Relate issue: https://github.com/facebook/rocksdb/issues/12165
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175
Reviewed By: ajkr
Differential Revision: D52446722
Pulled By: cbi42
fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
Summary:
When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187
Test Plan: * New unit tests
Reviewed By: ajkr
Differential Revision: D52437194
Pulled By: cbi42
fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
Summary:
Currently, some numbers in the `tracer_analyzer_tool` may be a little confusing and unfriendly for people who want to add new query types.
It may be better to replace them with the existing enumeration type to improve readability.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10827
Reviewed By: ajkr
Differential Revision: D40576023
Pulled By: hx235
fbshipit-source-id: 0eb16820a15f365d53e848a3a8efd92928420429
Summary:
Now that `level_compaction_dynamic_level_bytes`'s default value is true, users who do not touch that setting and use non-leveled compaction will also see this log message. It can be info level rather than warning since, in the case mentioned, there is nothing the user needs to be warned about.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12186
Reviewed By: cbi42
Differential Revision: D52422499
Pulled By: ajkr
fbshipit-source-id: 8dbfcd102aab671b881ba047fb4a0a555b3e0a78
Summary:
The hardcoded nullptr argument for SystemClock to PERF_CPU_TIMER_GUARD ignored any SystemClock instance provided by the env; this was probably an oversight.
In practice, the defaulting SystemClock could lead to excessive `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` syscalls if `report_bg_io_stats=true` which cannot be mitigated by the embedder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12180
Reviewed By: hx235
Differential Revision: D52421750
Pulled By: ajkr
fbshipit-source-id: 92f8a93cebe9f8030ea5f6c3bf35398078e6bdfe
Summary:
**Description**
This PR passes along the native `LiveFileMetaData#file_checksum` field from the C++ class to the Java API as a copied byte array. If there is no file checksum generator factory set beforehand, then the array will empty. Please advise if you'd rather it be null - an empty array means one extra allocation, but it avoids possible null pointer exceptions.
> **Note**
> This functionality complements but does not supersede https://github.com/facebook/rocksdb/issues/11736
It's outside the scope here to add support for Java based `FileChecksumGenFactory` implementations. As a workaround, users can already use the built-in one by creating their initial `DBOptions` via properties:
```java
final Properties props = new Properties();
props.put("file_checksum_gen_factory", "FileChecksumGenCrc32cFactory");
try (final DBOptions dbOptions = DBOptions.getDBOptionsFromProps(props);
final ColumnFamilyOptions cfOptions = new ColumnFamilyOptions();
final Options options = new Options(dbOptions, cfOptions).setCreateIfMissing(true)) {
// do stuff
}
```
I wanted to add a better test, but unfortunately there's no available CRC32C implementation available in Java 8 without adding a dependency or adding a JNI helper for RocksDB's own implementation (or bumping the minimum version for tests to Java 9). That said, I understand the test is rather poor, so happy to change it to whatever you'd like.
**Context**
To give some context, we replicate RocksDB checkpoints to other nodes. Part of this is verifying the integrity of each file during replication. With a large enough RocksDB, computing the checksum ourselves is prohibitively expensive. Since SST files comprise the bulk of the data, we'd much rather delegate this to RocksDB on file write, and read it back after to compare.
It's likely we will provide a follow up to read the file checksum list directly from the manifest without having to open the DB, but this was the easiest first step to get it working for us.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11770
Reviewed By: hx235
Differential Revision: D52420729
Pulled By: ajkr
fbshipit-source-id: a873de35a48aaf315e125733091cd221a97b9073
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test:
```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176
Test Plan: unit test added
Reviewed By: cbi42
Differential Revision: D52417071
Pulled By: pdillinger
fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5