Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11631 introduced an undesired fallback behavior to RocksDB internal prefetching even when FS prefetching return non-OK status other than "Unsupported". We only want to fall back when FS prefetching is not supported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11897
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D49667055
Pulled By: hx235
fbshipit-source-id: fa36e4e5d6dc9507080217035f9d6ff8e4abda28
Summary:
Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity.
With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.
Tests:
New unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873
Reviewed By: akankshamahajan15
Differential Revision: D49562250
Pulled By: anand1976
fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
Summary:
With atomic_flush=true, a flush job with younger memtables wait for older memtables to be installed before install its memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can becomes stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to background error but never rollback its memtable: 269478ee46/db/db_impl/db_impl_compaction_flush.cc (L725) This prevents any future flush to pick old memtables.
A more detailed repro is in unit test.
This PR fixes this issue by
1. Ensure we rollback memtables if an atomic flush fails due to background error
2. When there is a background error, abort atomic flushes that are waiting for older memtables to be installed
3. Do not schedule non-recovery flushes when there is a background error that stops background work
There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11872
Test Plan: new unit test.
Reviewed By: jowlyzhang
Differential Revision: D49556867
Pulled By: cbi42
fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)
This PR fixes the issue by rollback these pending flushes.
There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.
There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
Summary:
This PR makes disabling the compressed secondary cache by setting capacity to 0 a bit more efficient. Previously, inserts/lookups would go to the backing LRUCache before getting rejected due to 0 capacity. With this change, insert/lookup would return from ```CompressedSecondaryCache``` itself.
Tests:
Existing tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11863
Reviewed By: akankshamahajan15
Differential Revision: D49476248
Pulled By: anand1976
fbshipit-source-id: f0f17a5e3df7d8bfc06709f8f23c1302056ba590
Summary:
When auto_readahead_size is enabled in async_io, during seek, first buffer will prefetch the data - (current block + readahead till upper_bound). There can be cases where
1. first buffer prefetched all the data till upper bound, or
2. first buffer already has the data from prev seek call
and second buffer prefetch further leading to alignment issues.
This PR fixes that assertion and second buffer won't go for prefetching if first buffer has already prefetched till upper_bound.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11852
Test Plan:
- Added new unit test that failed without this fix.
- crash tests passed locally
Reviewed By: pdillinger
Differential Revision: D49384138
Pulled By: akankshamahajan15
fbshipit-source-id: 54417e909e4d986f1e5a17dbaea059cd4962fd4d
Summary:
**Summary:**
When row cache hits and a timestamp is being set in read_options, even though ROW_CACHE entry is hit, the return status is kNotFound.
**Cause of error:**
If timestamp is provided in readoptions, a callback for sequence number checking is registered [here](8fc78a3a9e/db/db_impl/db_impl.cc (L2112)).
Hence the default value set at this [line](694e49cbb1/table/get_context.cc (L611)) prevents get_context from saving value found in cache. Causing the final status to be kNotFound even though the entry exist in both cache and SST file.
**Proposed Solution**
Row cache key contains a sequence number in it. If the key for row cache lookup matches the key in cache, this cache entry should be good to be exposed to user and hence we reuse the sequence number in cache key rather than passing kMaxSequenceNumber.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11816
Reviewed By: ajkr
Differential Revision: D49419029
Pulled By: jowlyzhang
fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply(). This is done by passing in a [callback](ba59751430/db/version_set.h (L1199)) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11764
Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released
Reviewed By: ajkr
Differential Revision: D48742152
Pulled By: cbi42
fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
Summary:
Update the logic in FilePrefetchBuffer to update `upper_bound_offset_` during reseek. During Reseek, `iterate_upper_bound` can be changed dynamically. So added an API to update that in FilePrefetchBuffer.
Added unit test to confirm the behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11775
Test Plan:
- Check stress tests in case there is any failure after this diff.
- make crash_test -j32 with auto_readahead_size=1 passed locally
Reviewed By: anand1976
Differential Revision: D48815177
Pulled By: akankshamahajan15
fbshipit-source-id: 5f44fbb3af06c86a1c38f139c5fa4543891837f4
Summary:
When `MultiGet` acquires `SuperVersion` via locking the db mutex and get the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done.
It's currently doing this:
`MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()`
This operates on the most recent `SuperVersion` without locking db mutex , which is not thread safe by itself. And this unref operation is intended for the originally acquired `SuperVersion` instead of the current one. Because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`, `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` had this `MultiGet` call be its last reference holder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830
Test Plan:
`make all check`
Added a unit test that would originally fail
Reviewed By: ltamasi
Differential Revision: D49287715
Pulled By: jowlyzhang
fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0
Summary:
During Seek, the iterator seeks every file on L0. In async_io, it submit the requests to seek on every file on L0 asynchronously using RocksDB FilePrefetchBuffer.
However, FilePrefetchBuffer does alignment and reads extra bytes then needed that can increase the throughput.
In case of non direct io, the alignment can be avoided.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11793
Test Plan:
- Added a unit test that fails without this PR.
- make crash_test -j32 completed successfully
Reviewed By: anand1976
Differential Revision: D48985051
Pulled By: akankshamahajan15
fbshipit-source-id: 2d130a9e7c3df9c4fcd0408406e6277ab75a4389
Summary:
This should only affect iterator when
- user uses DeleteRange(),
- An iterator from level L has a non-ok status (such non-ok status may not be caught before the bug fix in https://github.com/facebook/rocksdb/pull/11783), and
- A range tombstone covers a key from level > L and triggers a reseek sets the status_ to OK in SeekImpl()/SeekPrevImpl() e.g. bd6a8340c3/table/merging_iterator.cc (L801)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11786
Differential Revision: D48908830
Pulled By: cbi42
fbshipit-source-id: eb564be375af4e33dc27542eff753260186e6d5d
Summary:
This happens in (Compaction)MergingIterator layer, and can cause data loss during compaction or read/scan return incorrect result
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11782
Reviewed By: ajkr
Differential Revision: D48880575
Pulled By: cbi42
fbshipit-source-id: 2294ad284a6d653d3674bebe55380f12ee4b645b
Summary:
For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757
Test Plan:
make all check
for changes in db_impl.cc and db_impl_compaction_flush.cc
for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually.
Reviewed By: ltamasi
Differential Revision: D48668048
Pulled By: jowlyzhang
fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8
Summary:
When `num_levels` > 65, we may be shifting more than 63 bits in FileTtlBooster. This can give errors like: `runtime error: shift exponent 98 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')`. This PR makes a quick fix for this issue by taking a min in the shifting component. This issue should be rare since it requires a user using a large `num_levels`. I'll follow up with a more complex fix if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11673
Test Plan: * Add a unit test that produce the above error before this PR. Need to compile it with ubsan: `COMPILE_WITH_UBSAN=1 OPT="-fsanitize-blacklist=.circleci/ubsan_suppression_list.txt" ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 compaction_picker_test`
Reviewed By: hx235
Differential Revision: D48074386
Pulled By: cbi42
fbshipit-source-id: 25e59df7e93f20e0793cffb941de70ac815d9392
Summary:
Updated the main branch for the 8.5.fb branch cut. Also made unreleased_history/release.sh backdate to the last commit instead of the current date in case the release manager is a laggard like myself.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11642
Reviewed By: cbi42
Differential Revision: D47783574
Pulled By: ajkr
fbshipit-source-id: 4e2a80f5ccd542dc7dd0d22dfd7e59cb136325a1
Summary:
Fix use_after_free bug in async_io MultiReads when underlying FS enabled kFSBuffer. kFSBuffer is when underlying FS pass their own buffer instead of using RocksDB scratch in FSReadRequest
Since it's an experimental feature, added a hack for now to fix the bug.
Planning to make public API change to remove const from the callback as it doesn't make sense to use const.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11645
Test Plan: tested locally
Reviewed By: ltamasi
Differential Revision: D47819907
Pulled By: akankshamahajan15
fbshipit-source-id: 1faf5ef795bf27e2b3a60960374d91274931df8d
Summary:
When num_file_reads_for_auto_readahead = 1, during seek, it would go for prefetchingextra data in second buffer along with seek data, that would lead to increase in read data and
discarded bytes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11560
Test Plan: Added unit test
Reviewed By: anand1976
Differential Revision: D47008102
Pulled By: akankshamahajan15
fbshipit-source-id: 566c6131cb5f968d5efb81fd0ab233ff7e534ab0
Summary:
Use another static object to join threads instead.
This change is motivated by a case in which some code using NewLRUCache() -> ShardedCacheBase -> SemiStructuredUniqueIdGen -> GenerateRawUniqueId() -> Env::Default() was happening
during static destruction.
I didn't see anything else in PosixEnv or base classes that would cause a problem by not
destroying. (WinEnv is already not destroyed; see env_default.cc)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11538UndefinedBehaviorSanitizer: undefined-behavior env/env_test.cc:3561:23 in
$
```
Test Plan:
test added, which would previously fail with UBSAN:
```
$ ./env_test --gtest_filter=*Destruct*
Note: Google Test filter = *Destruct*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EnvTestMisc
[ RUN ] EnvTestMisc.StaticDestruction
[ OK ] EnvTestMisc.StaticDestruction (0 ms)
[----------] 1 test from EnvTestMisc (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[ PASSED ] 1 test.
env/env_test.cc:3561:23: runtime error: member call on address 0x7f7b96671ca8 which does not point to an object of type 'rocksdb::Env'
0x7f7b96671ca8: note: object is of type 'N7rocksdb12ConfigurableE'
00 00 00 00 90 a7 f7 95 7b 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^~~~~~~~~~~~~~~~~~~~~~~
vptr for 'N7rocksdb12ConfigurableE'
Reviewed By: jowlyzhang
Differential Revision: D46737389
Pulled By: pdillinger
fbshipit-source-id: 0f80a443bf799ffc5641e898cf3a75f7d10a987b
Summary:
Fix a use-after-move issue in block.cc and added some unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11505
Test Plan:
```
make all check
./block_test
```
Reviewed By: ltamasi
Differential Revision: D46506188
Pulled By: jowlyzhang
fbshipit-source-id: 316ed8ddd221c00b2bce2cf9fd47eea686cd74a5
Summary:
See new NOTE in HISTORY.md and unreleased_history/README.txt
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11481
Test Plan: some manual testing on my CentOS 8 system
Reviewed By: jaykorean
Differential Revision: D46233342
Pulled By: pdillinger
fbshipit-source-id: daf59cf3dc907f450b469090dcc481a30a7d7c0d