Summary:
With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773
Test Plan:
Reproduce: D39993203 has detailed steps.
Run the test with and without the fix.
Reviewed By: cbi42
Differential Revision: D40077955
Pulled By: cbi42
fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240
Summary:
fix for https://github.com/facebook/rocksdb/issues/10752 where RocksDB could be in an infinite compaction loop (with compaction reason kBottommostFiles) if allow_ingest_behind is enabled and the bottommost level is unfilled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767
Test Plan: Added a unit test to reproduce the compaction loop.
Reviewed By: ajkr
Differential Revision: D40031861
Pulled By: ajkr
fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b
Summary:
Fix a bug in Iterator::Refresh() where the local SV it obtained could be obsolete upon return, and should be cleaned up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10770
Test Plan: added a unit test to reproduce the issue.
Reviewed By: ajkr
Differential Revision: D40063809
Pulled By: ajkr
fbshipit-source-id: 619e728eb0f1ac9540b4d0ad38e43acc37a514b2
Summary:
When `FlushOptions::wait` is set to false, manual flush should not stall forever.
If the database has already stopped writes, then the thread calling `DB::Flush()` with
`FlushOptions::wait=false` should not enter the `DBImpl::write_thread_`.
To prevent this, we should do a check at the beginning and return `TryAgain()`
Resolves: https://github.com/facebook/rocksdb/issues/9892
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10001
Reviewed By: siying
Differential Revision: D36422303
Pulled By: siying
fbshipit-source-id: 723bd3065e8edc4f17c82449d0d6b95a2381ac0a
Summary:
For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725
Reviewed By: ajkr
Differential Revision: D39771069
Pulled By: jay-zhuang
fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd
Summary:
This is intended as a step toward possibly separating secondary cache integration from the
Cache implementation as much as possible, to (hopefully) minimize code duplication in
adding secondary cache support to HyperClockCache.
* Major clarifications to API docs of secondary cache compatible parts of Cache. For example, previously the docs seemed to suggest that Wait() was not needed if IsReady()==true. And it wasn't clear what operations were actually supported on pending handles.
* Add some assertions related to these requirements, such as that we don't Release() before Wait() (which would leak a secondary cache handle).
* Fix a leaky abstraction with dummy handles, which are supposed to be internal to the Cache. Previously, these just used value=nullptr to indicate dummy handle, which meant that they could be confused with legitimate value=nullptr cases like cache reservations. Also fixed blob_source_test which was relying on this leaky abstraction.
* Drop "incomplete" terminology, which was another name for "pending".
* Split handle flags into "mutable" ones requiring mutex and "immutable" ones which do not. Because of single-threaded access to pending handles, the "Is Pending" flag can be in the "immutable" set. This allows removal of a TSAN work-around and removing a mutex acquire-release in IsReady().
* Remove some unnecessary handling of charges on handles of failed lookups. Keeping total_charge=0 means no special handling needed. (Removed one unnecessary mutex acquire/release.)
* Simplify handling of dummy handle in Lookup(). There is no need to explicitly Ref & Release w/Erase if we generally overwrite the dummy anyway. (Removed one mutex acquire/release, a call to Release().)
Intended follow-up:
* Clarify APIs in secondary_cache.h
* Doesn't SecondaryCacheResultHandle transfer ownership of the Value() on success (implementations should not release the value in destructor)?
* Does Wait() need to be called if IsReady() == true? (This would be different from Cache.)
* Do Value() and Size() have undefined behavior if IsReady() == false?
* Why have a custom API for what is essentially a std::future<std::pair<void*, size_t>>?
* Improve unit testing of standalone handle case
* Apparent null `e` bug in `free_standalone_handle` case
* Clean up secondary cache testing in lru_cache_test
* Why does TestSecondaryCacheResultHandle hold on to a Cache::Handle?
* Why does TestSecondaryCacheResultHandle::Wait() do nothing? Shouldn't it establish the post-condition IsReady() == true?
* (Assuming that is sorted out...) Shouldn't TestSecondaryCache::WaitAll simply wait on each handle in order (no casting required)? How about making that the default implementation?
* Why does TestSecondaryCacheResultHandle::Size() check Value() first? If the API is intended to be returning 0 before IsReady(), then that is weird but should at least be documented. Otherwise, if it's intended to be undefined behavior, we should assert IsReady().
* Consider replacing "standalone" and "dummy" entries with a single kind of "weak" entry that deletes its value when it reaches zero refs. Suppose you are using compressed secondary cache and have two iterators at similar places. It will probably common for one iterator to have standalone results pinned (out of cache) when the second iterator needs those same blocks and has to re-load them from secondary cache and duplicate the memory. Combining the dummy and the standalone should fix this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10730
Test Plan:
existing tests (minor update), and crash test with sanitizers and secondary cache
Performance test for any regressions in LRUCache (primary only):
Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test before & after (run at same time) with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X100] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=233000000 -duration 30 -threads=16
```
Before: readrandom [AVG 100 runs] : 22234 (± 63) ops/sec; 1.6 (± 0.0) MB/sec
After: readrandom [AVG 100 runs] : 22197 (± 64) ops/sec; 1.6 (± 0.0) MB/sec
That's within 0.2%, which is not significant by the confidence intervals.
Reviewed By: anand1976
Differential Revision: D39826010
Pulled By: anand1976
fbshipit-source-id: 3202b4a91f673231c97648ae070e502ae16b0f44
Summary:
Add new property "do_not_recurse" in IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files.
By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open.
Provided support in PosixFileSystem to use "do_not_recurse".
TestPlan:
- Existing tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668
Reviewed By: anand1976
Differential Revision: D39471683
Pulled By: akankshamahajan15
fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case.
| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|
Reviewed By: riversand963
Differential Revision: D39441192
Pulled By: cbi42
fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
- As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
Summary:
currently, there are places in compaction_picker where we add up `compensated_file_size` of files being compacted and limit the sum to be under `max_compaction_bytes`. `compensated_file_size` contains booster for point tombstones and should be used only for determining file's compaction priority. This PR replaces `compensated_file_size` with actual file size in such places.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10728
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D39789427
Pulled By: cbi42
fbshipit-source-id: 1f89fb6c0159c53bf01d8dc783f465959f442c81
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.
In level compaction, there are "wasted" data at the beginning and end of the
output level files. Align the file boundary can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to its
next level ones. It may cut file when the file size is large enough (at least
50% of target_file_size) and not too large (2x target_file_size).
db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432
# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds
# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```
The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
baseline this_PR
count 1.656000e+03 1.705000e+03
mean 3.116062e+07 3.028076e+07
std 7.145242e+06 8.046139e+06
```
The feature is enabled by default, to revert to the old behavior disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`
Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for
use case like user adding 2 or more non-overlapping data range at the same
time, it can reduce the overlapping of 2 datasets in the lower levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655
Reviewed By: cbi42
Differential Revision: D39552321
Pulled By: jay-zhuang
fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
Summary:
when a new internal iterator is constructed during iterator refresh, pointer to the previous memtable range tombstone iterator was not cleared. This could cause segfault for future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739
Test Plan: added a unit test in db_range_del_test.cc to reproduce this issue.
Reviewed By: ajkr, riversand963
Differential Revision: D39825283
Pulled By: cbi42
fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c
Summary:
Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724
Test Plan: make check
Reviewed By: cbi42
Differential Revision: D39756847
Pulled By: riversand963
fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876
Summary:
We have a lot of confusing code because of mixed, sometimes
completely opposite uses of of the term "raw block" or "raw contents",
sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
`raw_block_contents` and `raw_size` generally referred to uncompressed block
contents and size, while `WriteRawBlock` referred to writing a block that
is already compressed if it is going to be. Meanwhile, in
`BlockBasedTable`, `raw_block_contents` either referred to a (maybe
compressed) block with trailer, or a maybe compressed block maybe
without trailer. (Note: left as follow-up work to use C++ typing to
better sort out the various kinds of BlockContents.)
This change primarily tries to apply some consistent terminology around
the kinds of block representations, avoiding the unclear "raw". (Any
meaning of "raw" assumes some bias toward the storage layer or toward
the logical data layer.) Preferred terminology:
* **Serialized block** - bytes that go into storage. For block-based table
(usually the case) this includes the block trailer. WART: block `size` may or
may not include the trailer; need to be clear about whether it does or not.
* **Maybe compressed block** - like a serialized block, but without the
trailer (or no promise of including a trailer). Must be accompanied by a
CompressionType.
* **Uncompressed block** - "payload" bytes that are either stored with no
compression, used as input to compression function, or result of
decompression function.
* **Parsed block** - an in-memory form of a block in block cache, as it is
used by the table reader. Different C++ types are used depending on the
block type (see block_like_traits.h).
Other refactorings:
* Misc corrections/improvements of internal API comments
* Remove a few misleading / unhelpful / redundant comments.
* Use move semantics in some places to simplify contracts
* Use better parameter names to indicate which parameters are used for
outputs
* Remove some extraneous `extern`
* Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408
Test Plan: existing tests
Reviewed By: akankshamahajan15
Differential Revision: D38172617
Pulled By: pdillinger
fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
Summary:
Change the library order in PLATFORM_LDFLAGS to enable fbcode platform 10 build with folly. This PR also has a few fixes for platform 10 compiler errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10708
Test Plan:
ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_COROUTINES=1 make -j64 check
ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_FOLLY=1 make -j64 check
Reviewed By: ajkr
Differential Revision: D39666590
Pulled By: anand1976
fbshipit-source-id: 256a1127ef561399cd6299a6a392ca29bd68ca44
Summary:
when there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap memtable iterator directly. Currently we create a merging iterator on top of the memtable iterator, and have DBIter wrap around it. This causes iterator regression and this PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705
Test Plan:
- `make check`
- Performance:
- Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000`
- Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts`
```
seek_nexts main op/sec https://github.com/facebook/rocksdb/issues/10705 RocksDB v7.6
0 5746568 5749033 5786180
30 2411690 3006466 2837699
1000 102556 128902 124667
```
Reviewed By: ajkr
Differential Revision: D39644221
Pulled By: cbi42
fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d
Summary:
The background compaction may still running while the test end, which would cause ASAN stack-use-after-scope error.
Explicitly close the DB before test end.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10697
Test Plan:
able to reproduce with:
```
gtest-parallel ./compaction_service_test --gtest_filter=CompactionServiceTest.BasicCompactions -r 10000 -w 100
```
Reviewed By: gitbw95
Differential Revision: D39590974
Pulled By: jay-zhuang
fbshipit-source-id: da264b2e6a276afbda7d5ff7adb9d7b8d4213d90
Summary:
Fix invalid reference in MultiGet due to resizing of the ```batches``` autovector.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10702
Test Plan: Run asan crash test
Reviewed By: riversand963
Differential Revision: D39608753
Pulled By: anand1976
fbshipit-source-id: 7a9e7fc6f436f08eb22003d0e6b0e1e4dcdc1a2a
Summary:
The assertion in ```FilePickerMultiGet::ReplaceRange()``` was incorrect. The function should only be called to replace the range after finishing the search in the current level, which is indicated by ```hit_file_ == nullptr``` i.e no more overlapping files in this level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10695
Reviewed By: gitbw95
Differential Revision: D39583217
Pulled By: anand1976
fbshipit-source-id: d4cedfb2b62fb9f3a083e9848a403ae6342f0519
Summary:
This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
* We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
* We want to highlight the key feature: it's fast (especially under parallel load)
* Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
* We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
Some API detail:
* To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
* Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
For performance tests see https://github.com/facebook/rocksdb/pull/10626
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
Test Plan: no interesting functional changes; tests updated
Reviewed By: anand1976
Differential Revision: D39547800
Pulled By: pdillinger
fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
Summary:
The stats were not accurate for the coroutine version of MultiGet. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673
Reviewed By: akankshamahajan15
Differential Revision: D39492615
Pulled By: anand1976
fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3
Summary:
Fix a bug in the async IO/coroutine version of MultiGet that may cause a segfault or assertion failure due to accessing an invalid file index in a LevelFilesBrief. The bug is that when a MultiGetRange is split into two, we may re-process keys in the original range that were already marked to be skipped (in ```current_level_range_```) due to not overlapping the level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10688
Reviewed By: gitbw95
Differential Revision: D39556131
Pulled By: anand1976
fbshipit-source-id: 65e79438508a283cb19e64eca5c91d0714b81458
Summary:
When user-defined timestamp is enabled, subcompaction bounds should be set up properly. When creating InputIterator for the compaction, the `start` and `end` should have their timestamp portions set to kMaxTimestamp, which is the highest possible timestamp. This is similar to what we do with setting up their sequence numbers to `kMaxSequenceNumber`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10658
Test Plan:
```bash
make check
rm -rf /dev/shm/rocksdb/* && mkdir
/dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress
--allow_data_in_errors=True --clear_column_family_one_in=0
--continuous_verification_interval=0 --data_block_index_type=1
--db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
--delrangepercent=0
--expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
--iterpercent=0 --max_background_compactions=20
--max_bytes_for_level_base=10485760 --max_key=25000000
--max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
--ops_per_thread=300000 --paranoid_file_checks=1 --partition_filters=0
--prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
--snapshot_hold_ops=100000 --subcompactions=4
--target_file_size_base=65536 --target_file_size_multiplier=2
--test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
--user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
--write_buffer_size=65536 --writepercent=60 -disable_wal=1
-column_families=1
```
Reviewed By: akankshamahajan15
Differential Revision: D39393402
Pulled By: riversand963
fbshipit-source-id: f276e35b19fce51a175c368a502fb0718d1f3871
Summary:
fix a data race introduced in https://github.com/facebook/rocksdb/issues/10547 (P5295241720), first reported by pdillinger. The race is between the `std::atomic_load_explicit` in NewRangeTombstoneIteratorInternal and the `std::atomic_store_explicit` in MemTable::Add() that operate on `cached_range_tombstone_`. P5295241720 shows that `atomic_store_explicit` initializes some mutex which `atomic_load_explicit` could be trying to call `lock()` on at the same time. This fix moves the initialization to memtable constructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10680
Test Plan: `USE_CLANG=1 COMPILE_WITH_TSAN=1 make -j24 whitebox_crash_test`
Reviewed By: ajkr
Differential Revision: D39528696
Pulled By: cbi42
fbshipit-source-id: ee740841044438e18ad2b8ea567444dd542dd8e2
Summary:
The patch extends the iterator API with a new `columns` method which
can be used to retrieve all wide columns for the current key. Similarly to
the `Get` and `GetEntity` APIs, the classic `value` API returns the value
of the default (anonymous) column for wide-column entities, and `columns`
returns an entity with a single default column for plain old key-values.
(The goal here is to maintain the invariant that `value()` is the same as
the value of the default column in `columns()`.) The patch also involves a
smaller refactoring: historically, `value()` was implemented using a bunch
of conditions, that is, the `Slice` to be returned was decided based on the
direction of the iteration, whether a merge had been done etc. when the
method was called; with the patch, the value to be exposed is stored in a
member `Slice value_` when the iterator lands on a new key, and `value()`
simply returns this `Slice`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10670
Test Plan: Ran `make check` and a simple blackbox crash test.
Reviewed By: riversand963
Differential Revision: D39475551
Pulled By: ltamasi
fbshipit-source-id: 29e7a6ed9ef340841aab36803b832b7c8f668b0b
Summary:
Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (https://github.com/facebook/rocksdb/issues/10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.
The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (https://github.com/facebook/rocksdb/issues/10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547
Test Plan:
- TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used.
micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom.
| |fillrandom main | write path caching | read path caching |memtable V3 (https://github.com/facebook/rocksdb/issues/10308) | readrandom main | write path caching | read path caching |memtable V3 |
|--- |--- |--- |--- |--- | --- | --- | --- | --- |
| 0 |6.35 |6.15 |5.82 |6.12 |2.24 |2.26 |2.03 |2.07 |
| 1 |5.99 |5.88 |5.77 |6.28 |2.65 |2.27 |2.24 |2.5 |
| 10 |6.15 |6.02 |5.92 |5.95 |5.15 |2.61 |2.31 |2.53 |
| 100 |5.95 |5.78 |5.88 |6.23 |28.31 |2.34 |2.45 |2.94 |
| 100 25 threads |52.01 |45.85 |46.18 |47.52 |35.97 |3.34 |3.34 |3.56 |
| 1000 |6.0 |7.07 |5.98 |6.08 |333.18 |2.86 |2.7 |3.6 |
| 1000 25 threads |52.6 |148.86 |79.06 |45.52 |473.49 |3.66 |3.48 |4.38 |
- Benchmark performance of`readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`
readrandom micros/op:
| |main |write path caching |read path caching |memtable V3 |
|---|---|---|---|---|
| single thread |48.28 |1.55 |1.52 |1.96 |
| 25 threads |64.3 |2.55 |2.67 |2.64 |
Reviewed By: ajkr
Differential Revision: D38895410
Pulled By: cbi42
fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
Summary:
**Context/Summary:**
Previous experience with bugs and flaky tests taught us there exist features in RocksDB vulnerable to race condition caused by acquiring db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeup and sleep to cause acquiring db mutex at various timing in order to expose such race condition
**Testing:**
- `COERCE_CONTEXT_SWITCH=1 make -j56 check / make -j56 db_stress` should reveal
- flaky tests caused by db mutex related race condition
- Reverted https://github.com/facebook/rocksdb/pull/9528
- A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1` followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
- `COERCE_CONTEXT_SWITCH=1` can cause expected test failure (i.e, expose target TSAN data race error) within 10 run while the other couldn't.
- This proves our injection can expose flaky tests caused by db mutex related race condition faster.
- known or new race-condition-type of internal bug by continuously running this PR
- Performance
- High ops-threads time: COERCE_CONTEXT_SWITCH=1 regressed by 4 times slower (2:01.16 vs 0:22.10 elapsed ). This PR will be run as a separate CI job so this regression won't affect any existing job.
```
TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \
--write_buffer_size=524288 —target_file_size_base=524288 —ingest_external_file_one_in=100 —compact_files_one_in=1000 —compact_range_one_in=1000
```
- Start-up time: COERCE_CONTEXT_SWITCH=1 didn't regress by 25% (0:01.51 vs 0:01.29 elapsed)
```
TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10291
Reviewed By: ajkr
Differential Revision: D39231182
Pulled By: hx235
fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6
Summary:
Historically, `BlobFileReader` has returned the blob(s) read from the file
in the `PinnableSlice` provided by the client. This interface was
preserved when caching was implemented for blobs, which meant that
the blob data was copied multiple times when caching was in use: first,
into the client-provided `PinnableSlice` (by `BlobFileReader::SaveValue`),
and then, into the object stored in the cache (by `BlobSource::PutBlobIntoCache`).
The patch eliminates these copies and the related allocations by changing
`BlobFileReader` so it returns its results in the form of heap-allocated `BlobContents`
objects that can be directly inserted into the cache. The allocations backing
these `BlobContents` objects are made using the blob cache's allocator if the
blobs are to be inserted into the cache (i.e. if a cache is configured and the
`fill_cache` read option is set). Note: this PR focuses on the common case when
blobs are compressed; some further small optimizations are possible for uncompressed
blobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10647
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D39335185
Pulled By: ltamasi
fbshipit-source-id: 464503d60a5520d654c8273ffb8efd5d1bcd7b36
Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in and with an extra pass at DB::Open
time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
check unique ID against manifest every time an SST file is opened through
table cache (normal DB operations), replacing the explicit pass over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.
One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.
Implementation details:
* `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
that is now removed.
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) default checking on, and
(b) checking on SST open even without DB::Open (e.g. on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`
Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of NewTableReader was recently added, in
the public API nonetheless (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532
Test Plan:
updated unit tests
Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec
Reviewed By: jay-zhuang
Differential Revision: D38765551
Pulled By: pdillinger
fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
Summary:
**Summary:**
When a block is firstly `Lookup` from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again, we erase it from the secondary cache and add it into the primary cache.
When a block is firstly evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.
**Implementation Details**
Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0. (refs >= 1 && in_cache == false && IS_STANDALONE == true)
The behaviors of `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache:
1. If a handle is found in primary cache:
1.1. If the handle's value is not nullptr, it is returned immediately.
1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
- 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
- 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
2. If a handle is not found in primary cache:
2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
2.2. If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block) and return a standalone handle.
The behaviors of `LRUCacheShard::Promote()` are updated as follows:
1. If `e->sec_handle` has value, one of the following steps can happen:
1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
1.2. Insert the item into the primary cache and return the handle to caller.
1.3. Exception handling.
3. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released.
The behavior of `CompressedSecondaryCache::Insert()` is updated:
1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
4. If a dummy item is found for a block, the block is inserted into the secondary cache.
The behavior of `CompressedSecondaryCache:::Lookup()` is updated:
1. If a handle is not found or it is a dummy item, a nullptr is returned.
2. If `erase_handle` is true, the handle is erased.
The behaviors of `LRUCacheShard::Release()` are adjusted for the standalone handles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527
Test Plan:
1. stress tests.
5. unit tests.
6. CPU profiling for db_bench.
Reviewed By: siying
Differential Revision: D38747613
Pulled By: gitbw95
fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
Summary:
Example failure:
```
db/blob/db_blob_basic_test.cc:226: Failure
Expected equality of these values:
i
Which is: 1
num_blobs
Which is: 5
```
I can't repro locally, but it looks like the 2KB cache is too small to guarantee no eviction happens between loading all the data into cache and reading from `kBlockCacheTier`. This 2KB setting appears to have come from a test where the cached entries are pinned, where it makes sense to have a small setting. However, such a small setting makes less sense when the blocks are evictable but must remain cached per the test's expectation. This PR increases the capacity setting to 2MB for those cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10636
Reviewed By: cbi42
Differential Revision: D39250976
Pulled By: ajkr
fbshipit-source-id: 769309f9a19cfac20b67b927805c8df5c1d2d1f5
Summary:
Example flake: https://app.circleci.com/pipelines/github/facebook/rocksdb/17660/workflows/7a891875-f07b-4a67-b204-eaa7ca9f9aa2/jobs/467496
The test could get stuck in out-of-space due to a callback executing `SetFilesystemActive(false /* active */)` after the test executed `SetFilesystemActive(true /* active */)`. This could happen because background info logging went through the SyncPoint callback "WritableFileWriter::Append:BeforePrepareWrite", probably unintentionally. The solution of this PR is to call `ClearAllCallBacks()` to wait for any such pending callbacks to drain before calling `SetFilesystemActive(true /* active */)`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10642
Reviewed By: cbi42
Differential Revision: D39265381
Pulled By: ajkr
fbshipit-source-id: 9a2f4916ab19726c8fb4b3a3b590b1b9ed93de1b
Summary:
Previously, automatic compaction could be triggered prior to the test invoking CompactRange(). It could lead to the following flaky failure:
```
/root/project/db/db_block_cache_test.cc:753: Failure
Expected equality of these values:
1 + kNumBlocks
Which is: 11
options.statistics->getTickerCount(BLOCK_CACHE_INDEX_ADD)
Which is: 10
```
A sequence leading to this failure was:
* Automatic compaction
* files [1] [2] trivially moved
* files [3] [4] [5] [6] trivially moved
* CompactRange()
* files [7] [8] [9] trivially moved
* file [10] trivially moved
In such a case, the index/filter block adds that the test expected did not happen since there were no new files.
This PR just tweaks settings to ensure the `CompactRange()` produces one new file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10635
Reviewed By: cbi42
Differential Revision: D39250869
Pulled By: ajkr
fbshipit-source-id: a3c94c49069e28c49c40b4b80dae0059739d19fd
Summary:
Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and check with `RangeDelAggregator`.
With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator:
- in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
- in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.
This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.
One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap is covered by the range tombstone's internal key range `[start, end)`.
Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file are iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advance the file iterator once the sentinel key is processed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449
Test Plan:
- Added many unit tests in db_range_del_test
- Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
- Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.
```
python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
```
- Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written". As expected, the performance with this PR depends on the range tombstone width.
```
# Setup:
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
# Scan entire DB
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
# Short range scan (10 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
# Long range scan(1000 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
```
Avg over of 10 runs (some slower tests had fews runs):
For the first column (tombstone), 0 means no range tombstone, 100-10000 means width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1 tombstone case is to test regression when there's very few range tombstones in the DB, as no range tombstone is likely to take a different code path than with range tombstones.
- Scan entire DB
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |2525600 (± 43564) |2486917 (± 33698) |-1.53% |
| 100 |1853835 (± 24736) |2073884 (± 32176) |+11.87% |
| 1000 |422415 (± 7466) |1115801 (± 22781) |+164.15% |
| 10000 |22384 (± 227) |227919 (± 6647) |+918.22% |
| 1 range tombstone |2176540 (± 39050) |2434954 (± 24563) |+11.87% |
- Short range scan
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |35398 (± 533) |35338 (± 569) |-0.17% |
| 100 |28276 (± 664) |31684 (± 331) |+12.05% |
| 1000 |7637 (± 77) |25422 (± 277) |+232.88% |
| 10000 |1367 |28667 |+1997.07% |
| 1 range tombstone |32618 (± 581) |32748 (± 506) |+0.4% |
- Long range scan
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |2262 (± 33) |2353 (± 20) |+4.02% |
| 100 |1696 (± 26) |1926 (± 18) |+13.56% |
| 1000 |410 (± 6) |1255 (± 29) |+206.1% |
| 10000 |25 |414 |+1556.0% |
| 1 range tombstone |1957 (± 30) |2185 (± 44) |+11.65% |
- Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61
Reviewed By: ajkr
Differential Revision: D38450331
Pulled By: cbi42
fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
Summary:
With the current code, when a blob isn't found in the cache and gets read
from the blob file and then inserted into the cache, the application gets
passed the self-contained `PinnableSlice` resulting from the blob file read.
The patch changes this so that the `PinnableSlice` pins the cache entry
instead in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10625
Test Plan: `make check`
Reviewed By: pdillinger
Differential Revision: D39220904
Pulled By: ltamasi
fbshipit-source-id: cb9c62881e3523b1e9f614e00bf503bac2fe3b0a
Summary:
This stat was only getting updated in the async (coroutine) version of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10622
Reviewed By: akankshamahajan15
Differential Revision: D39188790
Pulled By: anand1976
fbshipit-source-id: 7e231507f65fc94c8a006c38f79dfba182a2c24a
Summary:
sanity check value for option `memtable_protection_bytes_per_key` in `ColumnFamilyData::ValidateOptions()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10621
Test Plan: `make check`, added unit test in ColumnFamilyTest.
Reviewed By: ajkr
Differential Revision: D39180133
Pulled By: cbi42
fbshipit-source-id: 009e0da3ccb332d1c9e14d20193304610bd4eb8a
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.
This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.
The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```
```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```
**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
- Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
- `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573
Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.
Reviewed By: ajkr
Differential Revision: D39005886
Pulled By: hx235
fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
Summary:
The patch adds a dedicated cache entry role for blob values and switches
to a registered deleter so that blobs show up as a separate bucket
(as opposed to "Misc") in the cache occupancy statistics, e.g.
```
Block cache entry stats(count,size,portion): DataBlock(133515,531.73 MB,13.6866%) BlobValue(1824855,3.10 GB,81.7071%) Misc(1,0.00 KB,0%)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10601
Test Plan: Ran `make check` and tested the cache occupancy statistics using `db_bench`.
Reviewed By: riversand963
Differential Revision: D39107915
Pulled By: ltamasi
fbshipit-source-id: 8446c3b190a41a144030df73f318eeda4398c125
Summary:
Some APIs for getting live files, which are used by Checkpoint
and BackupEngine, can optionally trigger and wait for a flush. These
would deadlock when used on a read-only DB. Here we fix that by assuming
the user wants the overall operation to succeed and is OK without
flushing (because the DB is read-only).
Follow-up work: the same or other issues can be hit by directly invoking
some DB functions that are clearly not appropriate for read-only
instance, but are not covered by overrides in DBImplReadOnly and
CompactedDBImpl. These should be fixed to avoid similar problems on
accidental misuse. (Long term, it would be nice to have a DBReadOnly
class without those members, like BackupEngineReadOnly.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10569
Test Plan: tests updated to catch regression (hang before the fix)
Reviewed By: riversand963
Differential Revision: D38995759
Pulled By: pdillinger
fbshipit-source-id: f5f8bc7123e13cb45bd393dd974d7d6eda20bc68
Summary:
The patch improves the bookkeeping around the memory usage of
cached blobs in two ways: 1) it uses `malloc_usable_size`, which accounts
for allocator bin sizes etc., and 2) it also considers the memory usage
of the `BlobContents` object in addition to the blob itself. Note: some unit
tests had been relying on the cache charge being equal to the size of the
cached blob; these were updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10583
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D39060680
Pulled By: ltamasi
fbshipit-source-id: 3583adce2b4ce6e84861f3fadccbfd2e5a3cc482
Summary:
Timer has a limitation that it cannot re-register a task with the same name,
because the cancel only mark the task as invalid and wait for the Timer thread
to clean it up later, before the task is cleaned up, the same task name cannot
be added. Which makes the task option update likely to fail, which basically
cancel and re-register the same task name. Change the periodic task name to a
random unique id and store it in periodic_task_scheduler.
Also refactor the `periodic_work` to `periodic_task` to make each job function
as a `task`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10379
Test Plan: unittests
Reviewed By: ajkr
Differential Revision: D38000615
Pulled By: jay-zhuang
fbshipit-source-id: e4135f9422e3b53aaec8eda54f4e18ce633a279e
Summary:
The patch introduces a new class called `BlobContents`, which represents
a single uncompressed blob value. We currently use `std::string` for this
purpose; `BlobContents` is somewhat smaller but the primary reason for a
dedicated class is that it enables certain improvements and optimizations
like eliding a copy when inserting a blob into the cache, using custom
allocators, or more control over and better accounting of the memory usage
of cached blobs (see https://github.com/facebook/rocksdb/issues/10484).
(We plan to implement these in subsequent PRs.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10571
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D39000965
Pulled By: ltamasi
fbshipit-source-id: f296eddf9dec4fc3e11cad525b462bdf63c78f96
Summary:
WAL append and switch can both happen between `FlushWAL(true /* sync */)`'s sync operations and its call to `MarkLogsSynced()`. We permit this since locks need to be released for the sync operations. Such an appended/switched WAL is both inactive and incompletely synced at the time `MarkLogsSynced()` processes it.
Prior to this PR, `MarkLogsSynced()` assumed all inactive WALs were fully synced and removed them from consideration for future syncs. That was wrong in the scenario described above and led to the latest append(s) never being synced. This PR changes `MarkLogsSynced()` to only remove inactive WALs from consideration for which all flushed data has been synced.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10560
Test Plan: repro unit test for the scenario described above. Without this PR, it fails on "key2" not found
Reviewed By: riversand963
Differential Revision: D38957391
Pulled By: ajkr
fbshipit-source-id: da77175eba97ff251a4219b227b3bb2d4843ed26
Summary:
In UniversalCompactionBuilder::PickCompactionToReduceSortedRuns, we passed start_level to get compression type and options. I think that is wrong and we should use output_level instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10515
Reviewed By: hx235
Differential Revision: D38611335
Pulled By: ajkr
fbshipit-source-id: bb860caed4b6c6bbde8f75fc50cf875a9f04723d
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535
Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO
No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom : 7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations; 571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom : 7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations; 568.2 MB/s (8271992 of 8271992 found)`
For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom : 5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations; 797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom : 5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations; 790.7 MB/s (11649992 of 11649992 found) `
Reviewed By: akankshamahajan15
Differential Revision: D38774009
Pulled By: anand1976
fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
Summary:
The patch adds a new API `GetEntity` that can be used to perform
wide-column point lookups. It also extends the `Get` code path and
the `MemTable` / `MemTableList` and `Version` / `GetContext` logic
accordingly so that wide-column entities can be served from both
memtables and SSTs. If the result of a lookup is a wide-column entity
(`kTypeWideColumnEntity`), it is passed to the application in deserialized
form; if it is a plain old key-value (`kTypeValue`), it is presented as a
wide-column entity with a single default (anonymous) column.
(In contrast, regular `Get` returns plain old key-values as-is, and
returns the value of the default column for wide-column entities, see
https://github.com/facebook/rocksdb/issues/10483 .)
The result of `GetEntity` is a self-contained `PinnableWideColumns` object.
`PinnableWideColumns` contains a `PinnableSlice`, which either stores the
underlying data in its own buffer or holds on to a cache handle. It also contains
a `WideColumns` instance, which indexes the contents of the `PinnableSlice`,
so applications can access the values of columns efficiently.
There are several pieces of functionality which are currently not supported
for wide-column entities: there is currently no `MultiGetEntity` or wide-column
iterator; also, `Merge` and `GetMergeOperands` are not supported, and there
is no `GetEntity` implementation for read-only and secondary instances.
We plan to implement these in future PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38847474
Pulled By: ltamasi
fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b
Summary:
Previously, the flushes triggered by `WriteBufferManager` could affect
the same CF repeatedly if it happens to get consecutive writes. Such
flushes are not particularly useful for reducing memory usage since
they switch nearly-empty memtables to immutable while they've just begun
filling their first arena block. In fact they may not even reduce the
mutable memory count if they involve replacing one mutable memtable containing
one arena block with a new mutable memtable containing one arena block.
Further, if such switches happen even a few times before a flush finishes,
the immutable memtable limit will be reached and writes will stall.
This PR adds a heuristic to not switch memtables to immutable for CFs
that already have one or more immutable memtables awaiting flush. There
is a memory usage regression if the user continues writing to the same
CF, that DB does not have any CFs eligible for switching, flushes
are not finishing, and the `WriteBufferManager` was constructed with
`allow_stall=false`. Before, it would grow by switching nearly empty
memtables until writes stall. Now, it would grow by filling memtables
until writes stall. This feels like an acceptable behavior change because
users who prefer to stall over violate the memory limit should be using
`allow_stall=true`, which is unaffected by this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6364
Test Plan:
- Command:
`rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num_multi_db=8 -num_column_families=2 -write_buffer_size=4194304 -db_write_buffer_size=16777216 -compression_type=none -statistics=true -target_file_size_base=4194304 -max_bytes_for_level_base=16777216`
- `rocksdb.db.write.stall` count before this PR: 175
- `rocksdb.db.write.stall` count after this PR: 0
Reviewed By: jay-zhuang
Differential Revision: D20167197
Pulled By: ajkr
fbshipit-source-id: 4a64064e9bc33d57c0a35f15547542d0191d0cb7
Summary:
The fix in https://github.com/facebook/rocksdb/issues/10513 was not complete w.r.t range deletion handling. It didn't handle the case where a file with a range tombstone covering a key also overlapped another key in the batch. In that case, ```mget_range``` would be non-empty. However, ```mget_range``` would only have the second key and, therefore, the first key would be skipped when iterating through the range tombstones in ```TableCache::MultiGet```.
Test plan -
1. Add a unit test
2. Run stress tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10534
Reviewed By: akankshamahajan15
Differential Revision: D38773880
Pulled By: anand1976
fbshipit-source-id: dae491dbe52e18bbce5179b77b63f20771a66c00
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461
Reviewed By: siying
Differential Revision: D38672823
Pulled By: ltamasi
fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
Summary:
Fix copyright for two more extra headers to make internal tool happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10525
Reviewed By: jay-zhuang
Differential Revision: D38661390
fbshipit-source-id: ab2d055bfd145dfe82b5bae7a6c25cc338c8de94
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.
See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.
This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.
These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):
```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```
After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```
However, the change does modify (probably weaken) the "guaranteed unique" promise from this
> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86
to this (see https://github.com/facebook/rocksdb/issues/10388)
> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits
I don't think this is a practical concern, though.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394
Test Plan: unit tests updated, see simulation results above
Reviewed By: jay-zhuang
Differential Revision: D38667529
Pulled By: pdillinger
fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
Summary:
The info LOG file does not currently give any direct
information about the existence of old, live snapshots, nor how to
estimate wall time from a sequence number within the scope of LOG
history. This change addresses both with:
* Logging smallest and largest seqnos for generated SST files, which can
help associate sequence numbers with write time (based on flushes).
* Logging oldest_snapshot_seqno for each compaction, which (along with
that seqno info) helps us to determine how much old data might be kept
around for old (leaked?) snapshots. Including the date here I thought might
be excessive.
I wanted to log the date and seqno of the oldest snapshot with periodic
stats, but the current structure of the code doesn't really support that
because `DumpDBStats` doesn't have access to the DB object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10454
Test Plan:
manual inspect LOG from
`KEEP_DB=1 ./db_basic_test --gtest_filter=*CompactBetweenSnapshots*`
Reviewed By: ajkr
Differential Revision: D38326948
Pulled By: pdillinger
fbshipit-source-id: 294918ffc04a419844146cd826045321b4d5c038
Summary:
https://github.com/facebook/rocksdb/pull/10057 caused a regression bug: since the base level size is not adjusted based on L0 size anymore, L0 score might become very large. This makes compaction heavily favor L0->L1 compaction against L1->L2 compaction, and cause in some cases, data stuck in L1 without being moved down. We fix calculating a score of L0 by size(L0)/size(L1) in the case where L0 is large..
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10518
Test Plan: run db_bench against data on tmpfs and watch the behavior of data stuck in L1 goes away.
Reviewed By: ajkr
Differential Revision: D38603145
fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
Summary:
New blobdb has a bug in compaction filter, where `blob_value_` is not reset for next iterated key. This will cause blob_value_ not empty and previous value read from blob is passed into the filter function for next key, even if its value is not in blob. Fixed by reseting regardless of key type.
Test Case:
Add `FilterByValueLength` test case in `DBBlobCompactionTest`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391
Reviewed By: riversand963
Differential Revision: D38629900
Pulled By: ltamasi
fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa
Summary:
Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519
Reviewed By: jay-zhuang
Differential Revision: D38603291
fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
Summary:
Moved linux builds to using docker to avoid CI instability caused by dependency installation site down.
Added the `Dockerfile` which is used to build the image.
The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test).
Also fixed a few issues detected while building this:
* `DestoryDB()` Status not checked for a few tests
* nullptr might be used in `inlineskiplist.cc`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D38554200
Pulled By: jay-zhuang
fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
Summary:
A test in db_block_cache_test.cc was skipping ClockCache due to the 16-byte key length requirement. We fixed this. Along the way, we fixed a bug in ApplyToSomeEntries, which assumed the function being applied could modify handle metadata, and thus took an exclusive reference. This is incompatible with calls that need to inspect every element (including externally referenced ones) to gather stats.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10482
Test Plan: ``make -j24 check``
Reviewed By: anand1976
Differential Revision: D38553073
Pulled By: guidotag
fbshipit-source-id: 0ed63fed4d3b89e5056b35b7091fce579f5647ae
Summary:
A flag in WritableFileWriter is introduced to remember error has happened. Subsequent operations will fail with an assertion. Those operations, except Close() are not supposed to be called anyway. This change will help catch bug in tests and stress tests and limit damage of a potential bug of continue writing to a file after a failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489
Test Plan: Fix existing unit tests and watch crash tests for a while.
Reviewed By: anand1976
Differential Revision: D38473277
fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
Summary:
The main purpose is to make debugging easier without sacrificing performance.
Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to an `uint8_t`,
using the LSB to denote if the compaction iterator is valid and 4 additional bits to denote where
the iterator is set valid inside `NextFromInput()`. Therefore, when the control flow reaches
`PrepareOutput()` and hits assertion there, we can have a better idea of what has gone wrong.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505
Test Plan:
make check
```
TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush
```
The above command has a 'flush' benchmark which uses `CompactionIterator`. I haven't observed any CPU regression or drop in throughput or latency increase.
Reviewed By: ltamasi
Differential Revision: D38551615
Pulled By: riversand963
fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0
Summary:
This PR fixes 2 bugs introduced in https://github.com/facebook/rocksdb/issues/10432 -
1. If the bloom filter returned a negative result for all MultiGet keys in a file, the range tombstones in that file were being ignored, resulting in incorrect results if those tombstones covered a key in a higher level.
2. If all the keys in a file were filtered out in `TableCache::MultiGetFilter`, the table cache handle was not being released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10513
Test Plan: Add a new unit test that fails without this fix
Reviewed By: akankshamahajan15
Differential Revision: D38548739
Pulled By: anand1976
fbshipit-source-id: a741a1e25d2e991d63f038100f126c2dc404a87c
Summary:
We have recently added caching support to BlobDB, and separately,
implemented an optimization where reading blobs from the cache
results in the cache handle being transferred to the target `PinnableSlice`
(as opposed to the contents getting copied). With these changes,
it makes sense to reset the `PinnableSlice` storing the blob value in
`DBIter` as soon as we move to a different iterator position to prevent
us from holding on to the cache handle any longer than necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10490
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38473630
Pulled By: ltamasi
fbshipit-source-id: 84c045ffac76436c6152fd0f5775b007f4051386
Summary:
The patch adds support for wide-column entities to the existing query
APIs (`Get`, `MultiGet`, and iterator). Namely, when during a query a
wide-column entity is encountered, we will return the value of the default
(anonymous) column as the result. Later, we plan to add wide-column
specific query APIs which will enable retrieving entire wide-column entities
or a subset of their columns.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10483
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D38441881
Pulled By: ltamasi
fbshipit-source-id: 6444e79a31aff2470e866698e3a97985bc2b3543
Summary:
Change tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purpose only,
which is behaving the same as `last_level_temperture` and it will be removed in
the next release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471
Test Plan: CI
Reviewed By: siying
Differential Revision: D38450621
Pulled By: jay-zhuang
fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
Summary:
Current universal compaction picker may cause extra size amplification
compaction if there're more hot data on penultimate level. Improve the picker
to skip the last level for size amp calculation if tiered compaction is
enabled, which can
1. avoid extra unnecessary size amp compaction;
2. typically cold tier (the last level) is not size constrained, so skip size
amp for cold tier is intended;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467
Test Plan: CI and added unittest
Reviewed By: siying
Differential Revision: D38391350
Pulled By: jay-zhuang
fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2
Summary:
Similarly to https://github.com/facebook/rocksdb/pull/10457, we now have
to explicitly set the `fill_cache` read option when reading blobs in
`DBIter` to prevent the cache from getting polluted by queries with
`fill_cache` set to false. (Before we added support for a blob cache,
the setting had not made any difference either way.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10492
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38476121
Pulled By: ltamasi
fbshipit-source-id: ea5c5e252f83e4a4e2c74156b37d40308d7e0c80
Summary:
Local static string is not friendly to Jemalloc arena aware implementation, as it will be allocated on the arena of the first caller, which causes crash if the allocated arena gets refunded earlier.
P.S. A Jemalloc arena aware implementation is each rocksdb instance only use certain Jemalloc arenas, and arena will be refunded after associated DB instance is destroyed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8103
Reviewed By: ajkr
Differential Revision: D38477235
Pulled By: ltamasi
fbshipit-source-id: a58d32cb647ed64c144b4736fb2d5db27c2c28f9
Summary:
The feature `SuggestCompactRange()` is still experimental. Just
re-add the test back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10473
Test Plan: CI
Reviewed By: akankshamahajan15
Differential Revision: D38427153
Pulled By: jay-zhuang
fbshipit-source-id: 0b4491c947cbce6c18ff147b167e3c678633129a
Summary:
**Context/summary:**
`ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0 ` relies on `DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles` happens before verifying `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(),
1 * CacheReservationManagerImpl<
CacheEntryRole::kFileMetadata>::GetDummyEntrySize());` or `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(), 0);` to ensure appropriate cache reservation release is done before checking.
However, this might not be the case under some timing delay and spurious wake-up as coerced below.
```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 4378f3212..3e4f60853 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2989,6 +2989,8 @@ void DBImpl::BackgroundCallCompaction(PrepickedCompaction* prepicked_compaction,
if (job_context.HaveSomethingToClean() ||
job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
mutex_.Unlock();
+ bg_cv_.SignalAll();
+ usleep(1000);
// Have to flush the info logs before bg_compaction_scheduled_--
// because if bg_flush_scheduled_ becomes 0 and the lock is
// released, the deconstructor of DB can kick in and destroy all the
// states of DB so info_log might not be available after that point.
// It also applies to access other states that DB owns.
log_buffer.FlushBufferToLog();
if (job_context.HaveSomethingToDelete()) {
PurgeObsoleteFiles(job_context);
TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles");
}
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10481
Test Plan:
The test of interest failed often at the above coercion:
After fix, the test of interest passed at the above coercion:
Reviewed By: jay-zhuang
Differential Revision: D38438256
Pulled By: hx235
fbshipit-source-id: de80ecdb250174f00e7c2f5e4d952695ed56f51e
Summary:
- Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable.
```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.
Single thread result:
| Max # tombstones | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op | 99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |6.68 |6.57 |6.72 |4.72 |4.79 |4.54 |
| 1 |6.67 |6.58 |6.62 |5.41 |4.74 |4.72 |
| 10 |6.59 |6.5 |6.56 |7.83 |4.69 |4.59 |
| 100 |6.62 |6.75 |6.58 |29.57 |5.04 |5.09 |
| 1000 |6.54 |6.82 |6.61 |320.33 |5.22 |5.21 |
32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op | 99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |234.52 |260.25 |239.42 |5.06 |5.38 |5.09 |
| 1 |236.46 |262.0 |231.1 |19.57 |22.14 |5.45 |
| 10 |236.95 |263.84 |251.49 |151.73 |21.61 |5.73 |
| 100 |268.16 |296.8 |280.13 |2308.52 |22.27 |6.57 |
Reviewed By: ajkr
Differential Revision: D37916564
Pulled By: cbi42
fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
Summary:
This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432
Reviewed By: akankshamahajan15
Differential Revision: D38245910
Pulled By: anand1976
fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
Summary:
Existing DBWALTest.RaceInstallFlushResultsWithWalObsoletion test relies
on a specific interleaving of two background flush threads. We call them
bg1 and bg2, and assume bg1 starts to install flush results ahead of
bg2. After bg1 enters `ProcessManifestWrites`, bg1 waits for bg2 to also
enter `MemTableList::TryInstallMemtableFlushResults()` before bg1 can
proceed with MANIFEST write. However, if bg2 called `SyncClosedLogs()`
and needed to commit to the MANIFEST but falls behind bg1, then bg2
needs to wait for bg1 to finish writing to MANIFEST. This is a circular
dependency.
Fix this by allowing bg2 to start only after bg1 grabs the chance to
sync the WAL and commit to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10456
Test Plan:
1. make check
2. export TEST_TMPDIR=/dev/shm && gtest-parallel -r 1000 -w 32 ./db_wal_test --gtest_filter=DBWALTest.RaceInstallFlushResultsWithWalObsoletion
Reviewed By: ltamasi
Differential Revision: D38391856
Pulled By: riversand963
fbshipit-source-id: 55f647d5b94e534c008a4dd2fb082675ddf58c96
Summary:
This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns.
The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will cleanup the bundle, including unreferencing the `SuperVersion`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458
Test Plan:
- new DB level test
- measured benefit/regression in a number of memtable scenarios
Setup command:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000
```
Benchmark command:
```
./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10
```
Worst regression is when a key has many tiny operands:
- Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1
- `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%)
The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact.
Best improvement is when a key has large operands and high concurrency:
- Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32
- `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%).
Reviewed By: cbi42
Differential Revision: D38336578
Pulled By: ajkr
fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258
Summary:
Resolves https://github.com/facebook/rocksdb/issues/9692
This PR adds a unit test that reproduces the race described in https://github.com/facebook/rocksdb/issues/9692 and an according fix.
The unit test does not have any assertions, because I could not find a reliable and save way to assert that the writers list does not form a cycle. So with the old (buggy) code, the test would simply hang, while with the fix the test passes successfully.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9944
Reviewed By: pdillinger
Differential Revision: D36134604
Pulled By: riversand963
fbshipit-source-id: ef636c5a79ddbef18658ab2f19ca9210a427324a
Summary:
TL;DR: due to a recent change, if you drop a column family,
often that DB will no longer fsync after writing new SST files
to remaining or new column families, which could lead to data
loss on power loss.
More bug detail:
The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
DB::Close time rather than waiting for DB object destruction.
Unfortunately, it also closes shared FSDirectory objects on
DropColumnFamily (& destroy remaining handles), which can lead
to use-after-Close on FSDirectory shared with remaining column
families. Those "uses" are only Fsyncs (or redundant Closes). In
the default Posix filesystem, an Fsync on a closed FSDirectory is a
quiet no-op. Consequently (under most configurations), if you drop
a column family, that DB will no longer fsync after writing new SST
files to column families sharing the same directory (true under most
configurations).
More fix detail:
Basically, this removes unnecessary Close ops on destroying
ColumnFamilyData. We let `shared_ptr` take care of calling the
destructor at the right time. If the intent was to require Close be
called before destroying FSDirectory, that was not made clear by the
author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
timely destruction of FSDirectory to suffice as Close (in
CountedFileSystem). Added a TODO to revisit that.
Also in this PR:
* Added a TODO to share FSDirectory instances between DB and its column
families. (Already shared among column families.)
* Made DB::Close attempt to close all its open FSDirectory objects even
if there is a failure in closing one. Also code clean-up around this
logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
Test Plan:
add an assert to check for use-after-Close. With that
existing tests can detect the misuse. With fix, tests pass (except noted
relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)
Reviewed By: ajkr
Differential Revision: D38357922
Pulled By: pdillinger
fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
Summary:
During compaction, blobs are currently read using the default
`ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
this didn't make any difference since we didn't have a blob cache;
however, now we have to explicitly set this flag to false to avoid
polluting the cache during compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D38333528
Pulled By: ltamasi
fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2
Summary:
FileMetaData::[min|max]_timestamp are not currently being used or
tracked by RocksDB, even when user-defined timestamp is enabled. Each of
them is a std::string which can occupy 32 bytes. Remove them for now.
They may be added back when we have a pressing need for them. When we do
add them back, consider store them in a more compact way, e.g. one
boolean flag and a byte array of size 16.
Per file min/max timestamp bounds are available as table properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D38292275
Pulled By: riversand963
fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94
Summary:
The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues.
Issue1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contains timestamps which results in assertion failure as BlockBasedTableIterator expects bounds to be without timestamps. As result, because of wrong comparison end key is returned as user_key resulting in assertion failure.
Issue2: Since it might result in two keys that only differ by user timestamp getting processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correction issues.
This PR provide support to reenable subcompactions with user-defined timestamps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344
Test Plan:
Added new unit test
- Without fix for Issue1 unit test MultipleSubCompactions fails with error:
```
db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterat│
or::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Received signal 6 (Aborted) │
#0 /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Aborted (core dumped)
```
Ran stress test
`make crash_test_with_ts -j32`
Reviewed By: riversand963
Differential Revision: D38220841
Pulled By: akankshamahajan15
fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb
Summary:
This PR changes the default value of
`CompactRangeOptions::exclusive_manual_compaction` from true to false so
manual `CompactRange()`s can run in parallel with other compactions. I
believe no artificial parallelism restriction is the intuitive behavior
so feel the old default value is a trap, which I have fallen into
several times, including yesterday.
`CompactRangeOptions::exclusive_manual_compaction == false` has been
used in both our correctness test and in production for years so should
be reasonably safe.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10317
Reviewed By: jay-zhuang
Differential Revision: D37659392
Pulled By: ajkr
fbshipit-source-id: 504915e978bbe300b79483d064070c75e93d91e5
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309
Reviewed By: ltamasi
Differential Revision: D38211655
Pulled By: gangliao
fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
Summary:
Allow sufficient subcompactions can be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority.
Test Case:
Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422
Reviewed By: ajkr
Differential Revision: D38186545
Pulled By: littlepig2013
fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e
Summary:
Make the Stale Flush test more robust by explicitly checking the target CF is
flushed. Currently it's flaky because the default CF may have more than 3
SSTs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10409
Test Plan:
the test more likely to fail on a resource limited host:
```
gtest-parallel ./column_family_test --gtest_filter=FormatDef/ColumnFamilyTest.FlushStaleColumnFamilies/0 -r 1000 -w 100
```
Reviewed By: ajkr
Differential Revision: D38116383
Pulled By: jay-zhuang
fbshipit-source-id: e27cc56f76f14d0936504f126104e3d87e3d0d5f
Summary:
If WAL compression is enabled, WAL fragment decompression results are concatenated together in `log::Reader::ReadPhysicalRecord()`. This PR adds checksum handshake to protect memory corruption during the copying process.
`checksum` is renamed to `record_checksum` in `ReadRecord()` to differentiate it from `checksum_` flag that specifies whether CRC32C checksum is verified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10339
Test Plan: added checksum verification in log_test.cc, `make check -j32`.
Reviewed By: ajkr
Differential Revision: D37763734
Pulled By: cbi42
fbshipit-source-id: c4faa7c76b9ff1df35026edf31adfe4b47ae3154
Summary:
Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
* Constraint 1: We can only pick consecutive files
- Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
- Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
* Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
* Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
* Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.
The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
* Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()`
* Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()`
* Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions)
More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
Reviewed By: ajkr, hx235
Differential Revision: D37792644
Pulled By: littlepig2013
fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
Summary:
Unit tests still haven't been fixed. Also need to add more tests. But I ran some simple fillrandom db_bench and the partitioning feels reasonable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393
Test Plan:
1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable;
2. Add a new unit test to ApproximateKeyAnchors()
3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly.
Reviewed By: jay-zhuang
Differential Revision: D38043783
fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
Summary:
Travis CI is depreciated and haven't been maintained for some time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407
Reviewed By: ajkr
Differential Revision: D38078382
Pulled By: jay-zhuang
fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb
Summary:
Minor issue, I just found a few typos on db_test and column_family while reading the code. And I have this PR opened to contribute. :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10139
Reviewed By: ajkr
Differential Revision: D38007098
Pulled By: jay-zhuang
fbshipit-source-id: 511947b32424c34348184691216640f32c410fb1
Summary:
Previously the "Fragmentation" test didn't cover fragmentation because the WAL data was compressible into trivial size. This PR changes it to use random data so the post-compression size is large enough to require fragmentation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10402
Reviewed By: cbi42
Differential Revision: D38065596
Pulled By: ajkr
fbshipit-source-id: 0d5f89ca14d33546501a74b5d4fafbadc28a46a7
Summary:
GCC-12 has strick check on variables, and thus
build fails when it finds read_req is not properly
initialized (-Werror=maybe-uninitialized). Add
default value to fix this.
Change-Id: Ib8a9085e2d613ee7b943b58a6a58e1bc351725d7
Signed-off-by: Jun He <jun.he@arm.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10312
Reviewed By: riversand963
Differential Revision: D37656997
Pulled By: ajkr
fbshipit-source-id: fe47492c913b34b3a03c04beeec9ec57831dcaff
Summary:
## Problem Summary
RocksDB will acquire the global mutex of db instance for every time when user calls `Write`. When RocksDB schedules a lot of compaction jobs, it will compete the mutex with write thread and it will hurt the write performance.
## Problem Solution:
I want to use log_write_mutex to replace the global mutex in most case so that we do not acquire it in write-thread unless there is a write-stall event or a write-buffer-full event occur.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7516
Test Plan:
1. make check
2. CI
3. COMPILE_WITH_TSAN=1 make db_stress
make crash_test
make crash_test_with_multiops_wp_txn
make crash_test_with_multiops_wc_txn
make crash_test_with_atomic_flush
Reviewed By: siying
Differential Revision: D36908702
Pulled By: riversand963
fbshipit-source-id: 59b13881f4f5c0a58fd3ca79128a396d9cd98efe
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321
Reviewed By: ltamasi
Differential Revision: D37913590
Pulled By: gangliao
fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
Summary:
`PeriodicWorkScheduler` is a global singleton, which were used to store per-instance setting `record_seqno_time_cadence_`. Move that to db_impl.h which is per-instance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10383
Reviewed By: siying
Differential Revision: D37928009
Pulled By: jay-zhuang
fbshipit-source-id: e517754f4a9db98798ac04f72033d4b517f734e9
Summary:
(PR created for informational/testing purposes only.)
- Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()`
- Benefit over #10374 is eliminating race conditions with Configurable framework.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378
Reviewed By: pdillinger
Differential Revision: D37914865
fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
I add C API
db/c.cc
rocksdb_sstfilewriter_delete_range
and test it , PASS.
can you review it ? ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10314
Reviewed By: riversand963
Differential Revision: D37657236
Pulled By: ajkr
fbshipit-source-id: c3b758daa36fbd9133210b011716914dff311278
Summary:
RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used. The goals of this task are to add support for using a secondary cache with the blob cache and to measure the potential performance gains using `db_bench`.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10349
Reviewed By: ltamasi
Differential Revision: D37896773
Pulled By: gangliao
fbshipit-source-id: 7804619ce4a44b73d9e11ad606640f9385969c84
Summary:
Using the Sequence number to time mapping to decide if a key is hot or not in
compaction and place it in the corresponding level.
Note: the feature is not complete, level compaction will run indefinitely until
all penultimate level data is cold and small enough to not trigger compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10370
Test Plan:
CI
* Run basic db_bench for universal compaction manually
Reviewed By: siying
Differential Revision: D37892338
Pulled By: jay-zhuang
fbshipit-source-id: 792bbd91b1ccc2f62b5d14c53118434bcaac4bbe
Summary:
Which will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, adding seqno to time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time in certain cadence. When
memtable flush, the mapping informaiton is stored in sstable property.
During compaction, the mapping information are merged and get the
approximate time of sequence number, which is used to determine if a key
is recently inserted or not and preclude it from the last level if it's
recently inserted (within the `preclude_last_level_data_seconds`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338
Test Plan: CI
Reviewed By: siying
Differential Revision: D37810187
Pulled By: jay-zhuang
fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
Summary:
InternalKeyComparator is an internal class which is a simple wrapper of Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizeable. As a side effect, internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator will have an std::vector attached to it, which consumes memory and possible allocation overhead too.
We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342
Test Plan: Run existing CI tests and make sure it doesn't fail
Reviewed By: riversand963
Differential Revision: D37771351
fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
Summary:
Support per_key_placement for last level compaction, which will
be used for tiered compaction.
* compaction iterator reports which level a key should output to;
* compaction get the output level information and check if it's safe to
output the data to penultimate level;
* all compaction output files will be installed.
* extra internal compaction stats added for penultimate level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964
Test Plan:
* Unittest
* db_bench, no significate difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4
* microbench manual compaction no significate difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd
* run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m)
Reviewed By: ajkr
Differential Revision: D36249494
Pulled By: jay-zhuang
fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3
Summary:
Enabled zstd checksum flag in StreamingCompress so that WAL (de)compreression is protected by a checksum per compression frame.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10319
Test Plan:
- `make check`
- WAL perf: average ops/sec over 10 runs is 161226 pre PR and 159635 post PR (1% drop).
```
sudo TEST_TMPDIR=/dev/shm/memtable_write ./db_bench_checksum -benchmarks=fillseq -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=10 -wal_compression=zstd
```
Reviewed By: ajkr
Differential Revision: D37673311
Pulled By: cbi42
fbshipit-source-id: 9f34a3bfc2a82e5c80b1ec63bb339a7465108ec9
Summary:
ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.
The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.
A new version of the NewClockCache function was created for our internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351
Test Plan: ``make -j24 check`` and re-run the pre-release tests.
Reviewed By: pdillinger
Differential Revision: D37802685
Pulled By: guidotag
fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
Summary:
InternalKeyComparator is a thin wrapper around user comparator. Storing a string for name is relatively expensive to this small wrapper for both CPU and memory usage. Try to remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10343
Test Plan: Run existing tests
Reviewed By: ajkr
Differential Revision: D37772469
fbshipit-source-id: d2d106a8d022193058fd7f6b220108e3d94aca34
Summary:
The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.)
This has the potential to save a lot of CPU, especially with large blob values.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10297
Reviewed By: riversand963
Differential Revision: D37640311
Pulled By: gangliao
fbshipit-source-id: 92de0e35cc703d06c87c5c1861cc2899ec52234a
Summary:
Earlier implementation of cutting the output files with a compact cursor under Round-Robin priority uses `Valid()` to determine if the `output_split_key` is valid in `ShouldStopBefore`. This contributes to excessive CPU computation, as pointed out by [this issue](https://github.com/facebook/rocksdb/issues/10315). In this PR, we change the type of `output_split_key` to be `InternalKey*` and set it as `nullptr` if it is not going to be used in `ShouldStopBefore`, `Valid()` condition checking can be avoided using that pointer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10316
Reviewed By: ajkr
Differential Revision: D37661492
Pulled By: littlepig2013
fbshipit-source-id: 66ff1105f3378e5573d3a126fdaff9bb23b5498f
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)
I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.
HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262
Test Plan: updated unit tests
Reviewed By: hx235
Differential Revision: D37473155
Pulled By: pdillinger
fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
Summary:
In leveled compaction, try to trivial move more than one files if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where appying compaction results is the bottleneck.
When pick up a file to compact and it doesn't have overlapping files in the next level, try to expand to the next file if there is still no overlapping.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190
Test Plan:
Add some unit tests.
For performance, Try to run
./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark.
Reviewed By: jay-zhuang
Differential Revision: D37230647
fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
Summary:
Before this PR, it is required that application open RocksDB secondary
instance with `max_open_files = -1`. This is a hacky workaround that
prevents IOErrors on the seconary instance during point-lookup or range
scan caused by primary instance deleting the table files. This is not
necessary if the application can coordinate the primary and secondaries
so that primary does not delete files that are still being used by the
secondaries. Or users can provide a custom Env/FS implementation that
deletes the files only after all primary and secondary instances
indicate files are obsolete and deleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10260
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D37462633
Pulled By: riversand963
fbshipit-source-id: 9c2fc939f49663efa61e3d60c8f1e01d64b9d72c
Summary:
In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.
1. We always try to find L0->L1 trivial move from the oldest files. Keep including newer files, until adding a new file won't trigger a trivial move
2. Modify the trivial move condition so that this compaction would be tagged as trivial move.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188
Test Plan:
See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:
./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
The throughput improved by about 50%. Stalling still happens though.
Reviewed By: jay-zhuang
Differential Revision: D37224743
fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
Summary:
In round-robin compaction priority, when splitting the compaction into sub-compactions, the earlier implementation takes into account the compact cursor to have full use of available sub-compactions. But this may result in unbalanced sub-compactions, so we remove this here. The removal does not affect the cursor-based splitting mechanism within a sub-compaction, and thus the output files are still ensured to be split according to the cursor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289
Reviewed By: ajkr
Differential Revision: D37559091
Pulled By: littlepig2013
fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc
Summary:
The current level targets for dynamical leveling has a problem: the target level size will dramatically change after a L0->L1 compaction. When there are many L0 bytes, lower level compactions are delayed, but they will be resumed after the L0->L1 compaction finishes, so the expected write amplification benefits might not be realized. The proposal here is to revert the level targetting size, but instead relying on adjusting score for each level to prioritize levels that need to compact most.
Basic idea:
(1) target level size isn't adjusted, but score is adjusted. The reasoning is that with parallel compactions, holding compactions from happening might not be desirable, but we would like the compactions are scheduled from the level we feel most needed. For example, if we have a extra-large L2, we would like all compactions are scheduled for L2->L3 compactions, rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on. Should we compact L2->L3 or L4->L5. So the proposal for that is:
(2) the score is calculated by actual level size / (target size + estimated upper bytes coming down). The reasoning is that if we have a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive, as when the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, when the amount of bytes coming down to L5, the impacts to L5->L6 fanout are much less. So when calculating target score, we can adjust it by adding estimated downward bytes to the target level size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10057
Test Plan: Repurpose tests VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.
Reviewed By: ajkr
Differential Revision: D37539742
fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
Summary:
- [x] Enabled blob caching for MultiGetBlob in RocksDB
- [x] Refactored MultiGetBlob logic and interface in RocksDB
- [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
- [x] Add End-to-end test cases in db_blob_basic_test and also add unit tests in blob_source_test
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272
Reviewed By: ltamasi
Differential Revision: D37558112
Pulled By: gangliao
fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
Summary:
Add load_latest_options() to C api.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10152
Test Plan:
Extend the existing c_test by reopening db using the latest options file
at different parts of the test.
Reviewed By: akankshamahajan15
Differential Revision: D37305225
Pulled By: ajkr
fbshipit-source-id: 8b3bab73f56fa6fcbdba45aae393145d007b3962
Summary:
If the internal iterator is not valid, `SeekToLast` with iter_start_ts should have `valid_` is false without assertion failure.
Test plan
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10279
Reviewed By: ltamasi
Differential Revision: D37539393
Pulled By: riversand963
fbshipit-source-id: 8e94057838f8a05144fad5768f4d62f1893ec315
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
Summary:
Currently, when installing a new super version, when stalling condition triggers, we compare estimated compaction bytes to previously, and if the new value is larger or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be used. The result is that, although some compactions reduce estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, it causes slowdown rate drops all the way to the minimum, far lower than needed.
Fix the bug by not triggering a re-calculation if a new super version doesn't have Version or a memtable change. With this fix, number of compaction finishes are still undercounted in this algorithm, but it is still better than the current bug where they are negatively counted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270
Test Plan: Run a benchmark where the slowdown rate is dropped to minimal unnessarily and see it is back to a normal value.
Reviewed By: ajkr
Differential Revision: D37497327
fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
Summary:
Resolves https://github.com/facebook/rocksdb/issues/9761
With this PR, applications can create an iterator with the following
```cpp
ReadOptions read_opts;
read_opts.timestamp = &ts_ub;
read_opts.iter_start_ts = &ts_lb;
auto* it = db->NewIterator(read_opts);
it->SeekToLast();
// or it->SeekForPrev("foo");
it->Prev();
...
```
The application can access different versions of the same user key via `key()`, `value()`, and `timestamp()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10200
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D37258074
Pulled By: riversand963
fbshipit-source-id: 3f0b866ade50dcff7ef60d506397a9dd6ec91565
Summary:
Most of the properties listed as required for prefix extractors
are not really required but offer some conveniences. This updates API
comments to clarify actual requirements, and adds tests to demonstrate
how previously presumed requirements can be safely violated.
This might seem like a useless exercise, but this relaxing of requirements
would be needed if we generalize prefixes to group keys not just at the
byte level but also based on bits or arbitrary value ranges. For
applications without a "natural" prefix size, having only byte-level
granularity often means one prefix size to the next differs in magnitude
by a factor of 256.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10245
Test Plan: Tests added, also covering missing Iterator cases from https://github.com/facebook/rocksdb/issues/10244
Reviewed By: bjlemaire
Differential Revision: D37371559
Pulled By: pdillinger
fbshipit-source-id: ab2dd719992eea7656e9042cf8542393e02fa244
Summary:
The `bytes_read` returned by the current BlobSource interface is ambiguous. The uncompressed blob size is returned if the cache hits. The size of the blob read from disk, presumably the compressed version, is returned if the cache misses. Two differing semantics might cause ambiguity and consistency issues. For example, this inconsistency causes the assertion failure (T124246362 and its hot fix is https://github.com/facebook/rocksdb/issues/10249).
This goal is to require that the value of `byte read` always be an on-disk blob record size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10248
Reviewed By: ltamasi
Differential Revision: D37470292
Pulled By: gangliao
fbshipit-source-id: fbca521b2791d3674dbf2484cea5fcae2fdd94d2
Summary:
The patch fixes a couple of issues related to in-place updates: 1) the value type was not passed from
`MemTableInserter::PutCFImpl` to `MemTable::Update` and 2) `MemTable::UpdateCallback` was called
for any value type (with the callee's logic assuming `kTypeValue`) even though the callback mechanism
is only safe for plain values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10254
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D37463644
Pulled By: ltamasi
fbshipit-source-id: 33802477dac0691681f416ae84c4d9742c6fe41a
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D37369748
Pulled By: ltamasi
fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
Summary:
The 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` may occasionally fail due to the inconsistent LSM state. The issue is fixed by adding `Flush()` and `WaitForFlushMemTable()` to produce a more predictable and stable LSM state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10250
Test Plan: 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test`
Reviewed By: jay-zhuang, riversand963
Differential Revision: D37426091
Pulled By: littlepig2013
fbshipit-source-id: 56fbaab0384c380c1f279a16dc8732b139c9f611
Summary:
Currently SortFileByOverlappingRatio() is O(nlogn). It is usually OK but When there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compaction is only trivial move and this operation becomes the bottleneck and limit the total throughput. This commit makes SortFileByOverlappingRatio() only find the top 50 files based on score. 50 files are usually enough for the parallel compactions needed for the level, and in case it is not enough, we would fall back to random, which should be acceptable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161
Test Plan:
Run a fillseq that generates a lot of files, and observe throughput improved (although stall is not yet eliminated). The command ran:
TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000
The throughput improved by 11%.
Reviewed By: jay-zhuang
Differential Revision: D37129469
fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
Summary:
This task is to fix assertion failures during the crash test runs. The cache entry size might not match value size because value size can include the on-disk (possibly compressed) size. Therefore, we removed the assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10249
Reviewed By: ltamasi
Differential Revision: D37407576
Pulled By: gangliao
fbshipit-source-id: 577559f267c5b2437bcd0631cd0efabb6dde3b69
Summary:
Resolves https://github.com/facebook/rocksdb/issues/10129
I extracted this fix from https://github.com/facebook/rocksdb/issues/7516 since it's also already a bug in main branch, and we want to
separate it from the main part of the PR.
There can be a race condition between two threads. Thread 1 executes
`DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`.
```
Time thread 1 thread 2
| mutex_.lock
| read disable_delete_obsolete_files_
| ...
| wait on log_sync_cv and release mutex_
| mutex_.lock
| ++disable_delete_obsolete_files_
| mutex_.unlock
| mutex_.lock
| while (pending_purge_obsolete_files > 0) { bg_cv.wait;}
| wake up with mutex_ locked
| compute WALs tracked by MANIFEST
| mutex_.unlock
| wake up with mutex_ locked
| ++pending_purge_obsolete_files_
| mutex_.unlock
|
| delete obsolete WAL
| WAL missing but tracked in MANIFEST.
V
```
The fix proposed eliminates the possibility of the above by increasing
`pending_purge_obsolete_files_` before `FindObsoleteFiles()` can possibly release the mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10187
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D37214235
Pulled By: riversand963
fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184
Summary:
Add suggest_compact_range() and suggest_compact_range_cf() to C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175
Test Plan:
As verifying the result requires SyncPoint, which is not available in the c_test.c,
the test is currently done by invoking the functions and making sure it does not crash.
Reviewed By: jay-zhuang
Differential Revision: D37305191
Pulled By: ajkr
fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
Summary:
The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level’s cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two – one entirely before the cursor and one entirely after the cursor. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move since the file will not be contaminated by older files. In such case, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227
Test Plan:
Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`
Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)
Reviewed By: jay-zhuang
Differential Revision: D37388088
Pulled By: littlepig2013
fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225
Test Plan:
Add test cases for MultiGetBlob
In this task, we added the new API MultiGetBlob() for BlobSource.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Reviewed By: ltamasi
Differential Revision: D37358364
Pulled By: gangliao
fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
Summary:
When a key is "out of domain" for the prefix_extractor (no
prefix assigned) then the Bloom filter is not queried. PerfContext
was counting this as a Bloom "hit" while Statistics doesn't count this
as a prefix Bloom checked. I think it's more accurate to call it neither
hit nor miss, so changing the counting to make it PerfContext coounting
more like Statistics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10244
Test Plan:
tests updates and expanded (Get and MultiGet). Iterator test
coverage of the change will come in next PR
Reviewed By: bjlemaire
Differential Revision: D37371297
Pulled By: pdillinger
fbshipit-source-id: fed132fba6a92b2314ab898d449fce2d1586c157
Summary:
**Summary**
Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
**Motivation**
RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible.
Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.
**Content of this PR**
This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.
**Benchmarking**
I will add numbers to prove that there is no performance impact within the next 12 hours.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
Reviewed By: pdillinger
Differential Revision: D36462357
Pulled By: bjlemaire
fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
Summary:
* Add metadata related structs and functions in C API, including
- `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`
that returns `rocksdb_column_family_metadata_t`.
- `rocksdb_column_family_metadata_t` and its get functions & destroy function.
- `rocksdb_level_metadata_t` and its and its get functions & destroy function.
- `rocksdb_file_metadata_t` and its and get functions & destroy functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207
Test Plan:
Extend the existing c_test.c to include additional checks for column_family_metadata
inside CheckCompaction.
Reviewed By: riversand963
Differential Revision: D37305209
Pulled By: ajkr
fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
Summary:
https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`,
`DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
directly expose the error to caller without some additional work.
This is a first version proposal which:
- Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation
- Checks the error during `DB::Open()` and return it to caller if non-ok
This is not very ideal. We can alternatively move the logger creation logic out of the `SanitizeOptions()`.
Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything
in case other callers of `SanitizeOptions()` assumes that a logger should be created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D37321717
Pulled By: riversand963
fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
Summary:
So that DBImpl::RecoverLogFiles do not have to deal with implementation
details of WalFilter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10214
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D37299122
Pulled By: riversand963
fbshipit-source-id: acf1a80f1ef75da393d375f55968b2f3ac189816
Summary:
Add `kRoundRobin` as a compaction priority. The implementation is as follows.
- Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`).
- After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function.
- Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107
Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`
Reviewed By: ajkr
Differential Revision: D37316037
Pulled By: littlepig2013
fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.
Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
Reviewed By: ltamasi
Differential Revision: D37294735
Pulled By: gangliao
fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
Summary:
One consistency check in SaveTo() is dupilcated with the one within Apply(). Remove one of then in release mode to reduce time spent in DB mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10169
Test Plan: Run existing tests and see nothing breaks.
Reviewed By: ltamasi
Differential Revision: D37157821
fbshipit-source-id: 73b89443a20b43362ff66d10b9212022034a8234
Summary:
include "include/rocksdb/blah.h" is messing up some internal
builds vs. include "rocksdb/blah." This fixes the bad case and adds a
check for future instances.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10213
Test Plan: back-port to 7.4 release candidate and watch internal build
Reviewed By: hx235
Differential Revision: D37296202
Pulled By: pdillinger
fbshipit-source-id: d7cc6b2c57d858dff0444f19320d83c8b4f9b185
Summary:
This bug was discovered after write batch checksum verification before WAL is added (https://github.com/facebook/rocksdb/issues/10114) and stress test with write batch checksum protection is turned on (https://github.com/facebook/rocksdb/issues/10037). In this [line](d5d8920f2c/db/write_batch.cc (L2887)), the number of checksums may not be consistent with `batch->Count()`. This PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10201
Test Plan:
```
./db_stress --batch_protection_bytes_per_key=8 --destroy_db_initially=1 --max_key=100000 --use_txn=1
```
Reviewed By: ajkr
Differential Revision: D37260799
Pulled By: cbi42
fbshipit-source-id: ff8dce7dcce295d689333bc9d892d17a843bf0ea
Summary:
`FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced.
The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.
An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.
```
Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185
Test Plan:
- repro unit test
- the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list"
Reviewed By: riversand963
Differential Revision: D37200993
Pulled By: ajkr
fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
- Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
Reviewed By: ajkr, anand1976
Differential Revision: D37135172
Pulled By: hx235
fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`. `MultiGetBlob()` will be included in the next PR.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178
Reviewed By: ltamasi
Differential Revision: D37250507
Pulled By: gangliao
fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b
Summary:
folly DistributedMutex is faster than standard mutexes though
imposes some static obligations on usage. See
https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h
for details. Here we use this alternative for our Cache implementations
(especially LRUCache) for better locking performance, when RocksDB is
compiled with folly.
Also added information about which distributed mutex implementation is
being used to cache_bench output and to DB LOG.
Intended follow-up:
* Use DMutex in more places, perhaps improving API to support non-scoped
locking
* Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently)
Credit: Thanks Siying for reminding me about this line of work that was previously
left unfinished.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179
Test Plan:
for correctness, existing tests. CircleCI config updated.
Also Meta-internal buck build updated.
For performance, ran simultaneous before & after cache_bench. Out of three
comparison runs, the middle improvement to ops/sec was +21%:
Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode
compiler)
```
Complete in 20.201 s; Rough parallel ops/sec = 1584062
Thread ops/sec = 107176
Operation latency (ns):
Count: 32000000 Average: 9257.9421 StdDev: 122412.04
Min: 134 Median: 3623.0493 Max: 56918500
Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63
```
New: (add USE_FOLLY=1)
```
Complete in 16.674 s; Rough parallel ops/sec = 1919135 (+21%)
Thread ops/sec = 135487
Operation latency (ns):
Count: 32000000 Average: 7304.9294 StdDev: 108530.28
Min: 132 Median: 3777.6012 Max: 91030902
Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83
```
Reviewed By: anand1976
Differential Revision: D37182983
Pulled By: pdillinger
fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
Test Plan:
- Manual
- Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
- Enabled in MyShadow for 1+ week
- Automated
- Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
- Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
Reviewed By: cbi42
Differential Revision: D36614569
Pulled By: ajkr
fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
Summary:
There was an interesting code path not covered by testing that
is difficult to replicate in a unit test, which is now covered using a
sync point. Specifically, the case of table_prefix_extractor == null and
!need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which
can happen if table reader is open before extractor is registered with global
object registry, but is later registered and re-set with SetOptions. (We
don't have sufficient testing control over object registry to set that up
repeatedly.)
Also, this function has been renamed to `PrefixRangeMayMatch` for clarity
vs. other functions that are not the same.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122
Test Plan: unit tests expanded
Reviewed By: siying
Differential Revision: D36944834
Pulled By: pdillinger
fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).
This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).
Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB
Reviewed By: riversand963
Differential Revision: D37212647
Pulled By: pdillinger
fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
Summary:
Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements -
1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file
2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used
The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182
Reviewed By: akankshamahajan15
Differential Revision: D37213296
Pulled By: anand1976
fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb
Summary:
In CompactionIterator code, there are multiple places where the process
will abort in dbg mode before logging the error message describing the
cause. This PR changes only the logging behavior for compaction iterator so
that error message is written to LOG before the process aborts in debug
mode.
Also updated the triggering condition for an assertion for single delete with
user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183
Test Plan: make check
Reviewed By: akankshamahajan15
Differential Revision: D37190218
Pulled By: riversand963
fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009
Summary:
A consequence of https://github.com/facebook/rocksdb/issues/9990 was requiring a non-empty DB ID to generate
new SST files. But if the DB ID is not tracked in the manifest and the IDENTITY file
is somehow truncated to 0 bytes, then an empty DB ID would be assigned, leading
to crash. This change ensures a non-empty DB ID is assigned and set in the
IDENTITY file.
Also,
* Some light refactoring to clean up the logic
* (I/O efficiency) If the ID is tracked in the manifest and already matches the
IDENTITY file, don't needlessly overwrite the file.
* (Debugging) Log the DB ID to info log on open, because sometimes IDENTITY
can change if DB is moved around (though it would be unusual for info log to
be copied/moved without IDENTITY file)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10173
Test Plan: unit tests expanded/updated
Reviewed By: ajkr
Differential Revision: D37176545
Pulled By: pdillinger
fbshipit-source-id: a9b414cd35bfa33de48af322a36c24538d50bef1
Summary:
This code is unreachable when `ROCKSDB_LITE` not defined. And it cause build fail on my environment VS2019 16.11.15.
```
-- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19044.
-- The CXX compiler identification is MSVC 19.29.30145.0
-- The C compiler identification is MSVC 19.29.30145.0
-- The ASM compiler identification is MSVC
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10146
Reviewed By: akankshamahajan15
Differential Revision: D37112916
Pulled By: ajkr
fbshipit-source-id: e0b2bf3055d6fac1b3fb40b9f02c4cbae3f82757
Summary:
`PinnableSlice` may hold a handle to a cache value which must be released to correctly decrement the ref-counter. However, when `PinnableSlice` variables are reused, e.g. like this:
```
PinnableSlice pin_slice;
db.Get("foo", &pin_slice);
db.Get("foo", &pin_slice);
```
then the second `Get` simply overwrites the old value in `pin_slice` and the handle returned by the first `Get` is _not_ released.
This PR adds `Reset` calls to the `Get`/`MultiGet` calls that accept `PinnableSlice` arguments to ensure proper cleanup of old values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10166
Reviewed By: hx235
Differential Revision: D37151632
Pulled By: ajkr
fbshipit-source-id: 9dd3c3288300f560531b843f67db11aeb569a9ff
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10155
Reviewed By: ltamasi
Differential Revision: D37150819
Pulled By: gangliao
fbshipit-source-id: b807c7916ea5d411588128f8e22a49f171388fe2
Summary:
We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.
Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154
Test Plan: `make -j24 check`
Reviewed By: pdillinger
Differential Revision: D37124451
Pulled By: guidotag
fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
Summary:
auto_prefix_mode is designed to use prefix filtering in a
particular "safe" set of cases where the upper bound and the seek key
have different prefixes: where the upper bound is the "same length
immediate successor". These conditions are not sufficient to guarantee
the same iteration results as total_order_seek if the DB contains
"short" keys, less than the "full" (maximum) prefix length.
We are not simply disabling the optimization in these successor cases
because it is likely that users are essentially getting what they want
out of existing usage. Especially if users are constructing successor
bounds with the intention of doing a prefix-bounded seek, the existing
behavior is more expected than the total_order_seek behavior.
Consequently, for now we reconcile the bad specification of behavior by
documenting the existing mismatch with total_order_seek.
A closely related issue affects hypothetical comparators like
ReverseBytewiseComparator: if they "correctly" implement
IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more
entries (other than "short" keys noted above). Luckily, the built-in
ReverseBytewiseComparator has an "incorrect" implementation of
IsSameLengthImmediateSuccessor that effectively prevents prefix
optimization and, thus, the bug. This is now documented as a new
constraint on IsSameLengthImmediateSuccessor, and the implementation
tweaked to be simply "safe" rather than "incorrect".
This change also includes unit test updates to demonstrate the above
issues. (Test was cleaned up for readability and simplicity.)
Intended follow-up:
* Tweak documented axioms for prefix_extractor (more details then)
* Consider some sort of fix for this case. I don't know what that would
look like without breaking the performance of existing code. Perhaps
if all keys in an SST file have prefixes that are "full length," we can track
that fact and use it to allow optimization with the "same length
immediate successor", but that would only apply to new files.
* Consider a better system of specifying prefix bounds
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144
Test Plan: test updates included
Reviewed By: siying
Differential Revision: D37052710
Pulled By: pdillinger
fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db
Summary:
FastLRUCache now only supports 16B keys. The tests have changed to reflect this.
Because the unit tests were designed for caches that accept any string as keys, some tests are no longer compatible with FastLRUCache. We have disabled those for runs with FastLRUCache. (We could potentially change all tests to use 16B keys, but we don't because the cache public API does not require this.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10137
Test Plan: make -j24 check
Reviewed By: gitbw95
Differential Revision: D37083934
Pulled By: guidotag
fbshipit-source-id: be1719cf5f8364a9a32bc4555bce1a0de3833b0d
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.
This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```
If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.
Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();
// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
std::shared_ptr<TransactionNotifier> notifier,
TxnTimestamp ts,
std::shared_ptr<const Snapshot>* ret);
// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```
The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```
We also provide two additional APIs for stats collection and reporting purposes.
```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
TxnTimestamp ts_lb,
TxnTimestamp ts_ub,
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```
To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```
Before shutdown, RocksDB will release all timestamped snapshots.
Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.
Timestamped snapshots can also be used with user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```
Reviewed By: siying
Differential Revision: D35783919
Pulled By: riversand963
fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
Summary:
When opening an SST file created using index_type=kHashSearch,
the *current* prefix_extractor would be saved, and used with hash index
if the *new current* prefix_extractor at query time is compatible with
the SST file. This is a problem if the prefix_extractor at SST open time
is not compatible but SetOptions later changes (back) to one that is
compatible.
This change fixes that by using the known compatible (or missing) prefix
extractor we save for use with prefix filtering. Detail: I have moved the
InternalKeySliceTransform wrapper to avoid some indirection and remove
unnecessary fields.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128
Test Plan:
expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails
before fix and probably covers some other previously uncovered cases.
Reviewed By: siying
Differential Revision: D36955738
Pulled By: pdillinger
fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10
Summary:
This PR helps handle the race condition mentioned in this comment thread: https://github.com/facebook/rocksdb/pull/7884#discussion_r572402281 In case where actual full_history_ts_low is higher than the user's requested ts, return a try again message so they don't have the misconception that data between [ts, full_history_ts_low) is kept.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10126
Test Plan:
```
$COMPILE_WITH_ASAN=1 make -j24 all
$./db_with_timestamp_basic_test --gtest_filter=UpdateFullHistoryTsLowTest.ConcurrentUpdate
$ make -j24 check
```
Reviewed By: riversand963
Differential Revision: D37055368
Pulled By: jowlyzhang
fbshipit-source-id: 787fd0984a246540fa03ac227b1d232590d27828
Summary:
RocksDB uses WalManager to manage WAL files. In WalManager::ReadFirstLine(), the assumption is that reading the first record of a valid WAL file will return OK status and set the output sequence to non-zero value.
This assumption has been broken by WAL compression which writes a `kSetCompressionType` record which is not associated with any sequence number.
Consequently, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to caller, e.g. Checkpoint, Backup, causing the operations to fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10130
Test Plan: - Newly Added test
Reviewed By: riversand963
Differential Revision: D36985744
Pulled By: akankshamahajan15
fbshipit-source-id: dfde7b3be68b6a30b75b49479779748eedf29f7f
Summary:
**Summary:**
Add unit tests to verify that the dynamic priority can be passed from compaction to FS. Compaction reads&writes and other DB reads&writes share the same read&write paths to FSRandomAccessFile or FSWritableFile, so a MockTestFileSystem is added to replace the default filesystem from Env to intercept and verify the io_priority. To prepare the compaction input files, use the default filesystem from Env. To test the io priority of the compaction reads and writes, db_options_.fs is set as MockTestFileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10088
Test Plan: Add unit tests.
Reviewed By: anand1976
Differential Revision: D36882528
Pulled By: gitbw95
fbshipit-source-id: 120adc15801966f2b8c9fc45285f590a3fff96d1
Summary:
The default implementation of Close() function in Directory/FSDirectory classes returns `NotSupported` status. However, we don't want operations that worked in older versions to begin failing after upgrading when run on FileSystems that have not implemented Directory::Close() yet. So we require the upper level that calls Close() function should properly handle "NotSupported" status instead of treating it as an error status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10127
Reviewed By: ajkr
Differential Revision: D36971112
Pulled By: littlepig2013
fbshipit-source-id: 100f0e6ad1191e1acc1ba6458c566a11724cf466
Summary:
As pointed out by [https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422](https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422), check `manual_compaction_paused` and `manual_compaction_canceled` can be reduced by setting `*canceled` to be true in `DisableManualCompaction()` and `*canceled` to be false in the last time calling `EnableManualCompaction()`.
Changed Tests: The origin `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused` and the origin CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` as true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator the compaction has been canceled).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070
Test Plan: This change does not introduce new features, but some slight difference in compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test).
Reviewed By: ajkr
Differential Revision: D36949133
Pulled By: littlepig2013
fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821
Summary:
With this change, when a given read timestamp is smaller than the column-family's full_history_ts_low, Get(), MultiGet() and iterators APIs will return Status::InValidArgument().
Test plan
```
$COMPILE_WITH_ASAN=1 make -j24 all
$./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow
$ make -j24 check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109
Reviewed By: riversand963
Differential Revision: D36901126
Pulled By: jowlyzhang
fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86
Summary:
The patch adds some low-level logic that can be used to serialize/deserialize
a sorted vector of wide columns to/from a simple binary searchable string
representation. Currently, there is no user-facing API; this will be implemented in
subsequent stages.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9915
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D35978076
Pulled By: ltamasi
fbshipit-source-id: 33f5f6628ec3bcd8c8beab363b1978ac047a8788
Summary:
If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`,
RocksDB will return the timestamps of the point tombstones.
Note: DeleteRange is still unsupported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D36677956
Pulled By: riversand963
fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae
Summary:
Closing https://github.com/facebook/rocksdb/issues/10080
When `SyncWAL()` calls `MarkLogsSynced()`, even if there is only one active WAL file,
this event should still be added to the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10087
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D36797580
Pulled By: riversand963
fbshipit-source-id: 24184c9dd606b3939a454ed41de6e868d1519999