Summary:
Fix build error: variable 'base_level' may be uninitialized
```
db_impl_compaction_flush.cc:1195:21: error: variable 'base_level' may be uninitialized when used here [-Werror,-Wconditional-uninitialized]
level = base_level;
```
^~~~~~~~~~
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11435
Test Plan: CircleCI jobs
Reviewed By: cbi42
Differential Revision: D45708176
Pulled By: akankshamahajan15
fbshipit-source-id: 851b1205b22b63d728495e5735fa91b0ad8e012b
Summary:
**Context:**
We prefetch the tail part of a SST file (i.e, the blocks after data blocks till the end of the file) during each SST file open in hope to prefetch all the stuff at once ahead of time for later read e.g, footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky case (e.g, small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decide to record the exact tail size and use it directly to prefetch tail of the SST instead of relying heuristics.
**Summary:**
- Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()`
- For backward compatibility, we fall back to TailPrefetchStats and last to simple heuristics that the tail size is a linear portion of the file size - see PR conversation for more.
- Make`tail_start_offset` part of the table properties and deduct tail size to record in manifest for external files (e.g, file ingestion, import CF) and db repair (with no access to manifest).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406
Test Plan:
1. New UT
2. db bench
Note: db bench on /tmp/ where direct read is supported is too slow to finish and the default pinning setting in db bench is not helpful to profile # sst read of Get. Therefore I hacked the following to obtain the following comparison.
```
diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
--- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
&tail_prefetch_size);
// Try file system prefetch
- if (!file->use_direct_io() && !force_direct_prefetch) {
+ if (false && !file->use_direct_io() && !force_direct_prefetch) {
if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
.IsNotSupported()) {
prefetch_buffer->reset(new FilePrefetchBuffer(
diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
--- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
} else {
BlockBasedTableOptions block_based_options;
+ block_based_options.metadata_cache_options.partition_pinning =
+ PinningTier::kAll;
block_based_options.checksum =
static_cast<ChecksumType>(FLAGS_checksum_type);
if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (Use TailPrefetchStats for tail size + use seperate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```
(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```
As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR
3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.
Reviewed By: pdillinger
Differential Revision: D45413346
Pulled By: hx235
fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
Summary:
the test is flaky when compiled with `make -j56 COERCE_CONTEXT_SWITCH=1 ./db_universal_compaction_test`. The cause is that a manual compaction `CompactRange()` can finish and return before obsolete files are deleted. One reason for this is that a manual compaction waits until `manual.done` is set here 62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L1978)
and the compaction thread can set `manual.done`:
62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L3672)
and then temporarily release mutex_:
62fc15f009/db/db_impl/db_impl_files.cc (L317)
before purging obsolete files:
62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L3144)
With `COERCE_CONTEXT_SWITCH=1`, `bg_cv_.SignalAll()` is called during `mutex_.Lock()`, so the manual compaction thread can wake up and return before obsolete files are deleted. Updated the test to only count live SST files.
Also updated `FindObsoleteFiles()` to avoid locking a locked mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11412
Test Plan: `make -j56 COERCE_CONTEXT_SWITCH=1 ./db_universal_compaction_test`
Reviewed By: ajkr
Differential Revision: D45342242
Pulled By: cbi42
fbshipit-source-id: 955c9796aa3f484e3557d300f97cffacb3ed9b0c
Summary:
when I use g++-13 to exec the `make all` command, the output throws the warnings.
```
db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
376 | env_, GenerateFileName(file_number), std::move(contents)));
| ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
375 | EXPECT_OK(mock_table_factory_->CreateMockTable(
| ^~~~~~~~~
db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
376 | env_, GenerateFileName(file_number), std::move(contents)));
| ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
375 | EXPECT_OK(mock_table_factory_->CreateMockTable(
| ^~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
```
and I also add some `(void)unused_variable` statements because of the cmake argument `-Wunused-but-set-variable -Wunused-but-set-variable`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418
Reviewed By: akankshamahajan15
Differential Revision: D45528223
Pulled By: ajkr
fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
Summary:
When a DB is opened, RocksDB creates an empty WAL file. When the DB is reopened and the WAL is empty, the min log number to keep is not advanced until a memtable flush happens. If a process crashes soon after reopening the DB, its likely that no memtable flush would have happened, which means the empty WAL file is not deleted. In a crash loop scenario, this leads to empty WAL files accumulating. Fix this by ensuring the min log number is advanced if the WAL is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11409
Test Plan: Add a unit test
Reviewed By: ajkr
Differential Revision: D45281685
Pulled By: anand1976
fbshipit-source-id: 0225877c613e65ffb30972a0051db2830105423e
Summary:
For better clarity, encouraging more options explicitly specified using fields rather than positionally via constructor parameter lists. Simplifies code maintenance as new fields are added. Deprecate some cases of the confusing pattern of NewWhatever() functions returning shared_ptr.
Net reduction of about 70 source code lines (including comments).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11386
Test Plan: existing tests
Reviewed By: ajkr
Differential Revision: D45059075
Pulled By: pdillinger
fbshipit-source-id: d53fa09b268024f9c55254bb973b6c69feebf41a
Summary:
add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates
Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.
Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):
```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
The readrandom ops/sec looks like the following:
Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB
Main 240805 223604 198176 161653 139040
PR prot_bytes=0 238691 226693 200127 161082 141153
PR prot_bytes=1 214983 193199 178532 137013 108211
prot_bytes=1 vs -10% -15% -10.8% -15% -23%
prot_bytes=0
```
The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
Reviewed By: ajkr
Differential Revision: D43970708
Pulled By: cbi42
fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
Summary:
Tweak some bounds and things, and reduce risk of surprise results by running on all supported compressions (mostly).
Also improves the precise compressibility of CompressibleString by using RandomBinaryString.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11396
Test Plan: updated tests
Reviewed By: ltamasi
Differential Revision: D45211938
Pulled By: pdillinger
fbshipit-source-id: 9dc1dd8574a60a9364efe18558be66d31a35598b
Summary:
## Option API updates
* Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7.
* Remove unnecessary constructor for CompressionOptions.
* Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated).
## Stat API updates
* Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed.
* Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing.
* New or existing tickers relevant to compression:
* BYTES_COMPRESSED_FROM
* BYTES_COMPRESSED_TO
* BYTES_COMPRESSION_BYPASSED
* BYTES_COMPRESSION_REJECTED
* COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing)
* NUMBER_BLOCK_COMPRESSED (existing)
* NUMBER_BLOCK_COMPRESSION_BYPASSED
* NUMBER_BLOCK_COMPRESSION_REJECTED
* BYTES_DECOMPRESSED_FROM
* BYTES_DECOMPRESSED_TO
We can compute a number of things with these stats:
* "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO
* Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED)
* Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_REJECTED + NUMBER_BLOCK_COMPRESSION_REJECTED
* Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES
Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression.
## BlockBasedTableBuilder
Various small refactorings, optimizations, and name clean-ups.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388
Test Plan:
unit tests added
* `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field.
Reviewed By: ajkr
Differential Revision: D45128202
Pulled By: pdillinger
fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.
**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
- Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob)
- May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
// PlainTable
Does not apply
```
- **Db bench 2: performance**
**Read**
SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
rocksdb.sst.read.micros COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec`
Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec`
Reviewed By: ajkr
Differential Revision: D44007011
Pulled By: hx235
fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
Summary:
This test exhibited the following flaky failure:
```
db/db_write_test.cc:653: Failure
db_->Resume()
Corruption: Not active
```
I was able to repro it by applying the following patch to coerce a specific race condition:
```
diff --git a/db/db_write_test.cc b/db/db_write_test.cc
index d82c57376..775ba3cde 100644
--- a/db/db_write_test.cc
+++ b/db/db_write_test.cc
@@ -636,6 +636,10 @@ TEST_P(DBWriteTest, LockWALInEffect) {
ASSERT_TRUE(dbfull()->WALBufferIsEmpty());
ASSERT_OK(db_->UnlockWAL());
+ // Test thread: sleep interval: [0, 3)
+ // In this interval, the file system is active
+ sleep(3);
+
// Fail the WAL flush if applicable
fault_fs->SetFilesystemActive(false);
Status s = Put("key2", "value");
@@ -649,6 +653,11 @@ TEST_P(DBWriteTest, LockWALInEffect) {
ASSERT_OK(db_->LockWAL());
ASSERT_OK(db_->UnlockWAL());
}
+
+ // Test thread: sleep interval: [3, 6)
+ // In this interval, the file system is inactive
+ sleep(3);
+
fault_fs->SetFilesystemActive(true);
ASSERT_OK(db_->Resume());
// Writes should work again
diff --git a/db/flush_job.cc b/db/flush_job.cc
index 8193f594f..602ee2c9f 100644
--- a/db/flush_job.cc
+++ b/db/flush_job.cc
@@ -979,6 +979,10 @@ Status FlushJob::WriteLevel0Table() {
DirFsyncOptions(DirFsyncOptions::FsyncReason::kNewFileSynced));
}
TEST_SYNC_POINT_CALLBACK("FlushJob::WriteLevel0Table", &mems_);
+ // Flush thread: sleep interval: [0, 4)
+ // Upon awakening, the file system will be inactive. Then the MANIFEST
+ // update will fail.
+ sleep(4);
db_mutex_->Lock();
}
base_->Unref();
```
The fix for this scenario is explained in the code change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11382
Reviewed By: cbi42
Differential Revision: D45027632
Pulled By: ajkr
fbshipit-source-id: 6bfa35a5781c0c080fb74e13f2b2c9f871f7effb
Summary:
In CircleCI build-linux-arm-test-full job (https://app.circleci.com/pipelines/github/facebook/rocksdb/26462/workflows/a9d39d2c-c970-4b0f-9c10-7743beb9771b/jobs/591722), this test exhibited the following flaky failure:
```
db/db_bloom_filter_test.cc:2506: Failure
Expected: (TestGetTickerCount(options, BLOOM_FILTER_USEFUL)) > (65000 * 2), actual: 120558 vs 130000
```
I ssh'd to an instance and observed it cuts memtables at slightly different points across runs. Logging in `ConcurrentArena` pointed to `try_lock()` returning false at different points across runs.
This PR changes the approach to allow a fixed number of keys per memtable flush. I verified the bloom filter useful count is deterministic now even on the CircleCI ARM instance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11383
Reviewed By: cbi42
Differential Revision: D45036829
Pulled By: ajkr
fbshipit-source-id: b602dacb63955f1af09bf0ed409cde0552805a08
Summary:
When calculating the largest_key in ImportColumnFamilyJob::GetIngestedFileInfo, only the first element of range_del_iter is calculated. If range_del_iter has multiple elements, the largest_key will be wrong
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11381
Reviewed By: cbi42
Differential Revision: D44981450
Pulled By: ajkr
fbshipit-source-id: 584bc7da86295568a96984d2951644f289e578c7
Summary:
Before this PR, in `LevelCompactionBuilder::TryExtendNonL0TrivialMove(index)`, we start from a file at index and expand the compaction input towards right to find files to trivial move. This PR adds the logic to also expand towards left.
Another major change made in this PR is to not expand L0 files through `TryExtendNonL0TrivialMove()`. This happens currently when compacting L0 files to an empty output level. The condition for expanding files in `TryExtendNonL0TrivialMove()` is to check atomic boundary, which does not take into account that L0 files can overlap in key range and are not sorted in key order. So it may include more L0 files than needed and disallow a trivial move. This change is included in this PR so that we don't make it worse by always expanding L0 in both direction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11347
Test Plan:
* new unit test
* Benchmark does not show obvious improvement or regression:
```
Write sequentially
./db_bench --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216
Main:
fillseq : 4.726 micros/op 211592 ops/sec 472.607 seconds 100000000 operations; 23.4 MB/s
This PR:
fillseq : 4.755 micros/op 210289 ops/sec 475.534 seconds 100000000 operations; 23.3 MB/s
Write randomly
./db_bench --benchmarks=fillrandom --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216
Main:
fillrandom : 16.351 micros/op 61159 ops/sec 1635.066 seconds 100000000 operations; 6.8 MB/s
This PR:
fillrandom : 15.798 micros/op 63298 ops/sec 1579.817 seconds 100000000 operations; 7.0 MB/s
```
Reviewed By: ajkr
Differential Revision: D44645650
Pulled By: cbi42
fbshipit-source-id: 8631f3a6b3f01decbbf18c34f2b62833cb4f9733
Summary:
**Context/Summary:**
ASSERT_EQ will only verify the code of Status, but will not check the state message of Status.
- Assert by checking Status state in `ImportColumnFamilyTest`
- Forgot to set db_comparator_name when creating ExportImportFilesMetaData in `ImportColumnFamilyNegativeTest`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11372
Reviewed By: ajkr
Differential Revision: D45004343
Pulled By: cbi42
fbshipit-source-id: a13d45521df17ead3d6d4c1c1fe1e4c95397ce8b
Summary:
Makes it easier to use generated Rust bindings. Constness of these is already part of the C++ API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11243
Reviewed By: hx235
Differential Revision: D44840394
Pulled By: ajkr
fbshipit-source-id: bcd1aeb8c959c304148d25b00043bb8c4cd3e0a4
Summary:
If RocksDB enables user-defined timestamp, then RocksDB read path can filter table files by the min/max timestamps of each file. If application wants to lookup a key that is the most recent and visible to a certain timestamp ts, then we can compare ts with the min_ts of each file. If ts < min_ts, then we know all keys in the file is not visible at time ts, then we do not have to open the file. This can also save an IO.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11332
Reviewed By: pdillinger
Differential Revision: D44763497
Pulled By: guowentian
fbshipit-source-id: abde346b9f18480fe03c04e4006e7d62aa9c22a8
Summary:
When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340
Test Plan:
- New unit tests
- `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.
Reviewed By: ajkr
Differential Revision: D44563884
Pulled By: cbi42
fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
Summary:
VerifyFileChecksums currently interprets the readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size. Hence each read is readahead_size * 2. This change treats it as chunks of readahead_size for checksum calculation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328
Test Plan: Add a unit test
Reviewed By: pdillinger
Differential Revision: D44718781
Pulled By: anand1976
fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
Summary:
**Context/Summary:**
- Allow runtime changes to whether `WriteBufferManager` allows stall or not by calling `SetAllowStall()`
- Misc: some clean up - see PR conversation
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11335
Test Plan: - New UT
Reviewed By: akankshamahajan15
Differential Revision: D44502555
Pulled By: hx235
fbshipit-source-id: 24b5cc57df7734b11d42e4870c06c87b95312b5e
Summary:
**Context/Summary:**
Motived by user need of investigating db iterator behavior during an interval of any time length of a certain thread, we decide to collect and expose related counters in `PerfContext` as an experimental feature, in addition to the existing db-scope ones (i.e, tickers)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11320
Test Plan:
- new UT
- db bench
Setup
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -compression_type=none -bloom_bits=3
```
Test till converges
```
./db_bench -seed=1679526311157283 -use_existing_db=1 -perf_level=2 -db=/dev/shm/testdb/ -benchmarks="seekrandom[-X60]"
```
pre-change
`seekrandom [AVG 33 runs] : 7545 (± 100) ops/sec`
post-change (no regression)
`seekrandom [AVG 33 runs] : 7688 (± 67) ops/sec`
Reviewed By: cbi42
Differential Revision: D44321931
Pulled By: hx235
fbshipit-source-id: f98a254ba3e3ced95eb5928884e33f1b99dca401
Summary:
…evel_bytes
During DB open, if a column family uses level compaction with level_compaction_dynamic_level_bytes=true, trivially move its files down in the LSM such that the bottommost files are in Lmax, the second from bottommost level files are in Lmax-1 and so on. This is aimed to make it easier to migrate level_compaction_dynamic_level_bytes from false to true. Before this change, a full manual compaction is suggested for such migration. After this change, user can just restart DB to turn on this option. db_crashtest.py is updated to randomly choose value for level_compaction_dynamic_level_bytes.
Note that there may still be too many unnecessary levels if a user is migrating from universal compaction or level compaction with a smaller level multiplier. A full manual compaction may still be needed in that case before some PR that automatically drain unnecessary levels like https://github.com/facebook/rocksdb/issues/3921 lands. Eventually we may want to change the default value of option level_compaction_dynamic_level_bytes to true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11321
Test Plan:
1. Added unit tests.
2. Crash test: ran a variation of db_crashtest.py (like 32516507e77521ae887e45091b69139e32e8efb7) that turns level_compaction_dynamic_level_bytes on and off and switches between LC and UC for the same DB.
TODO: Update `OptionChangeMigration`, either after this PR or https://github.com/facebook/rocksdb/issues/3921.
Reviewed By: ajkr
Differential Revision: D44341930
Pulled By: cbi42
fbshipit-source-id: 013de19a915c6a0502be569f07c4cc8f1c3c6be2
Summary:
In DBCompactionTest::CancelCompactionWaitingOnConflict, when generating SST files to trigger a compaction, we don't wait after each file, which may cause multiple memtables going to the same SST file, causing insufficient files to trigger the compaction. We do the waiting instead, except the last one, which would trigger compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11318
Test Plan: Run DBCompactionTest.CancelCompactionWaitingOnConflict multiple times.
Reviewed By: ajkr
Differential Revision: D44267273
fbshipit-source-id: 86af49b05fc67ea3335312f0f5f3d22df1520bf8
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.
## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.
## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
## Crash/stress test
Updated to use the new combination.
## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```
**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127
**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)
This is an acceptable difference IMHO.
Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec
So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
**Before**:
multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec
Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
Let's also look at Get() in db_bench:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec
**After**:
readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec
That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
**Before**:
readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec
**After**:
readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec
Reviewed By: ltamasi
Differential Revision: D44177044
Pulled By: pdillinger
fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
Summary:
In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296
Test Plan: Add new unit tests
Reviewed By: akankshamahajan15
Differential Revision: D44045776
Pulled By: anand1976
fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
Summary:
In preparation for factoring secondary cache support out of individual Cache implementations, we can get rid of the "in secondary cache" flag on entries through a workable hack: when an entry is promoted from secondary, it is inserted in primary using a helper that lacks secondary cache support, thus preventing re-insertion into secondary cache through existing logic.
This adds to the complexity of building CacheItemHelpers, because you always have to be able to get to an equivalent helper without secondary cache support, but that complexity is reasonably isolated within RocksDB typed_cache.h and test code.
gcc-7 seems to have problems with constexpr constructor referencing `this` so removed constexpr support on CacheItemHelper.
Also refactored some related test code to share common code / functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11299
Test Plan: existing tests
Reviewed By: anand1976
Differential Revision: D44101453
Pulled By: pdillinger
fbshipit-source-id: 7a59d0a3938ee40159c90c3e65d7004f6a272345
Summary:
... ahead of a larger change.
* Rename confusingly named `is_in_sec_cache` to `kept_in_sec_cache`
* Unify naming of "standalone" block cache entries (was "detached" in clock_cache)
* Remove some unused definitions in clock_cache.h (leftover from a previous revision)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11291
Test Plan: usual tests and CI, no behavior changes
Reviewed By: anand1976
Differential Revision: D43984642
Pulled By: pdillinger
fbshipit-source-id: b8bf0c5b90a932a88bcbdb413b2f256834aedf97
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
Summary:
In rare cases seeing failures like this
```
[ RUN ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/2
db/db_write_test.cc:653: Failure
Put("key3", "value")
Corruption: Not active
```
in a test with no explicit threading. This is likely because of the unpredictability of background auto-resume. I didn't really know this feature, in part because DB::Resume() was undocumented. So I believe I have fixed the test and documented the API function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11290
Test Plan: 1000s of stress runs of the test with gtest-parallel
Reviewed By: anand1976
Differential Revision: D43984583
Pulled By: pdillinger
fbshipit-source-id: d30dec120b4864e193751b2e33ff16834d313db3
Summary:
CreateColumnFamilyWithImport() did not support range tombstones for two reasons:
1. it uses point keys of a input file to determine its boundary (smallest and largest internal key), which means range tombstones outside of the point key range will be effectively dropped.
2. it does not handle files with no point keys.
Also included a fix in external_sst_file_ingestion_job.cc where the blocks read in `GetIngestedFileInfo()` can be added to block cache now (issue fixed in https://github.com/facebook/rocksdb/pull/6429).
This PR adds support for exporting and importing column family with range tombstones. The main change is to add smallest internal key and largest internal key to `SstFileMetaData` that will be part of the output of `ExportColumnFamily()`. Then during `CreateColumnFamilyWithImport(...,const ExportImportFilesMetaData& metadata,...)`, file boundaries can be set from `metadata` directly. This is needed since when file boundaries are extended by range tombstones, sometimes they cannot be deduced from a file's content alone.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11252
Test Plan:
- added unit tests that fails before this change
Closes https://github.com/facebook/rocksdb/issues/11245
Reviewed By: ajkr
Differential Revision: D43577443
Pulled By: cbi42
fbshipit-source-id: 6bff78e583cc50c44854994dea0a8dd519398f2f
Summary:
The existing PerfContext counter `internal_merge_count` only tracks the
Merge operands applied during range scans. The patch adds a new counter
called `internal_merge_count_point_lookups` to track the same metric
for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
also fixes a couple of cases in the iterator where the existing counter wasn't
updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284
Test Plan: `make check`
Reviewed By: jowlyzhang
Differential Revision: D43926082
Pulled By: ltamasi
fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
Summary:
`BlobSourceCacheReservationTest.IncreaseCacheReservationOnFullCache` is both flaky and also doesn't do what its name says. The patch changes this test so it actually tests increasing the cache reservation, hopefully also deflaking it in the process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11273
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D43800935
Pulled By: ltamasi
fbshipit-source-id: 5eb54130dfbe227285b0e14f2084aa4b89f0b107
Summary:
Hi. :) Noticed we are copying ColumnFamilyDescriptor here because my process crashed during copy constructor (cause unrelated)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10978
Reviewed By: cbi42
Differential Revision: D41473924
Pulled By: ajkr
fbshipit-source-id: 58a3473f2d7b24918f79d4b2726c20081c5e95b4
Summary:
During backward iteration, blob verification would fail because the user key (ts included) in `saved_key_` doesn't match the blob. This happens because during`FindValueForCurrentKey`, `saved_key_` is not updated when the user key(ts not included) is the same for all cases except when `timestamp_lb_` is specified. This breaks the blob verification logic when user defined timestamp is enabled and `timestamp_lb_` is not specified. Fix this by always updating `saved_key_` when a smaller user key (ts included) is seen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11258
Test Plan:
`make check`
`./db_blob_basic_test --gtest_filter=DBBlobWithTimestampTest.IterateBlobs`
Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced with:
`./db_bench -user_timestamp_size=8 -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5`
Baseline:
- seekrandom [AVG 6 runs] : 72188 (± 1481) ops/sec; 37.2 (± 0.8) MB/sec
With this PR:
- seekrandom [AVG 6 runs] : 74171 (± 1427) ops/sec; 38.2 (± 0.7) MB/sec
Reviewed By: ltamasi
Differential Revision: D43675642
Pulled By: jowlyzhang
fbshipit-source-id: 8022ae8522d1f66548821855e6eed63640c14e04
Summary:
This makes it possible to eliminate some copies in `GetEntity` / `MultiGetEntity`,
in particular when `Merge`s or blobs are involved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11248
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D43544215
Pulled By: ltamasi
fbshipit-source-id: bc4c8955a24bbd8bc4ab098e72133ead757f9707
Summary:
A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction outputs is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).
The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113
Test Plan:
* added unit test in db_range_del_test
* crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`
Reviewed By: ajkr
Differential Revision: D42655709
Pulled By: cbi42
fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
Summary:
Fix complain
```
db/db_impl/db_impl_compaction_flush.cc:417:19: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type
'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis]
for (const auto bg_flush_arg : bg_flush_args) {
^
db/db_impl/db_impl_compaction_flush.cc:417:8: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying
for (const auto bg_flush_arg : bg_flush_args) {
^~~~~~~~~~~~~~~~~~~~~~~~~
&
db/db_impl/db_impl_compaction_flush.cc:2911:21: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type
'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis]
for (const auto bg_flush_arg : bg_flush_args) {
^
db/db_impl/db_impl_compaction_flush.cc:2911:10: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying
for (const auto bg_flush_arg : bg_flush_args) {
^~~~~~~~~~~~~~~~~~~~~~~~~
&
```
from
```sh
xxx@MacBook-Pro / % g++ -v
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 12.0.0 (clang-1200.0.32.29)
Target: x86_64-apple-darwin21.6.0
Thread model: posix
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11240
Reviewed By: cbi42
Differential Revision: D43458729
Pulled By: ajkr
fbshipit-source-id: 26e110f83451509463a1bc308f737ccb693c9f45
Summary:
in DBIter::SeekToLast(), key() can be called when iter is invalid and fails the following assertion:
```
./db/db_iter.h:153: virtual rocksdb::Slice rocksdb::DBIter::key() const: Assertion `valid_' failed.
```
This happens when `iterate_upper_bound` and timestamp_lb_ are set. SeekForPrev(*iterate_upper_bound_) positions the iterator on the same user key as *iterate_upper_bound_. A subsequent PrevInternal() call makes the iterator invalid just be the call to key().
This PR fixes this issue by setting updating the seek key to have max sequence number AND max timestamp when the seek key has the same user key as *iterate_upper_bound_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11223
Test Plan: - Added a unit test that would fail the above assertion before this fix.
Reviewed By: jowlyzhang
Differential Revision: D43283600
Pulled By: cbi42
fbshipit-source-id: 0dd3999845b722584679bbc95be2664b266005ba
Summary:
This PR adds support to the c-api bindings for calling `Flush()` with multiple column families, which is useful for performing atomic flushes (assuming also that the db has been opened with `atomic_flush = true`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11112
Reviewed By: cbi42
Differential Revision: D42666382
Pulled By: ajkr
fbshipit-source-id: 82f05bf32d28452d85c79ea42411c8fea961fd87
Summary:
The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available. With the removal of LITE mode, the function was no longer required.
Note that the MergeOperator had some private classes defined in header files. To gain access to their constructors (and name methods), the class definitions were moved into header files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203
Reviewed By: cbi42
Differential Revision: D43160255
Pulled By: pdillinger
fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c
Summary:
From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.
This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231
Test Plan: updated unit test
Reviewed By: akankshamahajan15
Differential Revision: D43396607
Pulled By: ajkr
fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9
Summary:
The new `MultiGetEntity` API can be used to get a consistent view of
a batch of keys, with the results presented as wide-column entities.
Similarly to `GetEntity` and the iterator's `columns` API, if the entry
corresponding to the key is a wide-column entity to start with, it is
returned as-is, and if it is a plain key-value, it is wrapped into an entity
with a single default column.
Implementation-wise, the new API shares the logic of the batched `MultiGet`
API (via the `MultiGetCommon` methods). Both single-CF and multi-CF
`MultiGetEntity` APIs are provided, and blobs are also supported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11222
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D43256950
Pulled By: ltamasi
fbshipit-source-id: 47fb2cb7e2d0470e3580f43fdb2fe9e51f0e7005
Summary:
The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.
In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544
Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192
Test Plan: `make check`
Reviewed By: anand1976
Differential Revision: D43055487
Pulled By: pdillinger
fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
Summary:
Example failure:
```
[ RUN ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/1
db/db_write_test.cc:646: Failure
Put("key3", "value")
Corruption: Not active
```
Presumably from a background compaction prior to Put.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11209
Test Plan: watch CI
Reviewed By: akankshamahajan15
Differential Revision: D43147727
Pulled By: pdillinger
fbshipit-source-id: a1c34ac5ab124bfe2f23205a30777990056e9082
Summary:
Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug is when consecutive fragments of a compressed record are located at the same offset in the log reader buffer, the second fragment input buffer is treated as a leftover from the previous input buffer. As a result, the offset in the `ZSTD_inBuffer` is not reset.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198
Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it.
Reviewed By: ajkr, cbi42
Differential Revision: D43102692
Pulled By: anand1976
fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd
Summary:
The patch adds compaction filter support for wide-column entities by introducing
a new `CompactionFilter` API called `FilterV3`. This API is called for regular
key-values, merge operands, and wide-column entities as well. It is passed the
existing value/operand or wide-column structure and it can update the value or
columns or keep/delete/etc. the key-value as usual. For compatibility, the default
implementation of `FilterV3` keeps all wide-column entities and falls back to calling
`FilterV2` for plain old key-values and merge operands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D43094147
Pulled By: ltamasi
fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd
Summary:
This option has long been intended to be set to false by default and deprecated. It might never be practical to completely remove the feature, so that we can continue to test for backward compatibility by keeping the ability to generate DBs in the old way.
Also improved API comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11179
Test Plan: existing tests (with one tiny update)
Reviewed By: hx235
Differential Revision: D42973927
Pulled By: pdillinger
fbshipit-source-id: e9bc161cb933266e094aea2dff8cc03753c39dab
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11160
By counting the number of stalls placed on a write queue, we can check in UnlockWAL() whether the stall present at the start of UnlockWAL() has been cleared by the end, or wait until it's cleared.
More details in code comments and new unit test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11172
Test Plan: unit test added. Yes, it uses sleep to amplify failure on buggy behavior if present, but using a sync point to only allow new behavior would fail with the old code only because it doesn't contain the new sync point. Basically, using a sync point in UnlockWAL() could easily mask a regression by artificially limiting key behaviors. The test would only check that UnlockWAL() invokes code that *should* do the right thing, without checking that it *does* the right thing.
Reviewed By: ajkr
Differential Revision: D42894341
Pulled By: pdillinger
fbshipit-source-id: 15c9da0ca383e6aec845b29f5447d76cecbf46c3
Summary:
Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user.
Tests:
Update existing unit tests and add a new one for this scenario
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171
Reviewed By: akankshamahajan15
Differential Revision: D42950057
Pulled By: anand1976
fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869
Summary:
The patch makes some code quality enhancements in `CompactionIterator::InvokeFilterIfNeeded`
including the renaming of `filter` (which is most likely a remnant of the days before the `FilterV2`
API when the compaction filter used to return a boolean) to `decision`, the removal of some
outdated comments, the elimination of an `error` flag which was only used in one failure case
out of many, as well as some small stylistic improvements. (Some the above will also come in
handy when adding compaction filter support for wide-column entities.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11174
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D42901408
Pulled By: ltamasi
fbshipit-source-id: ab382d59a4990c5dfe1cee219d49e1d80902b666
Summary:
This PR adds logic to the `RunManualCompaction()` loop to check for cancellation before waiting on any conflicting compactions to finish. In case of cancellation, `RunManualCompaction()` no longer waits on conflicting compactions
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11165
Test Plan: repro test case
Reviewed By: cbi42
Differential Revision: D42864058
Pulled By: ajkr
fbshipit-source-id: ea4dd1a8f294abe212905495a8fbe8f07fca3f5a
Summary:
The patch fixes a feature interaction bug between BlobDB and the `GetEntity` API:
without the patch, `GetEntity` would return the blob reference (wrapped into a
single-column entity) instead of the actual blob value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11162
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D42854092
Pulled By: ltamasi
fbshipit-source-id: f750d0ff57def107da16f545077ddce9860ff21a
Summary:
The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:
* Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
* Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
* Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
* Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
* Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
* Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
Test Plan: unit tests added / updated, added to stress/crash test
Reviewed By: ajkr
Differential Revision: D42848627
Pulled By: pdillinger
fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
Summary:
Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.
Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105
Test Plan:
make V=1 db_blob_basic_test
./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"
Reviewed By: ltamasi
Differential Revision: D42716487
Pulled By: jowlyzhang
fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.
Most of changes were done through following comments:
unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
Test Plan: See CI
Reviewed By: pdillinger
Differential Revision: D42796341
fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11136
Test Plan: the provided unit test used to fail due to `GetMergeOperands()` returning `Status::MergeInProgress()`; it passes now because the `GetMergeOperands()` call returns `Status::OK()`
Reviewed By: pdillinger
Differential Revision: D42759198
Pulled By: ajkr
fbshipit-source-id: 878f9f40ccc1d7e2fe7b1352814bae3a49c19939
Summary:
Migrate derived classes from EnvWrapper to FileSystemWrapper so we can eventually deprecate the storage methods in Env.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11125
Test Plan: CircleCI jobs
Reviewed By: anand1976
Differential Revision: D42732241
Pulled By: akankshamahajan15
fbshipit-source-id: c89a70a79fcfb13e158bf8919b1a87a9de133222
Summary:
PR https://github.com/facebook/rocksdb/issues/11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. https://github.com/facebook/rocksdb/issues/11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader.
This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue.
HISTORY not updated because this is intended for the same release where the bug was introduced.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11130
Test Plan:
Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case.
Will also run crash_test_with_multiops_wc_txn for a while looking for issues.
Reviewed By: ajkr
Differential Revision: D42749330
Pulled By: pdillinger
fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b
Summary:
Migrate ErrorEnv from EnvWrapper to FileSystemWrapper so we can eventually deprecate the storage methods in Env.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11124
Reviewed By: akankshamahajan15
Differential Revision: D42727791
Pulled By: anand1976
fbshipit-source-id: e8362ad624dc28e55c99fc35eda12866755f62c6
Summary:
Compressed block cache is replaced by compressed secondary cache. Remove the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117
Test Plan: See CI passes
Reviewed By: pdillinger
Differential Revision: D42700164
fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
Summary:
Capture more of the original intent at a high level, without getting bogged down in low-level details.
The old text made some weak promises about handling of LOCK files. There should be no specific concern for LOCK files, because we already rely on LockFile() to create the file if it's not present already. And the lock file is generally size 0, so don't have to worry about truncation. Added a unit test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11085
Test Plan: existing tests, and a new one.
Reviewed By: siying
Differential Revision: D42713233
Pulled By: pdillinger
fbshipit-source-id: 2fce7c974d35fac065037c9c4c7326a59c9fe340
Summary:
**Context:**
Concurrent flushes on the same CF can set on `ColumnFamilyData::flush_reason` before each other flush finishes. An symptom is one CF has different flush_reason with others though all of them are in an atomic flush `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed. `
**Summary:**
Suggested by ltamasi, we now refactor and let FlushRequest/Job to own flush_reason as there is no good way to define `ColumnFamilyData::flush_reason` in face of concurrent flushes on the same CF (which wasn't the case a long time ago when `ColumnFamilyData::flush_reason ` first introduced`)
**Tets:**
- new unit test
- make check
- aggressive crash test rehearsal
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111
Reviewed By: ajkr
Differential Revision: D42644600
Pulled By: hx235
fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf
Summary:
Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.
To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092
Reviewed By: siying
Differential Revision: D42525673
Pulled By: ajkr
fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
Summary:
the `last_tombstone_start_user_key` variable in `BuildTable()` and in `CompactionOutputs::AddRangeDels()` may point to a start key that is freed if user-defined timestamp is enabled. This was causing ASAN failure and this PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11106
Test Plan: Added UT for repro.
Reviewed By: ajkr
Differential Revision: D42590862
Pulled By: cbi42
fbshipit-source-id: c493265ececdf89636d801d55ae929806c4d4b2c
Summary:
in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075
Test Plan: - Added new unit test.
Reviewed By: hx235
Differential Revision: D42398739
Pulled By: cbi42
fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7
Summary:
in `CompactionOutputs::AddRangeDels()`, range tombstones with the same start and end key but different sequence numbers all contribute to compensated range tombstone size. This PR removes this redundancy. This PR also includes a fix from https://github.com/facebook/rocksdb/issues/11067 where a range tombstone that is not within a file's range was being added to the file. This fixes an assertion failure for `icmp.Compare(start, end) <= 0` in VersionSet::ApproximateSize() when calculating compensated range tombstone size. Assertions and a comment/essay was added to reason that no such range tombstone will be added after this fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11091
Test Plan:
- Added unit tests
- Stress test with small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10`
Reviewed By: ajkr
Differential Revision: D42521588
Pulled By: cbi42
fbshipit-source-id: 5bda3fe38997995314e1f7592319af12b69bc4f8
Summary:
This reverts commit f02c708aa3 since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089
Test Plan: existing CI.
Reviewed By: ajkr
Differential Revision: D42505972
Pulled By: cbi42
fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
Summary:
valgrind build for `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithMixedValueType` and `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno` started failing (see error message in T141554665). I could not repro but I suspect it is due to file ingestion range overlapping with ongoing compaction, which caused a new global seqno being assigned after https://github.com/facebook/rocksdb/issues/10988.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11070
Test Plan: monitor future valgrind tests result.
Reviewed By: hx235
Differential Revision: D42319056
Pulled By: cbi42
fbshipit-source-id: acbcd841a2a15e36b278f39ba514f4b9a6ee43ca
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in 0f88160f67, 5c64fb67d2 and 87dfc1d23e.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](c62f322169/db/db_iter.cc (L342-L343)) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (0f88160f67, 5c64fb67d2 and 87dfc1d23e.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
Summary:
compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734
Test Plan:
- Added unit tests.
- benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used:
```
./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000
```
In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size.
Compaction stats from this PR:
```
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 2/0 31.54 MB 0.5 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 63.4 135.56 110.94 544 0.249 0 0 0.0 0.0
L4 3/0 96.55 MB 0.8 18.5 6.7 11.8 18.4 6.6 0.0 2.7 65.3 64.9 290.08 284.03 108 2.686 284M 1957K 0.0 0.0
L5 15/0 404.41 MB 1.0 19.1 7.7 11.4 18.8 7.4 0.3 2.5 66.6 65.7 292.93 285.34 220 1.332 293M 3808K 0.0 0.0
L6 143/0 4.12 GB 0.0 45.0 7.5 37.5 41.6 4.1 0.0 5.5 71.2 65.9 647.00 632.66 251 2.578 739M 47M 0.0 0.0
Sum 163/0 4.64 GB 0.0 82.6 21.9 60.7 87.2 26.5 0.3 10.4 61.9 65.4 1365.58 1312.97 1123 1.216 1318M 52M 0.0 0.0
```
Compaction stats from main:
```
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 60.5 142.12 115.89 569 0.250 0 0 0.0 0.0
L4 3/0 85.68 MB 1.0 17.7 6.8 10.9 17.6 6.7 0.0 2.6 62.7 62.3 289.05 281.79 112 2.581 272M 2309K 0.0 0.0
L5 11/0 293.73 MB 1.0 18.8 7.5 11.2 18.5 7.2 0.5 2.5 64.9 63.9 296.07 288.50 220 1.346 288M 4365K 0.0 0.0
L6 130/0 3.94 GB 0.0 51.5 7.6 43.9 47.9 3.9 0.0 6.3 67.2 62.4 784.95 765.92 258 3.042 848M 51M 0.0 0.0
Sum 144/0 4.31 GB 0.0 88.0 21.9 66.0 92.3 26.3 0.5 11.0 59.6 62.5 1512.19 1452.09 1159 1.305 1409M 58M 0.0 0.0```
Reviewed By: ajkr
Differential Revision: D39834713
Pulled By: cbi42
fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b
Summary:
Some users are at least considering using SstPartitioner to support efficient physical migration of specific key ranges between RocksDB instances. One might expect manual `CompactRange()` over a narrow key range across some partition to enforce partitioning of any SST files crossing that partition boundary, but that currently only works if there are keys within that range.
This change makes the overlap logic in CompactRange more aware of the partitioner to automatically select relevant files crossing a partition boundary, even when they otherwise would not be selected due to the compaction range falling in a gap between entries.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11032
Test Plan: unit test included
Reviewed By: hx235
Differential Revision: D41981380
Pulled By: pdillinger
fbshipit-source-id: 2fe445bdddc73c00276c20f295cc1fa33d15b05a
Summary:
the [assertion](c3f720c60d/db/compaction/compaction_outputs.cc (L643)) in `CompactionOutputs::AddRangeDels()` can fail after https://github.com/facebook/rocksdb/pull/10802. The assertion fails when `lower_bound_from_range_tombstone` is true during `AddRangeDels()` for a new compaction output file, while the lower bound range tombstone key has seqno 0 and op_type kTypeRangeDeletion. It can have seqno 0 when it was truncated at a point key whose seqno was zeroed out during compaction, the seqno and op_type could be set [here](c3f720c60d/db/compaction/compaction_outputs.cc (L594)). This PR fixes the assertion excluding the case when `lower_bound_from_range_tombstone` is true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11040
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D42119914
Pulled By: cbi42
fbshipit-source-id: 0897e71b5304cb02aac30f71667b590c37b72baf
Summary:
This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802
Test Plan:
- added unit tests in db_range_del_test.
- stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000`
Reviewed By: ajkr, jay-zhuang
Differential Revision: D40308827
Pulled By: cbi42
fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
Summary:
RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and
releases the internal `DBImpl::log_write_mutex_`.
According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288
> Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_.
This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application
should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing
the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()`
is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`.
There are two issues:
1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after
calling `LockWAL()` because log_write_mutex is not recursive.
2. Two threads may dead lock due to lock order inversion.
To fix these issues, we can modify the implementation of LockWAL so that it does not keep
`log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can
instead manually inject a write stall so that all future writes will be stopped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D41785203
Pulled By: riversand963
fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run 36a5686ec0 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: ajkr
Differential Revision: D41063187
Pulled By: hx235
fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
Summary:
Add a tiered storage migration test which would conflict with
an ongoing penultimate level compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10908
Test Plan: Test only change
Reviewed By: anand1976
Differential Revision: D40864509
Pulled By: ajkr
fbshipit-source-id: e316e849a01a6c71a41be130101f909b6c0498cb
Summary:
Besides the existing ordering and validation, more is coming to VersionBuilder/VersionStorageInfo, like migration of epoch_numbers from older RocksDB versions. We should start using those common classes for importing CFs, instead of duplicating their ordering, validation, and migration logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11028
Test Plan: rely on existing tests
Reviewed By: hx235
Differential Revision: D41865427
Pulled By: ajkr
fbshipit-source-id: 873f5cd87b8902a2380c3b71373ce0b0db3a0c50
Summary:
Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure:
* Magic number error is reported first if we attempt to open an SST file and the footer is completely bad.
* Footer errors are reported with affected file.
* If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009
Test Plan:
unit tests added, some manual
Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed).
Reviewed By: siying
Differential Revision: D41656272
Pulled By: pdillinger
fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df
Summary:
**Context/Summary:**
Credit to ajkr's https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134,
flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to `Flush()` called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true). Fix this by making checking `sync_point_called == true` after obsoleted WAL is found and `SyncWAL()` is called. Also rename the "sync_point_called" to be something more specific.
Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11016
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D41717786
Pulled By: hx235
fbshipit-source-id: ad1e4701a987285bbe6c8e7d9b05c4db06b4edf4
Summary:
when the compaction output file is empty, the assertion in `TimestampTablePropertiesCollector::Finish()` breaks. This PR fixes this assert and added unit test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11015
Test Plan: added UT.
Reviewed By: ajkr
Differential Revision: D41716719
Pulled By: cbi42
fbshipit-source-id: d891d46be4c4805e3d49be6b80c9d75f1bd51080
Summary:
When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997
Test Plan:
1. New unit test in db_basic_test
2. asan_crash
Reviewed By: pdillinger
Differential Revision: D41587244
Pulled By: anand1976
fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64
Summary:
**Context**
`Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown.
`DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.
And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open.
- Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`.
As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown.
**Summary**
This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest.
**Test**
- Make check
- Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after
- [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears
- [Ongoing]Regular CI stress test to confirm such fix didn't harm anything
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892
Reviewed By: ajkr
Differential Revision: D40778965
Pulled By: hx235
fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f
Summary:
**Context/Summary:**
This reverts commit fc74abb436 and related HISTORY record.
The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004
Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D41572604
Pulled By: hx235
fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
Summary:
In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding InternalKey comparator that compares `Slice` for internal key and `ParsedInternalKey` directly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878
Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmarks: I did not get improvement when compiling with DEBUG_LEVEL=0, and saw many noise. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see improvement.
```
# Favorable set up: half of the writes are DeleteRange.
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50
# benchmark command
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25
# main
readseq [AVG 5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec
readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec
# this PR
readseq [AVG 5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec
readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec
```
Reviewed By: ajkr
Differential Revision: D40711170
Pulled By: cbi42
fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f
Summary:
Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966
Test Plan:
- added unit test
- stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100`
- ran different stress tests over sandcastle
- Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.
Reviewed By: ajkr
Differential Revision: D41414172
Pulled By: cbi42
fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
Summary:
Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic.
Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961
Test Plan: updated unit test
Reviewed By: cbi42
Differential Revision: D41370615
Pulled By: ajkr
fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572
Summary:
The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true).
With best_efforts_recovery being true, RocksDB will recover to the latest point in
MANIFEST such that all valid SST files included up to this point pass unique ID checks as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D41378241
Pulled By: riversand963
fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c
Summary:
Since the latency measurement uses real time it is possible for the operation to complete in zero microseconds and then fail these checks. We saw this with the operation that invokes Get() on an invalid CF. This PR relaxes the assertions to allow for operations completing in zero microseconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10979
Reviewed By: riversand963
Differential Revision: D41478300
Pulled By: ajkr
fbshipit-source-id: 50ef096bd8f0162b31adb46f54ae6ddc337d0a5e
Summary:
before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972
Test Plan: added unit test for this scenario.
Reviewed By: ajkr
Differential Revision: D41449215
Pulled By: cbi42
fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967
Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
- Before this fix, Y was always zero
- After this fix, Y gradually increased throughout the benchmark
Reviewed By: riversand963
Differential Revision: D41417726
Pulled By: ajkr
fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
Summary:
Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.
Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965
Test Plan:
unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.
Also manual testing with db_bench. Create db with
```
./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
```
Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
```
./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
```
Seeing warnings / errors or not as expected.
Reviewed By: anand1976
Differential Revision: D41406932
Pulled By: pdillinger
fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954
Test Plan: tests updated
Reviewed By: akankshamahajan15
Differential Revision: D41310123
Pulled By: pdillinger
fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
Summary:
The patch extends the compaction logic to handle `Merge`s in conjunction with wide-column entities. As usual, the merge operation is applied to the anonymous default column, and any other columns are unaffected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10946
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41233722
Pulled By: ltamasi
fbshipit-source-id: dfd9b1362222f01bafcecb139eb48480eb279fed
Summary:
The patch adds `Merge` support for wide-column entities in `DBIter`. As before, the `Merge` operation is applied to the default column of the entity; any other columns are unchanged. As a small cleanup, the PR also changes the signature of `DBIter::Merge` to simply return a boolean instead of the `Merge` operation's `Status` since the actual `Status` is already stored in a member variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10941
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41195471
Pulled By: ltamasi
fbshipit-source-id: 362cf555897296e252c3de5ddfbd569ef34f85ef
Summary:
The patch untangles some nested ifs in `MergeHelper::MergeUntil`. This will come in handy when extending the compaction logic to support `Merge` for wide-column entities, and also enables us to eliminate some repeated branching on value type and to decrease the scope of some variables.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10943
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41201946
Pulled By: ltamasi
fbshipit-source-id: 890bd3d4e31cdccadca614489a94686d76485ba9
Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D41129399
Pulled By: ltamasi
fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
Summary:
`DBIter::saved_value_` stores the result of any `Merge` that was performed to compute the iterator's current value. This value can be ditched whenever the iterator's position is changed, and is already cleared in `Seek`, `SeekForPrev`, `SeekToFirst`, and `SeekToLast`. With the patch, it is also cleared in `Next` and `Prev`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10934
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D41133473
Pulled By: ltamasi
fbshipit-source-id: cf9e936f48151e64e455cc1664d6e9f4a03aa308
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41096453
Pulled By: ltamasi
fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
Summary:
Prevents `MemTableList::PickMemtablesToFlush()` from picking non-consecutive memtables. It leads to wrong ordering in L0 if the files are committed, or an error like below if force_consistency_checks=true catches it:
```
Corruption: force_consistency_checks: VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/25 with seqno 320416 368066 vs. file https://github.com/facebook/rocksdb/issues/24 with seqno 336037 352068
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10921
Test Plan: fix the expectation in the existing test of this behavior
Reviewed By: riversand963
Differential Revision: D41046935
Pulled By: ajkr
fbshipit-source-id: 783696bff56115063d5dc5856dfaed6a9881d1ab
Summary:
When performing Merge during range scan, iterator should understand value types of kDeletionWithTimestamp.
Also add an additional check in debug mode to MergeHelper, and account for the presence of compaction filter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10915
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40960039
Pulled By: riversand963
fbshipit-source-id: dd79d86d7c79d05755bb939a3d94e0c53ddd7f59
Summary:
The patch adds `Merge` support for wide-column entities to the point lookup
APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
update the iterator and compaction logic in separate PRs.) In terms of semantics,
the `Merge` operation is applied to the default (anonymous) column; any other
columns in the entity are unaffected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40962311
Pulled By: ltamasi
fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910
Reviewed By: riversand963
Differential Revision: D40880683
Pulled By: ajkr
fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.
Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```
Some code comments are added for clarity.
Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40603195
Pulled By: riversand963
fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
Summary:
I've tried to compile the main branch, but there are two minor things which are make CE.
I'm not sure about the second one (`num_empty_non_l0_level`), probably there is should be additional assert.
```
-c ../cache/clock_cache.cc
[build] ../cache/clock_cache.cc:855:15: error: variable 'i' set but not used [-Werror,-Wunused-but-set-variable]
[build] for (size_t i = 0; &array_[current] != h; i++) {
[build] ^
```
```
[build] ../db/version_set.cc:3665:7: error: variable 'num_empty_non_l0_level' set but not used [-Werror,-Wunused-but-set-variable]
[build] int num_empty_non_l0_level = 0;
[build] ^
[build] 1 error generated.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10907
Reviewed By: jay-zhuang
Differential Revision: D40866667
Pulled By: ajkr
fbshipit-source-id: 963b7bd56859d0b3b2779cd36fad229425cb7b17
Summary:
When there is a column family that doesn't get any traffic, its stats are still dumped when options.options.stats_dump_period_sec triggers. This sometimes spam the information logs. With this change, we skip the printing if there is not change, until 8 periods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10891
Test Plan: Manually test the behavior with hacked db_bench setups.
Reviewed By: jay-zhuang
Differential Revision: D40777183
fbshipit-source-id: ef0b9a793e4f6282df099b464f01d1fb4c5a2cab
Summary:
Currently, a memtable's stats `num_deletes_` is incremented only if the entry is a regular delete (kTypeDeletion). We need to fix it by accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40740754
Pulled By: riversand963
fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1
Summary:
The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
supported for wide-column entities written using `PutEntity`; this change is
about returning correct (i.e. consistent with `Get`) results in cases like when the
base value is a plain old key-value written using `Put` or when there is no real base
value because we hit either a tombstone or the beginning of history.
Implementation-wise, the patch introduces a new wrapper around the existing
`MergeHelper::TimedFullMerge` that can store the merge result in either a string
(for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40782708
Pulled By: ltamasi
fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
Summary:
The call to `folly::coro::collectAllRange()` should move the input `mget_tasks`. But just in case, assert and clear the std::vector before reusing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10845
Reviewed By: akankshamahajan15
Differential Revision: D40611719
Pulled By: anand1976
fbshipit-source-id: 0f32b387cf5a2894b13389016c020b01ab479b5e
Summary:
FragmentedRangeTombstoneList has a member variable `seq_set_` that contains the sequence numbers of all range tombstones in a set. The set is constructed in `FragmentTombstones()` and is used only in `FragmentedRangeTombstoneList::ContainsRange()` which only happens during compaction. This PR moves the initialization of `seq_set_` to `FragmentedRangeTombstoneList::ContainsRange()`. This should speed up `FragmentTombstones()` when the range tombstone list is used for read/scan requests. Microbench shows the speed improvement to be ~45%.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10848
Test Plan:
- Existing tests and stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5`.
- Microbench: update `range_del_aggregator_bench` to benchmark speed of `FragmentTombstones()`:
```
./range_del_aggregator_bench --num_range_tombstones=1000 --tombstone_start_upper_bound=50000000 --num_runs=10000 --tombstone_width_mean=200 --should_deletes_per_run=100 --use_compaction_range_del_aggregator=true
Before this PR:
=========================
Fragment Tombstones: 270.286 us
AddTombstones: 1.28933 us
ShouldDelete (first): 0.525528 us
ShouldDelete (rest): 0.0797519 us
After this PR: time to fragment tombstones is pushed to AddTombstones() which only happen during compaction.
=========================
Fragment Tombstones: 149.879 us
AddTombstones: 102.131 us
ShouldDelete (first): 0.565871 us
ShouldDelete (rest): 0.0729444 us
```
- db_bench: this should improve speed for fragmenting range tombstones for mutable memtable:
```
./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=250000 --disable_auto_compactions --max_num_range_tombstones=100000 --finish_after_writes --write_buffer_size=1073741824 --threads=25
Before this PR:
readwhilewriting : 18.301 micros/op 1310445 ops/sec 4.769 seconds 6250000 operations; 28.1 MB/s (41001 of 250000 found)
After this PR:
readwhilewriting : 16.943 micros/op 1439376 ops/sec 4.342 seconds 6250000 operations; 23.8 MB/s (28977 of 250000 found)
```
Reviewed By: ajkr
Differential Revision: D40646227
Pulled By: cbi42
fbshipit-source-id: ea471667edb258f67d01cfd828588e80a89e4083
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress
./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```
**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
- [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
- For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
- This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
- For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
- This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
- Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.
Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only
Reviewed By: ajkr
Differential Revision: D40090485
Pulled By: hx235
fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
Summary:
Apply the formatting changes suggested by `clang-format`, except
where they would ruin the ASCII art in `blob_log_format.h`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10856
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D40652224
Pulled By: ltamasi
fbshipit-source-id: 8c1f5757b758474ea3e8102a7c5a1cf9e6dc1402
Summary:
Allow the last level only compaction able to output result to penultimate level if the penultimate level is empty. Which will also block the other compaction output to the penultimate level.
(it includes the PR https://github.com/facebook/rocksdb/issues/10829)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822
Reviewed By: siying
Differential Revision: D40389180
Pulled By: jay-zhuang
fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad
Summary:
Right now UserComparatorWrapper is a Customizable object, although it is not, which introduces some intialization overhead for the object. In some benchmarks, it shows up in CPU profiling. Make it not configurable by defining most functions needed by UserComparatorWrapper to an interface and implement the interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10837
Test Plan: Make sure existing tests pass
Reviewed By: pdillinger
Differential Revision: D40528511
fbshipit-source-id: 70eaac89ecd55401a26e8ed32abbc413a9617c62
Summary:
Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.
`DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`.
New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer.
This same philosophy can be applied to KV and IO tracing as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811
Test Plan:
existing unit tests
Old API DB::StartBlockTrace checked with db_bench tool
create database
```
./db_bench --benchmarks="fillseq" \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000
```
To trace block cache accesses when running readrandom benchmark:
```
./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
--threads=16 \
-block_cache_trace_file="/tmp/binary_trace_test_example" \
-block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
-block_cache_trace_sampling_frequency=1
```
Reviewed By: anand1976
Differential Revision: D40435289
Pulled By: akankshamahajan15
fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
Summary:
Currently, the code in `SaveValue` that handles `kTypeValue` and
`kTypeBlobIndex` (and more recently, `kTypeWideColumnEntity`) is
mostly shared. This made sense originally; however, by now the
handling of these three value types has diverged significantly. The
patch makes the logic cleaner and also eliminates quite a bit of branching
by giving each value type its own `case` and removing a fall-through.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10840
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40568420
Pulled By: ltamasi
fbshipit-source-id: 2e614606afd1c3d9c76d9b5f1efa0959fc174103
Summary:
When the `preclude_last_level_data_seconds` or
`preserve_internal_time_seconds` is smaller than 100 (seconds), no seqno->time information was recorded.
Also make sure all data will be compacted to the last level even if there's no write to record the time information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10829
Test Plan: added unittest
Reviewed By: siying
Differential Revision: D40443934
Pulled By: jay-zhuang
fbshipit-source-id: 2ecf1361daf9f3e5c3385aee6dc924fa59e2813a
Summary:
The patch makes it possible to provide the value of the default column
separately when calling `WideColumnSerialization::Serialize`. This eliminates
the need to construct a new `WideColumns` vector in certain cases
(for example, it will come in handy when implementing `Merge`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10839
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40561448
Pulled By: ltamasi
fbshipit-source-id: 69becdd510e6a83ab1feb956c12772110e1040d6
Summary:
This new property allows users to trigger the background block cache stats collection mode through the `GetProperty()` and `GetMapProperty()` APIs. The background mode has much lower overhead at the expense of returning stale values in more cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10832
Test Plan: updated unit test
Reviewed By: pdillinger
Differential Revision: D40497883
Pulled By: ajkr
fbshipit-source-id: bdcc93402f426463abb2153756aad9e295447343
Summary:
FIFO compaction can theoretically open a DB with any compaction style.
However, the current code only allows FIFO compaction to open a DB with
a single level.
This PR relaxes the limitation of FIFO compaction and allows it to open a
DB with multiple levels. Below is the read / write / compaction behavior:
* The read behavior is untouched, and it works like a regular rocksdb instance.
* The write behavior is untouched as well. When a FIFO compacted DB
is opened with multiple levels, all new files will still be in level 0, and no files
will be moved to a different level.
* Compaction logic is extended. It will first identify the bottom-most non-empty level.
Then, it will delete the oldest file in that level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348
Test Plan:
Added a new test to verify the migration from level to FIFO where the db has multiple levels.
Extended existing test cases in db_test and db_basic_test to also verify
all entries of a key after reopening the DB with FIFO compaction.
Reviewed By: jay-zhuang
Differential Revision: D40233744
fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.
For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828
Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI
Reviewed By: hx235
Differential Revision: D40447634
Pulled By: pdillinger
fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
Summary:
The motivation for this change is a planned feature (related to HyperClockCache) that will depend on a large array that can essentially grow automatically, up to some bound, without the pointer address changing and with guaranteed zero-initialization of the data. Anonymous mmaps provide such functionality, and this change provides an internal API for that.
The other existing use of anonymous mmap in RocksDB is for allocating in huge pages. That code and other related Arena code used some awkward non-RAII and pre-C++11 idioms, so I cleaned up much of that as well, with RAII, move semantics, constexpr, etc.
More specifcs:
* Minimize conditional compilation
* Add Windows support for anonymous mmaps
* Use std::deque instead of std::vector for more efficient bag
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10810
Test Plan: unit test added for new functionality
Reviewed By: riversand963
Differential Revision: D40347204
Pulled By: pdillinger
fbshipit-source-id: ca83fcc47e50fabf7595069380edd2954f4f879c
Summary:
This reverts https://github.com/facebook/rocksdb/issues/10792 and uses a different strategy to stabilize the test: remove the unnecessary randomness by providing a constant seed for shuffling keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10816
Test Plan: `gtest-parallel ./db_bloom_filter_test -r1000 --gtest_filter=*ForHits*`
Reviewed By: jay-zhuang
Differential Revision: D40347957
Pulled By: pdillinger
fbshipit-source-id: a270e157485cbd94ed03b80cdd21b954ebd57d57
Summary:
Lock the penultimate level for the whole compaction inputs range, so any
key in that compaction is safe to move up from the last level to
penultimate level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782
Reviewed By: siying
Differential Revision: D40231540
Pulled By: siying
fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e
Summary:
This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state.
Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796
Test Plan: tests added/updated, that fail before the change(s) and now pass
Reviewed By: jay-zhuang
Differential Revision: D40232645
Pulled By: jay-zhuang
fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be
Summary:
Provide support for async_io if ReadOptions.tailing is set true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10781
Test Plan:
- Update unit tests
- Ran db_bench: ./db_bench --benchmarks="readrandom" --use_existing_db --use_tailing_iterator=1 --async_io=1
Reviewed By: anand1976
Differential Revision: D40128882
Pulled By: anand1976
fbshipit-source-id: 55e17855536871a5c47e2de92d238ae005c32d01
Summary:
The test may fail because the L5 files may only cover small portion of the whole key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10792
Test Plan:
```
gtest-parallel ./db_bloom_filter_test --gtest_filter=DBBloomFilterTest.OptimizeFiltersForHits -r 1000 -w 100
```
Reviewed By: siying
Differential Revision: D40217600
Pulled By: siying
fbshipit-source-id: 18db549184bccf5e513eaa7e31ab17385b71ef71
Summary:
Add option `preserve_internal_time_seconds` to preserve the internal
time information.
It's mostly for the migration of the existing data to tiered storage (
`preclude_last_level_data_seconds`). When the tiering feature is just
enabled, the existing data won't have the time information to decide if
it's hot or cold. Enabling this feature will start collect and preserve
the time information for the new data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747
Reviewed By: siying
Differential Revision: D39910141
Pulled By: siying
fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
Summary:
When checking if a range [start, end) overlaps with a compaction whose range is [start1, end1), always exclude timestamp from start, end, start1 and end1, otherwise some versions of one user key may be compacted to bottommost layer while others remain in the original level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10787
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40187672
Pulled By: ltamasi
fbshipit-source-id: 81226267fd3e33ffa79665c62abadf2ebec45496
Summary:
To make it consistent with the compaction picker which uses the `sstableKeyCompare()` to pick the overlap files. For example, without this change, it may cut L1 files like:
```
L1: [2-21] [22-30]
L2: [1-10] [21-30]
```
Because "21" on L1 is smaller than "21" on L2. But for compaction, these 2 files are overlapped.
`sstableKeyCompare()` also take range delete into consideration which may cut file for the same key.
It also makes the `max_compaction_bytes` calculation more accurate for cases like above, the overlapped bytes was under estimated. Also make sure the 2 keys won't be splitted to 2 files because of reaching `max_compaction_bytes`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10763
Reviewed By: cbi42
Differential Revision: D39971904
Pulled By: cbi42
fbshipit-source-id: bcc309e9c3dc61a8f50667a6f633e6132c0154a8
Summary:
As the first step of covering the wide-column functionality of iterators
in our stress tests, the patch adds verification logic to
`NonBatchedOpsStressTest::VerifyDb` that checks whether the
iterator's value and columns are in sync. Note: I plan to update the other
types of stress tests and add similar verification for prefix scans etc.
in separate PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783
Test Plan: Ran some simple blackbox crash tests.
Reviewed By: riversand963
Differential Revision: D40152370
Pulled By: riversand963
fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35
Summary:
With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773
Test Plan:
Reproduce: D39993203 has detailed steps.
Run the test with and without the fix.
Reviewed By: cbi42
Differential Revision: D40077955
Pulled By: cbi42
fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240
Summary:
fix for https://github.com/facebook/rocksdb/issues/10752 where RocksDB could be in an infinite compaction loop (with compaction reason kBottommostFiles) if allow_ingest_behind is enabled and the bottommost level is unfilled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767
Test Plan: Added a unit test to reproduce the compaction loop.
Reviewed By: ajkr
Differential Revision: D40031861
Pulled By: ajkr
fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b
Summary:
Fix a bug in Iterator::Refresh() where the local SV it obtained could be obsolete upon return, and should be cleaned up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10770
Test Plan: added a unit test to reproduce the issue.
Reviewed By: ajkr
Differential Revision: D40063809
Pulled By: ajkr
fbshipit-source-id: 619e728eb0f1ac9540b4d0ad38e43acc37a514b2
Summary:
When `FlushOptions::wait` is set to false, manual flush should not stall forever.
If the database has already stopped writes, then the thread calling `DB::Flush()` with
`FlushOptions::wait=false` should not enter the `DBImpl::write_thread_`.
To prevent this, we should do a check at the beginning and return `TryAgain()`
Resolves: https://github.com/facebook/rocksdb/issues/9892
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10001
Reviewed By: siying
Differential Revision: D36422303
Pulled By: siying
fbshipit-source-id: 723bd3065e8edc4f17c82449d0d6b95a2381ac0a
Summary:
For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725
Reviewed By: ajkr
Differential Revision: D39771069
Pulled By: jay-zhuang
fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd
Summary:
This is intended as a step toward possibly separating secondary cache integration from the
Cache implementation as much as possible, to (hopefully) minimize code duplication in
adding secondary cache support to HyperClockCache.
* Major clarifications to API docs of secondary cache compatible parts of Cache. For example, previously the docs seemed to suggest that Wait() was not needed if IsReady()==true. And it wasn't clear what operations were actually supported on pending handles.
* Add some assertions related to these requirements, such as that we don't Release() before Wait() (which would leak a secondary cache handle).
* Fix a leaky abstraction with dummy handles, which are supposed to be internal to the Cache. Previously, these just used value=nullptr to indicate dummy handle, which meant that they could be confused with legitimate value=nullptr cases like cache reservations. Also fixed blob_source_test which was relying on this leaky abstraction.
* Drop "incomplete" terminology, which was another name for "pending".
* Split handle flags into "mutable" ones requiring mutex and "immutable" ones which do not. Because of single-threaded access to pending handles, the "Is Pending" flag can be in the "immutable" set. This allows removal of a TSAN work-around and removing a mutex acquire-release in IsReady().
* Remove some unnecessary handling of charges on handles of failed lookups. Keeping total_charge=0 means no special handling needed. (Removed one unnecessary mutex acquire/release.)
* Simplify handling of dummy handle in Lookup(). There is no need to explicitly Ref & Release w/Erase if we generally overwrite the dummy anyway. (Removed one mutex acquire/release, a call to Release().)
Intended follow-up:
* Clarify APIs in secondary_cache.h
* Doesn't SecondaryCacheResultHandle transfer ownership of the Value() on success (implementations should not release the value in destructor)?
* Does Wait() need to be called if IsReady() == true? (This would be different from Cache.)
* Do Value() and Size() have undefined behavior if IsReady() == false?
* Why have a custom API for what is essentially a std::future<std::pair<void*, size_t>>?
* Improve unit testing of standalone handle case
* Apparent null `e` bug in `free_standalone_handle` case
* Clean up secondary cache testing in lru_cache_test
* Why does TestSecondaryCacheResultHandle hold on to a Cache::Handle?
* Why does TestSecondaryCacheResultHandle::Wait() do nothing? Shouldn't it establish the post-condition IsReady() == true?
* (Assuming that is sorted out...) Shouldn't TestSecondaryCache::WaitAll simply wait on each handle in order (no casting required)? How about making that the default implementation?
* Why does TestSecondaryCacheResultHandle::Size() check Value() first? If the API is intended to be returning 0 before IsReady(), then that is weird but should at least be documented. Otherwise, if it's intended to be undefined behavior, we should assert IsReady().
* Consider replacing "standalone" and "dummy" entries with a single kind of "weak" entry that deletes its value when it reaches zero refs. Suppose you are using compressed secondary cache and have two iterators at similar places. It will probably common for one iterator to have standalone results pinned (out of cache) when the second iterator needs those same blocks and has to re-load them from secondary cache and duplicate the memory. Combining the dummy and the standalone should fix this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10730
Test Plan:
existing tests (minor update), and crash test with sanitizers and secondary cache
Performance test for any regressions in LRUCache (primary only):
Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test before & after (run at same time) with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X100] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=233000000 -duration 30 -threads=16
```
Before: readrandom [AVG 100 runs] : 22234 (± 63) ops/sec; 1.6 (± 0.0) MB/sec
After: readrandom [AVG 100 runs] : 22197 (± 64) ops/sec; 1.6 (± 0.0) MB/sec
That's within 0.2%, which is not significant by the confidence intervals.
Reviewed By: anand1976
Differential Revision: D39826010
Pulled By: anand1976
fbshipit-source-id: 3202b4a91f673231c97648ae070e502ae16b0f44
Summary:
Add new property "do_not_recurse" in IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files.
By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open.
Provided support in PosixFileSystem to use "do_not_recurse".
TestPlan:
- Existing tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668
Reviewed By: anand1976
Differential Revision: D39471683
Pulled By: akankshamahajan15
fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case.
| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|
Reviewed By: riversand963
Differential Revision: D39441192
Pulled By: cbi42
fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
- As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
Summary:
currently, there are places in compaction_picker where we add up `compensated_file_size` of files being compacted and limit the sum to be under `max_compaction_bytes`. `compensated_file_size` contains booster for point tombstones and should be used only for determining file's compaction priority. This PR replaces `compensated_file_size` with actual file size in such places.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10728
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D39789427
Pulled By: cbi42
fbshipit-source-id: 1f89fb6c0159c53bf01d8dc783f465959f442c81
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.
In level compaction, there are "wasted" data at the beginning and end of the
output level files. Align the file boundary can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to its
next level ones. It may cut file when the file size is large enough (at least
50% of target_file_size) and not too large (2x target_file_size).
db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432
# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds
# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```
The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
baseline this_PR
count 1.656000e+03 1.705000e+03
mean 3.116062e+07 3.028076e+07
std 7.145242e+06 8.046139e+06
```
The feature is enabled by default, to revert to the old behavior disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`
Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for
use case like user adding 2 or more non-overlapping data range at the same
time, it can reduce the overlapping of 2 datasets in the lower levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655
Reviewed By: cbi42
Differential Revision: D39552321
Pulled By: jay-zhuang
fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
Summary:
when a new internal iterator is constructed during iterator refresh, pointer to the previous memtable range tombstone iterator was not cleared. This could cause segfault for future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739
Test Plan: added a unit test in db_range_del_test.cc to reproduce this issue.
Reviewed By: ajkr, riversand963
Differential Revision: D39825283
Pulled By: cbi42
fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c
Summary:
Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724
Test Plan: make check
Reviewed By: cbi42
Differential Revision: D39756847
Pulled By: riversand963
fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876
Summary:
We have a lot of confusing code because of mixed, sometimes
completely opposite uses of of the term "raw block" or "raw contents",
sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
`raw_block_contents` and `raw_size` generally referred to uncompressed block
contents and size, while `WriteRawBlock` referred to writing a block that
is already compressed if it is going to be. Meanwhile, in
`BlockBasedTable`, `raw_block_contents` either referred to a (maybe
compressed) block with trailer, or a maybe compressed block maybe
without trailer. (Note: left as follow-up work to use C++ typing to
better sort out the various kinds of BlockContents.)
This change primarily tries to apply some consistent terminology around
the kinds of block representations, avoiding the unclear "raw". (Any
meaning of "raw" assumes some bias toward the storage layer or toward
the logical data layer.) Preferred terminology:
* **Serialized block** - bytes that go into storage. For block-based table
(usually the case) this includes the block trailer. WART: block `size` may or
may not include the trailer; need to be clear about whether it does or not.
* **Maybe compressed block** - like a serialized block, but without the
trailer (or no promise of including a trailer). Must be accompanied by a
CompressionType.
* **Uncompressed block** - "payload" bytes that are either stored with no
compression, used as input to compression function, or result of
decompression function.
* **Parsed block** - an in-memory form of a block in block cache, as it is
used by the table reader. Different C++ types are used depending on the
block type (see block_like_traits.h).
Other refactorings:
* Misc corrections/improvements of internal API comments
* Remove a few misleading / unhelpful / redundant comments.
* Use move semantics in some places to simplify contracts
* Use better parameter names to indicate which parameters are used for
outputs
* Remove some extraneous `extern`
* Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408
Test Plan: existing tests
Reviewed By: akankshamahajan15
Differential Revision: D38172617
Pulled By: pdillinger
fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
Summary:
Change the library order in PLATFORM_LDFLAGS to enable fbcode platform 10 build with folly. This PR also has a few fixes for platform 10 compiler errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10708
Test Plan:
ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_COROUTINES=1 make -j64 check
ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_FOLLY=1 make -j64 check
Reviewed By: ajkr
Differential Revision: D39666590
Pulled By: anand1976
fbshipit-source-id: 256a1127ef561399cd6299a6a392ca29bd68ca44
Summary:
when there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap memtable iterator directly. Currently we create a merging iterator on top of the memtable iterator, and have DBIter wrap around it. This causes iterator regression and this PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705
Test Plan:
- `make check`
- Performance:
- Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000`
- Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts`
```
seek_nexts main op/sec https://github.com/facebook/rocksdb/issues/10705 RocksDB v7.6
0 5746568 5749033 5786180
30 2411690 3006466 2837699
1000 102556 128902 124667
```
Reviewed By: ajkr
Differential Revision: D39644221
Pulled By: cbi42
fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d
Summary:
The background compaction may still running while the test end, which would cause ASAN stack-use-after-scope error.
Explicitly close the DB before test end.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10697
Test Plan:
able to reproduce with:
```
gtest-parallel ./compaction_service_test --gtest_filter=CompactionServiceTest.BasicCompactions -r 10000 -w 100
```
Reviewed By: gitbw95
Differential Revision: D39590974
Pulled By: jay-zhuang
fbshipit-source-id: da264b2e6a276afbda7d5ff7adb9d7b8d4213d90
Summary:
Fix invalid reference in MultiGet due to resizing of the ```batches``` autovector.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10702
Test Plan: Run asan crash test
Reviewed By: riversand963
Differential Revision: D39608753
Pulled By: anand1976
fbshipit-source-id: 7a9e7fc6f436f08eb22003d0e6b0e1e4dcdc1a2a
Summary:
The assertion in ```FilePickerMultiGet::ReplaceRange()``` was incorrect. The function should only be called to replace the range after finishing the search in the current level, which is indicated by ```hit_file_ == nullptr``` i.e no more overlapping files in this level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10695
Reviewed By: gitbw95
Differential Revision: D39583217
Pulled By: anand1976
fbshipit-source-id: d4cedfb2b62fb9f3a083e9848a403ae6342f0519
Summary:
This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
* We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
* We want to highlight the key feature: it's fast (especially under parallel load)
* Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
* We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
Some API detail:
* To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
* Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
For performance tests see https://github.com/facebook/rocksdb/pull/10626
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
Test Plan: no interesting functional changes; tests updated
Reviewed By: anand1976
Differential Revision: D39547800
Pulled By: pdillinger
fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
Summary:
The stats were not accurate for the coroutine version of MultiGet. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673
Reviewed By: akankshamahajan15
Differential Revision: D39492615
Pulled By: anand1976
fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3
Summary:
Fix a bug in the async IO/coroutine version of MultiGet that may cause a segfault or assertion failure due to accessing an invalid file index in a LevelFilesBrief. The bug is that when a MultiGetRange is split into two, we may re-process keys in the original range that were already marked to be skipped (in ```current_level_range_```) due to not overlapping the level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10688
Reviewed By: gitbw95
Differential Revision: D39556131
Pulled By: anand1976
fbshipit-source-id: 65e79438508a283cb19e64eca5c91d0714b81458
Summary:
When user-defined timestamp is enabled, subcompaction bounds should be set up properly. When creating InputIterator for the compaction, the `start` and `end` should have their timestamp portions set to kMaxTimestamp, which is the highest possible timestamp. This is similar to what we do with setting up their sequence numbers to `kMaxSequenceNumber`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10658
Test Plan:
```bash
make check
rm -rf /dev/shm/rocksdb/* && mkdir
/dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress
--allow_data_in_errors=True --clear_column_family_one_in=0
--continuous_verification_interval=0 --data_block_index_type=1
--db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
--delrangepercent=0
--expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
--iterpercent=0 --max_background_compactions=20
--max_bytes_for_level_base=10485760 --max_key=25000000
--max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
--ops_per_thread=300000 --paranoid_file_checks=1 --partition_filters=0
--prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
--snapshot_hold_ops=100000 --subcompactions=4
--target_file_size_base=65536 --target_file_size_multiplier=2
--test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
--user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
--write_buffer_size=65536 --writepercent=60 -disable_wal=1
-column_families=1
```
Reviewed By: akankshamahajan15
Differential Revision: D39393402
Pulled By: riversand963
fbshipit-source-id: f276e35b19fce51a175c368a502fb0718d1f3871
Summary:
fix a data race introduced in https://github.com/facebook/rocksdb/issues/10547 (P5295241720), first reported by pdillinger. The race is between the `std::atomic_load_explicit` in NewRangeTombstoneIteratorInternal and the `std::atomic_store_explicit` in MemTable::Add() that operate on `cached_range_tombstone_`. P5295241720 shows that `atomic_store_explicit` initializes some mutex which `atomic_load_explicit` could be trying to call `lock()` on at the same time. This fix moves the initialization to memtable constructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10680
Test Plan: `USE_CLANG=1 COMPILE_WITH_TSAN=1 make -j24 whitebox_crash_test`
Reviewed By: ajkr
Differential Revision: D39528696
Pulled By: cbi42
fbshipit-source-id: ee740841044438e18ad2b8ea567444dd542dd8e2
Summary:
The patch extends the iterator API with a new `columns` method which
can be used to retrieve all wide columns for the current key. Similarly to
the `Get` and `GetEntity` APIs, the classic `value` API returns the value
of the default (anonymous) column for wide-column entities, and `columns`
returns an entity with a single default column for plain old key-values.
(The goal here is to maintain the invariant that `value()` is the same as
the value of the default column in `columns()`.) The patch also involves a
smaller refactoring: historically, `value()` was implemented using a bunch
of conditions, that is, the `Slice` to be returned was decided based on the
direction of the iteration, whether a merge had been done etc. when the
method was called; with the patch, the value to be exposed is stored in a
member `Slice value_` when the iterator lands on a new key, and `value()`
simply returns this `Slice`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10670
Test Plan: Ran `make check` and a simple blackbox crash test.
Reviewed By: riversand963
Differential Revision: D39475551
Pulled By: ltamasi
fbshipit-source-id: 29e7a6ed9ef340841aab36803b832b7c8f668b0b
Summary:
Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (https://github.com/facebook/rocksdb/issues/10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.
The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (https://github.com/facebook/rocksdb/issues/10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547
Test Plan:
- TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used.
micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom.
| |fillrandom main | write path caching | read path caching |memtable V3 (https://github.com/facebook/rocksdb/issues/10308) | readrandom main | write path caching | read path caching |memtable V3 |
|--- |--- |--- |--- |--- | --- | --- | --- | --- |
| 0 |6.35 |6.15 |5.82 |6.12 |2.24 |2.26 |2.03 |2.07 |
| 1 |5.99 |5.88 |5.77 |6.28 |2.65 |2.27 |2.24 |2.5 |
| 10 |6.15 |6.02 |5.92 |5.95 |5.15 |2.61 |2.31 |2.53 |
| 100 |5.95 |5.78 |5.88 |6.23 |28.31 |2.34 |2.45 |2.94 |
| 100 25 threads |52.01 |45.85 |46.18 |47.52 |35.97 |3.34 |3.34 |3.56 |
| 1000 |6.0 |7.07 |5.98 |6.08 |333.18 |2.86 |2.7 |3.6 |
| 1000 25 threads |52.6 |148.86 |79.06 |45.52 |473.49 |3.66 |3.48 |4.38 |
- Benchmark performance of`readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`
readrandom micros/op:
| |main |write path caching |read path caching |memtable V3 |
|---|---|---|---|---|
| single thread |48.28 |1.55 |1.52 |1.96 |
| 25 threads |64.3 |2.55 |2.67 |2.64 |
Reviewed By: ajkr
Differential Revision: D38895410
Pulled By: cbi42
fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
Summary:
**Context/Summary:**
Previous experience with bugs and flaky tests taught us there exist features in RocksDB vulnerable to race condition caused by acquiring db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeup and sleep to cause acquiring db mutex at various timing in order to expose such race condition
**Testing:**
- `COERCE_CONTEXT_SWITCH=1 make -j56 check / make -j56 db_stress` should reveal
- flaky tests caused by db mutex related race condition
- Reverted https://github.com/facebook/rocksdb/pull/9528
- A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1` followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
- `COERCE_CONTEXT_SWITCH=1` can cause expected test failure (i.e, expose target TSAN data race error) within 10 run while the other couldn't.
- This proves our injection can expose flaky tests caused by db mutex related race condition faster.
- known or new race-condition-type of internal bug by continuously running this PR
- Performance
- High ops-threads time: COERCE_CONTEXT_SWITCH=1 regressed by 4 times slower (2:01.16 vs 0:22.10 elapsed ). This PR will be run as a separate CI job so this regression won't affect any existing job.
```
TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \
--write_buffer_size=524288 —target_file_size_base=524288 —ingest_external_file_one_in=100 —compact_files_one_in=1000 —compact_range_one_in=1000
```
- Start-up time: COERCE_CONTEXT_SWITCH=1 didn't regress by 25% (0:01.51 vs 0:01.29 elapsed)
```
TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10291
Reviewed By: ajkr
Differential Revision: D39231182
Pulled By: hx235
fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6
Summary:
Historically, `BlobFileReader` has returned the blob(s) read from the file
in the `PinnableSlice` provided by the client. This interface was
preserved when caching was implemented for blobs, which meant that
the blob data was copied multiple times when caching was in use: first,
into the client-provided `PinnableSlice` (by `BlobFileReader::SaveValue`),
and then, into the object stored in the cache (by `BlobSource::PutBlobIntoCache`).
The patch eliminates these copies and the related allocations by changing
`BlobFileReader` so it returns its results in the form of heap-allocated `BlobContents`
objects that can be directly inserted into the cache. The allocations backing
these `BlobContents` objects are made using the blob cache's allocator if the
blobs are to be inserted into the cache (i.e. if a cache is configured and the
`fill_cache` read option is set). Note: this PR focuses on the common case when
blobs are compressed; some further small optimizations are possible for uncompressed
blobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10647
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D39335185
Pulled By: ltamasi
fbshipit-source-id: 464503d60a5520d654c8273ffb8efd5d1bcd7b36
Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in and with an extra pass at DB::Open
time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
check unique ID against manifest every time an SST file is opened through
table cache (normal DB operations), replacing the explicit pass over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.
One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.
Implementation details:
* `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
that is now removed.
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) default checking on, and
(b) checking on SST open even without DB::Open (e.g. on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`
Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of NewTableReader was recently added, in
the public API nonetheless (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532
Test Plan:
updated unit tests
Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec
Reviewed By: jay-zhuang
Differential Revision: D38765551
Pulled By: pdillinger
fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
Summary:
**Summary:**
When a block is firstly `Lookup` from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again, we erase it from the secondary cache and add it into the primary cache.
When a block is firstly evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.
**Implementation Details**
Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0. (refs >= 1 && in_cache == false && IS_STANDALONE == true)
The behaviors of `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache:
1. If a handle is found in primary cache:
1.1. If the handle's value is not nullptr, it is returned immediately.
1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
- 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
- 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
2. If a handle is not found in primary cache:
2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
2.2. If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block) and return a standalone handle.
The behaviors of `LRUCacheShard::Promote()` are updated as follows:
1. If `e->sec_handle` has value, one of the following steps can happen:
1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
1.2. Insert the item into the primary cache and return the handle to caller.
1.3. Exception handling.
3. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released.
The behavior of `CompressedSecondaryCache::Insert()` is updated:
1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
4. If a dummy item is found for a block, the block is inserted into the secondary cache.
The behavior of `CompressedSecondaryCache:::Lookup()` is updated:
1. If a handle is not found or it is a dummy item, a nullptr is returned.
2. If `erase_handle` is true, the handle is erased.
The behaviors of `LRUCacheShard::Release()` are adjusted for the standalone handles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527
Test Plan:
1. stress tests.
5. unit tests.
6. CPU profiling for db_bench.
Reviewed By: siying
Differential Revision: D38747613
Pulled By: gitbw95
fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
Summary:
Example failure:
```
db/blob/db_blob_basic_test.cc:226: Failure
Expected equality of these values:
i
Which is: 1
num_blobs
Which is: 5
```
I can't repro locally, but it looks like the 2KB cache is too small to guarantee no eviction happens between loading all the data into cache and reading from `kBlockCacheTier`. This 2KB setting appears to have come from a test where the cached entries are pinned, where it makes sense to have a small setting. However, such a small setting makes less sense when the blocks are evictable but must remain cached per the test's expectation. This PR increases the capacity setting to 2MB for those cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10636
Reviewed By: cbi42
Differential Revision: D39250976
Pulled By: ajkr
fbshipit-source-id: 769309f9a19cfac20b67b927805c8df5c1d2d1f5
Summary:
Example flake: https://app.circleci.com/pipelines/github/facebook/rocksdb/17660/workflows/7a891875-f07b-4a67-b204-eaa7ca9f9aa2/jobs/467496
The test could get stuck in out-of-space due to a callback executing `SetFilesystemActive(false /* active */)` after the test executed `SetFilesystemActive(true /* active */)`. This could happen because background info logging went through the SyncPoint callback "WritableFileWriter::Append:BeforePrepareWrite", probably unintentionally. The solution of this PR is to call `ClearAllCallBacks()` to wait for any such pending callbacks to drain before calling `SetFilesystemActive(true /* active */)`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10642
Reviewed By: cbi42
Differential Revision: D39265381
Pulled By: ajkr
fbshipit-source-id: 9a2f4916ab19726c8fb4b3a3b590b1b9ed93de1b
Summary:
Previously, automatic compaction could be triggered prior to the test invoking CompactRange(). It could lead to the following flaky failure:
```
/root/project/db/db_block_cache_test.cc:753: Failure
Expected equality of these values:
1 + kNumBlocks
Which is: 11
options.statistics->getTickerCount(BLOCK_CACHE_INDEX_ADD)
Which is: 10
```
A sequence leading to this failure was:
* Automatic compaction
* files [1] [2] trivially moved
* files [3] [4] [5] [6] trivially moved
* CompactRange()
* files [7] [8] [9] trivially moved
* file [10] trivially moved
In such a case, the index/filter block adds that the test expected did not happen since there were no new files.
This PR just tweaks settings to ensure the `CompactRange()` produces one new file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10635
Reviewed By: cbi42
Differential Revision: D39250869
Pulled By: ajkr
fbshipit-source-id: a3c94c49069e28c49c40b4b80dae0059739d19fd
Summary:
Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and check with `RangeDelAggregator`.
With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator:
- in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
- in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.
This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.
One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap is covered by the range tombstone's internal key range `[start, end)`.
Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file are iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advance the file iterator once the sentinel key is processed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449
Test Plan:
- Added many unit tests in db_range_del_test
- Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
- Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.
```
python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
```
- Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written". As expected, the performance with this PR depends on the range tombstone width.
```
# Setup:
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
# Scan entire DB
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
# Short range scan (10 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
# Long range scan(1000 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
```
Avg over of 10 runs (some slower tests had fews runs):
For the first column (tombstone), 0 means no range tombstone, 100-10000 means width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1 tombstone case is to test regression when there's very few range tombstones in the DB, as no range tombstone is likely to take a different code path than with range tombstones.
- Scan entire DB
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |2525600 (± 43564) |2486917 (± 33698) |-1.53% |
| 100 |1853835 (± 24736) |2073884 (± 32176) |+11.87% |
| 1000 |422415 (± 7466) |1115801 (± 22781) |+164.15% |
| 10000 |22384 (± 227) |227919 (± 6647) |+918.22% |
| 1 range tombstone |2176540 (± 39050) |2434954 (± 24563) |+11.87% |
- Short range scan
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |35398 (± 533) |35338 (± 569) |-0.17% |
| 100 |28276 (± 664) |31684 (± 331) |+12.05% |
| 1000 |7637 (± 77) |25422 (± 277) |+232.88% |
| 10000 |1367 |28667 |+1997.07% |
| 1 range tombstone |32618 (± 581) |32748 (± 506) |+0.4% |
- Long range scan
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |2262 (± 33) |2353 (± 20) |+4.02% |
| 100 |1696 (± 26) |1926 (± 18) |+13.56% |
| 1000 |410 (± 6) |1255 (± 29) |+206.1% |
| 10000 |25 |414 |+1556.0% |
| 1 range tombstone |1957 (± 30) |2185 (± 44) |+11.65% |
- Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61
Reviewed By: ajkr
Differential Revision: D38450331
Pulled By: cbi42
fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
Summary:
With the current code, when a blob isn't found in the cache and gets read
from the blob file and then inserted into the cache, the application gets
passed the self-contained `PinnableSlice` resulting from the blob file read.
The patch changes this so that the `PinnableSlice` pins the cache entry
instead in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10625
Test Plan: `make check`
Reviewed By: pdillinger
Differential Revision: D39220904
Pulled By: ltamasi
fbshipit-source-id: cb9c62881e3523b1e9f614e00bf503bac2fe3b0a
Summary:
This stat was only getting updated in the async (coroutine) version of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10622
Reviewed By: akankshamahajan15
Differential Revision: D39188790
Pulled By: anand1976
fbshipit-source-id: 7e231507f65fc94c8a006c38f79dfba182a2c24a
Summary:
sanity check value for option `memtable_protection_bytes_per_key` in `ColumnFamilyData::ValidateOptions()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10621
Test Plan: `make check`, added unit test in ColumnFamilyTest.
Reviewed By: ajkr
Differential Revision: D39180133
Pulled By: cbi42
fbshipit-source-id: 009e0da3ccb332d1c9e14d20193304610bd4eb8a
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.
This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.
The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```
```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```
**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
- Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
- `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573
Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.
Reviewed By: ajkr
Differential Revision: D39005886
Pulled By: hx235
fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
Summary:
The patch adds a dedicated cache entry role for blob values and switches
to a registered deleter so that blobs show up as a separate bucket
(as opposed to "Misc") in the cache occupancy statistics, e.g.
```
Block cache entry stats(count,size,portion): DataBlock(133515,531.73 MB,13.6866%) BlobValue(1824855,3.10 GB,81.7071%) Misc(1,0.00 KB,0%)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10601
Test Plan: Ran `make check` and tested the cache occupancy statistics using `db_bench`.
Reviewed By: riversand963
Differential Revision: D39107915
Pulled By: ltamasi
fbshipit-source-id: 8446c3b190a41a144030df73f318eeda4398c125
Summary:
Some APIs for getting live files, which are used by Checkpoint
and BackupEngine, can optionally trigger and wait for a flush. These
would deadlock when used on a read-only DB. Here we fix that by assuming
the user wants the overall operation to succeed and is OK without
flushing (because the DB is read-only).
Follow-up work: the same or other issues can be hit by directly invoking
some DB functions that are clearly not appropriate for read-only
instance, but are not covered by overrides in DBImplReadOnly and
CompactedDBImpl. These should be fixed to avoid similar problems on
accidental misuse. (Long term, it would be nice to have a DBReadOnly
class without those members, like BackupEngineReadOnly.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10569
Test Plan: tests updated to catch regression (hang before the fix)
Reviewed By: riversand963
Differential Revision: D38995759
Pulled By: pdillinger
fbshipit-source-id: f5f8bc7123e13cb45bd393dd974d7d6eda20bc68
Summary:
The patch improves the bookkeeping around the memory usage of
cached blobs in two ways: 1) it uses `malloc_usable_size`, which accounts
for allocator bin sizes etc., and 2) it also considers the memory usage
of the `BlobContents` object in addition to the blob itself. Note: some unit
tests had been relying on the cache charge being equal to the size of the
cached blob; these were updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10583
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D39060680
Pulled By: ltamasi
fbshipit-source-id: 3583adce2b4ce6e84861f3fadccbfd2e5a3cc482
Summary:
Timer has a limitation that it cannot re-register a task with the same name,
because the cancel only mark the task as invalid and wait for the Timer thread
to clean it up later, before the task is cleaned up, the same task name cannot
be added. Which makes the task option update likely to fail, which basically
cancel and re-register the same task name. Change the periodic task name to a
random unique id and store it in periodic_task_scheduler.
Also refactor the `periodic_work` to `periodic_task` to make each job function
as a `task`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10379
Test Plan: unittests
Reviewed By: ajkr
Differential Revision: D38000615
Pulled By: jay-zhuang
fbshipit-source-id: e4135f9422e3b53aaec8eda54f4e18ce633a279e
Summary:
The patch introduces a new class called `BlobContents`, which represents
a single uncompressed blob value. We currently use `std::string` for this
purpose; `BlobContents` is somewhat smaller but the primary reason for a
dedicated class is that it enables certain improvements and optimizations
like eliding a copy when inserting a blob into the cache, using custom
allocators, or more control over and better accounting of the memory usage
of cached blobs (see https://github.com/facebook/rocksdb/issues/10484).
(We plan to implement these in subsequent PRs.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10571
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D39000965
Pulled By: ltamasi
fbshipit-source-id: f296eddf9dec4fc3e11cad525b462bdf63c78f96
Summary:
WAL append and switch can both happen between `FlushWAL(true /* sync */)`'s sync operations and its call to `MarkLogsSynced()`. We permit this since locks need to be released for the sync operations. Such an appended/switched WAL is both inactive and incompletely synced at the time `MarkLogsSynced()` processes it.
Prior to this PR, `MarkLogsSynced()` assumed all inactive WALs were fully synced and removed them from consideration for future syncs. That was wrong in the scenario described above and led to the latest append(s) never being synced. This PR changes `MarkLogsSynced()` to only remove inactive WALs from consideration for which all flushed data has been synced.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10560
Test Plan: repro unit test for the scenario described above. Without this PR, it fails on "key2" not found
Reviewed By: riversand963
Differential Revision: D38957391
Pulled By: ajkr
fbshipit-source-id: da77175eba97ff251a4219b227b3bb2d4843ed26
Summary:
In UniversalCompactionBuilder::PickCompactionToReduceSortedRuns, we passed start_level to get compression type and options. I think that is wrong and we should use output_level instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10515
Reviewed By: hx235
Differential Revision: D38611335
Pulled By: ajkr
fbshipit-source-id: bb860caed4b6c6bbde8f75fc50cf875a9f04723d
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535
Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO
No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom : 7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations; 571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom : 7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations; 568.2 MB/s (8271992 of 8271992 found)`
For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom : 5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations; 797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom : 5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations; 790.7 MB/s (11649992 of 11649992 found) `
Reviewed By: akankshamahajan15
Differential Revision: D38774009
Pulled By: anand1976
fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
Summary:
The patch adds a new API `GetEntity` that can be used to perform
wide-column point lookups. It also extends the `Get` code path and
the `MemTable` / `MemTableList` and `Version` / `GetContext` logic
accordingly so that wide-column entities can be served from both
memtables and SSTs. If the result of a lookup is a wide-column entity
(`kTypeWideColumnEntity`), it is passed to the application in deserialized
form; if it is a plain old key-value (`kTypeValue`), it is presented as a
wide-column entity with a single default (anonymous) column.
(In contrast, regular `Get` returns plain old key-values as-is, and
returns the value of the default column for wide-column entities, see
https://github.com/facebook/rocksdb/issues/10483 .)
The result of `GetEntity` is a self-contained `PinnableWideColumns` object.
`PinnableWideColumns` contains a `PinnableSlice`, which either stores the
underlying data in its own buffer or holds on to a cache handle. It also contains
a `WideColumns` instance, which indexes the contents of the `PinnableSlice`,
so applications can access the values of columns efficiently.
There are several pieces of functionality which are currently not supported
for wide-column entities: there is currently no `MultiGetEntity` or wide-column
iterator; also, `Merge` and `GetMergeOperands` are not supported, and there
is no `GetEntity` implementation for read-only and secondary instances.
We plan to implement these in future PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38847474
Pulled By: ltamasi
fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b
Summary:
Previously, the flushes triggered by `WriteBufferManager` could affect
the same CF repeatedly if it happens to get consecutive writes. Such
flushes are not particularly useful for reducing memory usage since
they switch nearly-empty memtables to immutable while they've just begun
filling their first arena block. In fact they may not even reduce the
mutable memory count if they involve replacing one mutable memtable containing
one arena block with a new mutable memtable containing one arena block.
Further, if such switches happen even a few times before a flush finishes,
the immutable memtable limit will be reached and writes will stall.
This PR adds a heuristic to not switch memtables to immutable for CFs
that already have one or more immutable memtables awaiting flush. There
is a memory usage regression if the user continues writing to the same
CF, that DB does not have any CFs eligible for switching, flushes
are not finishing, and the `WriteBufferManager` was constructed with
`allow_stall=false`. Before, it would grow by switching nearly empty
memtables until writes stall. Now, it would grow by filling memtables
until writes stall. This feels like an acceptable behavior change because
users who prefer to stall over violate the memory limit should be using
`allow_stall=true`, which is unaffected by this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6364
Test Plan:
- Command:
`rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num_multi_db=8 -num_column_families=2 -write_buffer_size=4194304 -db_write_buffer_size=16777216 -compression_type=none -statistics=true -target_file_size_base=4194304 -max_bytes_for_level_base=16777216`
- `rocksdb.db.write.stall` count before this PR: 175
- `rocksdb.db.write.stall` count after this PR: 0
Reviewed By: jay-zhuang
Differential Revision: D20167197
Pulled By: ajkr
fbshipit-source-id: 4a64064e9bc33d57c0a35f15547542d0191d0cb7
Summary:
The fix in https://github.com/facebook/rocksdb/issues/10513 was not complete w.r.t range deletion handling. It didn't handle the case where a file with a range tombstone covering a key also overlapped another key in the batch. In that case, ```mget_range``` would be non-empty. However, ```mget_range``` would only have the second key and, therefore, the first key would be skipped when iterating through the range tombstones in ```TableCache::MultiGet```.
Test plan -
1. Add a unit test
2. Run stress tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10534
Reviewed By: akankshamahajan15
Differential Revision: D38773880
Pulled By: anand1976
fbshipit-source-id: dae491dbe52e18bbce5179b77b63f20771a66c00
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461
Reviewed By: siying
Differential Revision: D38672823
Pulled By: ltamasi
fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
Summary:
Fix copyright for two more extra headers to make internal tool happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10525
Reviewed By: jay-zhuang
Differential Revision: D38661390
fbshipit-source-id: ab2d055bfd145dfe82b5bae7a6c25cc338c8de94
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.
See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.
This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.
These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):
```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```
After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```
However, the change does modify (probably weaken) the "guaranteed unique" promise from this
> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86
to this (see https://github.com/facebook/rocksdb/issues/10388)
> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits
I don't think this is a practical concern, though.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394
Test Plan: unit tests updated, see simulation results above
Reviewed By: jay-zhuang
Differential Revision: D38667529
Pulled By: pdillinger
fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
Summary:
The info LOG file does not currently give any direct
information about the existence of old, live snapshots, nor how to
estimate wall time from a sequence number within the scope of LOG
history. This change addresses both with:
* Logging smallest and largest seqnos for generated SST files, which can
help associate sequence numbers with write time (based on flushes).
* Logging oldest_snapshot_seqno for each compaction, which (along with
that seqno info) helps us to determine how much old data might be kept
around for old (leaked?) snapshots. Including the date here I thought might
be excessive.
I wanted to log the date and seqno of the oldest snapshot with periodic
stats, but the current structure of the code doesn't really support that
because `DumpDBStats` doesn't have access to the DB object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10454
Test Plan:
manual inspect LOG from
`KEEP_DB=1 ./db_basic_test --gtest_filter=*CompactBetweenSnapshots*`
Reviewed By: ajkr
Differential Revision: D38326948
Pulled By: pdillinger
fbshipit-source-id: 294918ffc04a419844146cd826045321b4d5c038
Summary:
https://github.com/facebook/rocksdb/pull/10057 caused a regression bug: since the base level size is not adjusted based on L0 size anymore, L0 score might become very large. This makes compaction heavily favor L0->L1 compaction against L1->L2 compaction, and cause in some cases, data stuck in L1 without being moved down. We fix calculating a score of L0 by size(L0)/size(L1) in the case where L0 is large..
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10518
Test Plan: run db_bench against data on tmpfs and watch the behavior of data stuck in L1 goes away.
Reviewed By: ajkr
Differential Revision: D38603145
fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
Summary:
New blobdb has a bug in compaction filter, where `blob_value_` is not reset for next iterated key. This will cause blob_value_ not empty and previous value read from blob is passed into the filter function for next key, even if its value is not in blob. Fixed by reseting regardless of key type.
Test Case:
Add `FilterByValueLength` test case in `DBBlobCompactionTest`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391
Reviewed By: riversand963
Differential Revision: D38629900
Pulled By: ltamasi
fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa
Summary:
Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519
Reviewed By: jay-zhuang
Differential Revision: D38603291
fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
Summary:
Moved linux builds to using docker to avoid CI instability caused by dependency installation site down.
Added the `Dockerfile` which is used to build the image.
The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test).
Also fixed a few issues detected while building this:
* `DestoryDB()` Status not checked for a few tests
* nullptr might be used in `inlineskiplist.cc`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D38554200
Pulled By: jay-zhuang
fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
Summary:
A test in db_block_cache_test.cc was skipping ClockCache due to the 16-byte key length requirement. We fixed this. Along the way, we fixed a bug in ApplyToSomeEntries, which assumed the function being applied could modify handle metadata, and thus took an exclusive reference. This is incompatible with calls that need to inspect every element (including externally referenced ones) to gather stats.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10482
Test Plan: ``make -j24 check``
Reviewed By: anand1976
Differential Revision: D38553073
Pulled By: guidotag
fbshipit-source-id: 0ed63fed4d3b89e5056b35b7091fce579f5647ae
Summary:
A flag in WritableFileWriter is introduced to remember error has happened. Subsequent operations will fail with an assertion. Those operations, except Close() are not supposed to be called anyway. This change will help catch bug in tests and stress tests and limit damage of a potential bug of continue writing to a file after a failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489
Test Plan: Fix existing unit tests and watch crash tests for a while.
Reviewed By: anand1976
Differential Revision: D38473277
fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
Summary:
The main purpose is to make debugging easier without sacrificing performance.
Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to an `uint8_t`,
using the LSB to denote if the compaction iterator is valid and 4 additional bits to denote where
the iterator is set valid inside `NextFromInput()`. Therefore, when the control flow reaches
`PrepareOutput()` and hits assertion there, we can have a better idea of what has gone wrong.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505
Test Plan:
make check
```
TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush
```
The above command has a 'flush' benchmark which uses `CompactionIterator`. I haven't observed any CPU regression or drop in throughput or latency increase.
Reviewed By: ltamasi
Differential Revision: D38551615
Pulled By: riversand963
fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0
Summary:
This PR fixes 2 bugs introduced in https://github.com/facebook/rocksdb/issues/10432 -
1. If the bloom filter returned a negative result for all MultiGet keys in a file, the range tombstones in that file were being ignored, resulting in incorrect results if those tombstones covered a key in a higher level.
2. If all the keys in a file were filtered out in `TableCache::MultiGetFilter`, the table cache handle was not being released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10513
Test Plan: Add a new unit test that fails without this fix
Reviewed By: akankshamahajan15
Differential Revision: D38548739
Pulled By: anand1976
fbshipit-source-id: a741a1e25d2e991d63f038100f126c2dc404a87c
Summary:
We have recently added caching support to BlobDB, and separately,
implemented an optimization where reading blobs from the cache
results in the cache handle being transferred to the target `PinnableSlice`
(as opposed to the contents getting copied). With these changes,
it makes sense to reset the `PinnableSlice` storing the blob value in
`DBIter` as soon as we move to a different iterator position to prevent
us from holding on to the cache handle any longer than necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10490
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38473630
Pulled By: ltamasi
fbshipit-source-id: 84c045ffac76436c6152fd0f5775b007f4051386
Summary:
The patch adds support for wide-column entities to the existing query
APIs (`Get`, `MultiGet`, and iterator). Namely, when during a query a
wide-column entity is encountered, we will return the value of the default
(anonymous) column as the result. Later, we plan to add wide-column
specific query APIs which will enable retrieving entire wide-column entities
or a subset of their columns.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10483
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D38441881
Pulled By: ltamasi
fbshipit-source-id: 6444e79a31aff2470e866698e3a97985bc2b3543
Summary:
Change tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purpose only,
which is behaving the same as `last_level_temperture` and it will be removed in
the next release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471
Test Plan: CI
Reviewed By: siying
Differential Revision: D38450621
Pulled By: jay-zhuang
fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
Summary:
Current universal compaction picker may cause extra size amplification
compaction if there're more hot data on penultimate level. Improve the picker
to skip the last level for size amp calculation if tiered compaction is
enabled, which can
1. avoid extra unnecessary size amp compaction;
2. typically cold tier (the last level) is not size constrained, so skip size
amp for cold tier is intended;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467
Test Plan: CI and added unittest
Reviewed By: siying
Differential Revision: D38391350
Pulled By: jay-zhuang
fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2
Summary:
Similarly to https://github.com/facebook/rocksdb/pull/10457, we now have
to explicitly set the `fill_cache` read option when reading blobs in
`DBIter` to prevent the cache from getting polluted by queries with
`fill_cache` set to false. (Before we added support for a blob cache,
the setting had not made any difference either way.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10492
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38476121
Pulled By: ltamasi
fbshipit-source-id: ea5c5e252f83e4a4e2c74156b37d40308d7e0c80