Summary:
We have recently made some stress test improvements that rely on decoding the "value base" from the values stored in the database. This logic does not currently support the case when some KVs are written by a non-batched ops run and some by a batched ops run. The patch temporarily disables mixing these two.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10821
Reviewed By: riversand963
Differential Revision: D40367326
Pulled By: ltamasi
fbshipit-source-id: 66f2e0cbc097ab6b1f9e4b39b833bd466f1aaab5
Summary:
This has several small improvements.
benchmark.sh
* add BYTES_PER_SYNC as an env variable
* use --prepopulate_block_cache when O_DIRECT is used
* use --undefok to list options that don't work for all 7.x releases
* print "failure" in report.tsv when a benchmark fails
* parse the slightly different throughput line used by db_bench for multireadrandom
* remove the trailing comma for BlobDB size before printing it in report.tsv
* use the last line of the output from /bin/time as there can be more than one line when db_bench has a non-zero exit
* fix more bash lint warnings
* add ",stats" to the --benchmark=... lines to get stats at the end of each benchmark
benchmark_compare.sh
* run revrange immediately after fillseq to let compaction debt get removed
* add --multiread_batched when --benchmarks=multireadrandom is used
* use --benchmarks=overwriteandwait when supported to get a more accurate measure of write-amp
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10803
Test Plan: Run it for leveled, universal and BlobDB
Reviewed By: jay-zhuang
Differential Revision: D40278315
Pulled By: mdcallag
fbshipit-source-id: 793134ddc7d48d05a07436cd8942c375a23983a7
Summary:
This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state.
Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796
Test Plan: tests added/updated, that fail before the change(s) and now pass
Reviewed By: jay-zhuang
Differential Revision: D40232645
Pulled By: jay-zhuang
fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be
Summary:
We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)
This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768
Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
Reviewed By: anand1976
Differential Revision: D40034747
Pulled By: anand1976
fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
Summary:
`SstFileWriter` currently does not support the `PutEntity` API, so in `TestIngestExternalFile` all key-values are written using regular `Put`s. This violates the assumption that whether or not a key corresponds to a plain old key-value or a wide-column entity can be determined by solely looking at the "value base" used when generating the value. The patch fixes this issue by disabling ingestion when `PutEntity` is enabled in the stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10769
Test Plan: Ran a simple blackbox stress test.
Reviewed By: akankshamahajan15
Differential Revision: D40042132
Pulled By: ltamasi
fbshipit-source-id: 93e75ff55545b7b69fa4ddef1d96093c961158a0
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case.
| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|
Reviewed By: riversand963
Differential Revision: D39441192
Pulled By: cbi42
fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
- As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
Summary:
The patch adds the `PutEntity` API to the non-batched, batched, and
CF consistency stress tests. Namely, when the new `db_stress` command
line parameter `use_put_entity_one_in` is greater than zero, one in
N writes on average is performed using `PutEntity` rather than `Put`.
The wide-column entity written has the generated value in its default
column; in addition, it contains up to three additional columns where
the original generated value is divided up between the column name and the
column value (with the column name containing the first k characters of
the generated value, and the column value containing the rest). Whether
`PutEntity` is used (and if so, how many columns the entity has) is completely
determined by the "value base" used to generate the value (that is, there is
no randomness involved). Assuming the same `use_put_entity_one_in` setting
is used across `db_stress` invocations, this enables us to reconstruct and
validate the entity during subsequent `db_stress` runs.
Note that `PutEntity` is currently incompatible with `Merge`, transactions, and
user-defined timestamps; these combinations are currently disabled/disallowed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10760
Test Plan: Ran some batched, non-batched, and CF consistency stress tests using the script.
Reviewed By: riversand963
Differential Revision: D39939032
Pulled By: ltamasi
fbshipit-source-id: eafdf124e95993fb7d73158e3b006d11819f7fa9
Summary:
An add-on to https://github.com/facebook/rocksdb/pull/6818 to complete adding single-level universal compaction to stress/crash testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10732
Test Plan:
- Locally run for 10 min `python3 ./tools/db_crashtest.py whitebox --simple --compaction_style=1 --num_levels=1 -max_key=1000000 -value_size_mult=33 -write_buffer_size=524288 -target_file_size_base=524288 -max_bytes_for_level_base=2097152 --duration=120 --interval=10 --ops_per_thread=1000 --random_kill_odd=887`
- Check LOG to confirm single-level universal compaction is called
- Manual testing and log checking to ensure destroy_db_initially=1 is correctly set across runs with different compaction styles (i.e, in the second half of whitebox testing).
- [ongoing]CI jobs stress test
Reviewed By: ajkr
Differential Revision: D39797612
Pulled By: ajkr
fbshipit-source-id: 16f5c40c3464c57360c06c8305f92118e426149c
Summary:
**Context:**
Prior to this PR, correctness testing with un-sync data loss [disabled](https://github.com/facebook/rocksdb/pull/10605) transaction (`use_txn=1`) thus all of the `txn_write_policy` . This PR improved that by adding support for one policy - WriteCommit (`txn_write_policy=0`).
**Summary:**
They key to this support is (a) handle Mark{Begin, End}Prepare/MarkCommit/MarkRollback in constructing ExpectedState under WriteCommit policy correctly and (b) monitor CI jobs and solve any test incompatibility issue till jobs are stable. (b) will be part of the test plan.
For (a)
- During prepare (i.e, between `MarkBeginPrepare()` and `MarkEndPrepare(xid)`), `ExpectedStateTraceRecordHandler` will buffer all writes by adding all writes to an internal `WriteBatch`.
- On `MarkEndPrepare()`, that `WriteBatch` will be associated with the transaction's `xid`.
- During the commit (i.e, on `MarkCommit(xid)`), `ExpectedStateTraceRecordHandler` will retrieve and iterate the internal `WriteBatch` and finally apply those writes to `ExpectedState`
- During the rollback (i.e, on `MarkRollback(xid)`), `ExpectedStateTraceRecordHandler` will erase the internal `WriteBatch` from the map.
For (b) - one major issue described below:
- TransactionsDB in db stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling back it at the start of the current run, see a historical [PR](6d06be22c0) predated correctness testing.
- And we will verify those processed keys in a recovered db against their expected state.
- However since now we turn on `sync_fault_injection=1` where the expected state is constructed from the trace instead of using the LATEST.state from previous run. The expected state now used to verify those processed keys won't contain UNKNOWN_SENTINEL as they should - see test 1 for a failed case.
- Therefore, we decided to manually update its expected state to be UNKNOWN_SENTINEL as part of the processing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10624
Test Plan:
1. Test exposed the major issue described above. This test will fail without setting UNKNOWN_SENTINEL in expected state during the processing and pass after
```
db=/dev/shm/rocksdb_crashtest_blackbox
exp=/dev/shm/rocksdb_crashtest_expected
dbt=$db.tmp
expt=$exp.tmp
rm -rf $db $exp
mkdir -p $exp
echo "RUN 1"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 0.2
sleep 20
kill $pid
sleep 0.2
echo "RUN 2"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 0.2
sleep 20
kill $pid
sleep 0.2
echo "RUN 3"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
```
2. Manual testing to ensure ExpectedState is constructed correctly during recovery by verifying it against previously crashed TransactionDB's WAL.
- Run the following command to crash a TransactionDB with WriteCommit policy. Then `./ldb dump_wal` on its WAL file
```
db=/dev/shm/rocksdb_crashtest_blackbox
exp=/dev/shm/rocksdb_crashtest_expected
rm -rf $db $exp
mkdir -p $exp
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 30
kill $pid
sleep 1
```
- Run the following command to verify recovery of the crashed db under debugger. Compare the step-wise result with WAL records (e.g, WriteBatch content, xid, prepare/commit/rollback marker)
```
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
```
3. Automatic testing by triggering all RocksDB stress/crash test jobs for 3 rounds with no failure.
Reviewed By: ajkr, riversand963
Differential Revision: D39199373
Pulled By: hx235
fbshipit-source-id: 7a1dec0e3e2ee6ea86ddf5dd19ceb5543a3d6f0c
Summary:
As title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10717
Test Plan:
Unit Tests
CI
Reviewed By: riversand963
Differential Revision: D39700707
Pulled By: gitbw95
fbshipit-source-id: 54de27e695535a50159f5f6467da36aaf21bebae
Summary:
`enable_custom_split_merge` is added for enabling the custom split and merge feature, which split the compressed value into chunks so that they may better fit jemalloc bins.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690
Test Plan:
Unit Tests
Stress Tests
Reviewed By: anand1976
Differential Revision: D39567604
Pulled By: gitbw95
fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7
Summary:
This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
* We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
* We want to highlight the key feature: it's fast (especially under parallel load)
* Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
* We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
Some API detail:
* To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
* Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
For performance tests see https://github.com/facebook/rocksdb/pull/10626
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
Test Plan: no interesting functional changes; tests updated
Reviewed By: anand1976
Differential Revision: D39547800
Pulled By: pdillinger
fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
Summary:
Hopefully, we can re-enable the combination of user-defined timestamp and subcompactions
after https://github.com/facebook/rocksdb/issues/10658.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10689
Test Plan:
Make sure the following succeeds on devserver.
make crash_test_with_ts
Reviewed By: ltamasi
Differential Revision: D39556558
Pulled By: riversand963
fbshipit-source-id: 4695f420b1bc9ebf3b24640b693746f4db82c149
Summary:
The patch makes it possible to use the `JemallocNodumpAllocator` with the
block/blob caches in `db_bench`. In addition to its stated purpose of excluding
cache contents from core dumps, `JemallocNodumpAllocator` also uses
a dedicated arena and jemalloc tcaches for cache allocations, which can
reduce fragmentation and thus memory usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10685
Reviewed By: riversand963
Differential Revision: D39552261
Pulled By: ltamasi
fbshipit-source-id: b5c58eab6b7c1baa9a307d9f1248df1d7a77d2b5
Summary:
Same as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10632
Test Plan: make crash_test -j32
Reviewed By: anand1976
Differential Revision: D39241479
Pulled By: akankshamahajan15
fbshipit-source-id: 5db5b0c007da786bacc1b30d8926d36d6d029b87
Summary:
Expanded `all_params` to include all parameters crash test may set. Previously, `atomic_flush` was not included in `all_params` and thus was not visible to `finalize_and_sanitize()`. The consequence was manual crash test runs could provide unsafe combinations of parameters to `db_stress`. For example, running `db_crashtest.py` with `-atomic_flush=0` could cause `db_stress` to run with `-atomic_flush=0 -disable_wal=1`, which is known to produce inconsistencies across column families.
While expanding `all_params`, I found we cannot have an entry in it for both `db_stress` and `db_crashtest.py`. So I renamed `enable_tiered_storage` to `test_tiered_storage` for `db_crashtest.py`, which appears more conventional anyways.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10654
Reviewed By: hx235
Differential Revision: D39369349
Pulled By: ajkr
fbshipit-source-id: 31d9010c760c868b20d5e9bd78ba75c8ff3ce348
Summary:
**Context/Summary:**
`ExpectedState` is not aware of transaction-related concept so `use_txn=1 ` is not compatible with `sync_fault_injection=1`. Therefore this PR disabled this combination until we expand our correctness testing to transaction related features.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10605
Test Plan:
- Run the following commands to verify `--use_txn` is correctly sanitized
- `python3 ./tools/db_crashtest.py blackbox --use_txn=1 --sync_fault_injection=1 `
- `python3 ./tools/db_crashtest.py blackbox --use_txn=0 --sync_fault_injection=1 `
Reviewed By: ajkr
Differential Revision: D39121287
Pulled By: hx235
fbshipit-source-id: 7d5d6dd32479ea1c07df4f38322650f3a60def9c
Summary:
Currently, `db_bench` and `db_stress` print the blob cache options even if
a shared block/blob cache is configured, i.e. when they are not actually
in effect. The patch changes this so they are only printed when a separate blob
cache is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10614
Test Plan: Tested manually using `db_bench` and `db_stress`.
Reviewed By: akankshamahajan15
Differential Revision: D39144603
Pulled By: ltamasi
fbshipit-source-id: f714304c5d46186f8514746c27ee6f52aa3e4af8
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.
This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.
The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```
```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```
**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
- Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
- `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573
Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.
Reviewed By: ajkr
Differential Revision: D39005886
Pulled By: hx235
fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
Summary:
CI benchmarks refine configuration
Run only “essential” benchmarks, but for longer
Fix (reduce) the NUM_KEYS to ensure cached behaviour
Reduce level size to try to ensure more levels
Refine test durations again, more time per test, but fewer tests.
In CI benchmark mode, the only read test is readrandom.
There are still 3 mostly-read tests.
Goal is to squeeze complete run a little bit inside 1 hour so it doesn’t clash with the next run (cron scheduled for main branch), but it gets to run as long as possible, so that results are as credible as possible.
Reduce thread count to physical capacity, in an attempt to reduce throughput variance for write heavy tests. See Mark Callaghan’s comments in related documentation..
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10514
Reviewed By: ajkr
Differential Revision: D38952469
Pulled By: jay-zhuang
fbshipit-source-id: 72fa6bba897cc47066ced65facd1fd36e28f30a8
Summary:
The features that cannot work with disable_wal=1 due to unsynced data dropping (ingest_external_file_one_in and enable_compaction_filter) similarly cannot work with sync_fault_injection=1. This PR prevents those features from being used together with sync_fault_injection=1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10559
Reviewed By: hx235
Differential Revision: D38953019
Pulled By: ajkr
fbshipit-source-id: 7e2c7644ec84d7323f632cf976bcee00502d0ed7
Summary:
As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913,
`db_stress` does not have much verification for iterator correctness.
It has a `TestIterate()` function, but that is mainly for comparing results
between two iterators, one with `total_order_seek` and the other optionally
sets auto_prefix, upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33
added a new `TestIterateAgainstExpected()` function that compares iterator against
expected state. It locks a range of keys, creates an iterator, does
a random sequence of `Next/Prev` and compares against expected state.
This PR is based on that commit, the main changes include some logs
(for easier debugging if a test fails), a forward and backward scan to
cover the entire locked key range, and a flag for optionally turning on
this version of Iterator testing.
Added constraint that the checks against expected state in
`TestIterateAgainstExpected()` and in `TestGet()` are only turned on
when `--skip_verifydb` flag is not set.
Remove the change log introduced in https://github.com/facebook/rocksdb/issues/10553.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538
Test Plan:
Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`,
and a large `--iterpercent` and `--num_iterations`. Checked `op_logs`
manually to ensure expected coverage. Tweaked part of the code in
https://github.com/facebook/rocksdb/issues/10449 and stress test was able to catch it.
- internally run various flavor of crash test
Reviewed By: ajkr
Differential Revision: D38847269
Pulled By: cbi42
fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599
Summary:
Optionally issue DeleteRange in `*whilewriting` benchmarks. This happens in `BGWriter` and uses similar logic as in `DoWrite` to issue DeleteRange operations. I added this when I was benchmarking https://github.com/facebook/rocksdb/issues/10547, but this should be an independent PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10552
Test Plan: ran some benchmarks with various delete range options, e.g. `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --writes=200000 --reads=1000000 --disable_auto_compactions --max_num_range_tombstones=10000`
Reviewed By: ajkr
Differential Revision: D38927020
Pulled By: cbi42
fbshipit-source-id: 31ee20cb8127f7173f0816ea0cc2a204ec02aad6
Summary:
After branch 7.6.fb branch is cut, following release process, upgrade version number to 7.7 and add 7.6.fb to format compatibility check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10546
Test Plan: Watch CI
Reviewed By: ajkr
Differential Revision: D38892023
Pulled By: gitbw95
fbshipit-source-id: 94e96dedbd973f5f9713e73d3bed336e4678565b
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535
Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO
No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom : 7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations; 571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom : 7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations; 568.2 MB/s (8271992 of 8271992 found)`
For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom : 5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations; 797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom : 5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations; 790.7 MB/s (11649992 of 11649992 found) `
Reviewed By: akankshamahajan15
Differential Revision: D38774009
Pulled By: anand1976
fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461
Reviewed By: siying
Differential Revision: D38672823
Pulled By: ltamasi
fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
Summary:
Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519
Reviewed By: jay-zhuang
Differential Revision: D38603291
fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
Summary:
Moved linux builds to using docker to avoid CI instability caused by dependency installation site down.
Added the `Dockerfile` which is used to build the image.
The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test).
Also fixed a few issues detected while building this:
* `DestoryDB()` Status not checked for a few tests
* nullptr might be used in `inlineskiplist.cc`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D38554200
Pulled By: jay-zhuang
fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
Summary:
Change tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purpose only,
which is behaving the same as `last_level_temperture` and it will be removed in
the next release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471
Test Plan: CI
Reviewed By: siying
Differential Revision: D38450621
Pulled By: jay-zhuang
fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
Summary:
Currently user_defined_timestamp is failing in stress test with
subcompactions. So disabling it for now and will re enable it once its
fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10503
Test Plan: make crash_test_with_ts -j32
Reviewed By: riversand963
Differential Revision: D38510485
Pulled By: akankshamahajan15
fbshipit-source-id: 82fd0ec8cf86a96ff6653edd5bad7623cb9e0a15
Summary:
- Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable.
```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.
Single thread result:
| Max # tombstones | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op | 99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |6.68 |6.57 |6.72 |4.72 |4.79 |4.54 |
| 1 |6.67 |6.58 |6.62 |5.41 |4.74 |4.72 |
| 10 |6.59 |6.5 |6.56 |7.83 |4.69 |4.59 |
| 100 |6.62 |6.75 |6.58 |29.57 |5.04 |5.09 |
| 1000 |6.54 |6.82 |6.61 |320.33 |5.22 |5.21 |
32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op | 99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |234.52 |260.25 |239.42 |5.06 |5.38 |5.09 |
| 1 |236.46 |262.0 |231.1 |19.57 |22.14 |5.45 |
| 10 |236.95 |263.84 |251.49 |151.73 |21.61 |5.73 |
| 100 |268.16 |296.8 |280.13 |2308.52 |22.27 |6.57 |
Reviewed By: ajkr
Differential Revision: D37916564
Pulled By: cbi42
fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
Summary:
This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns.
The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will cleanup the bundle, including unreferencing the `SuperVersion`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458
Test Plan:
- new DB level test
- measured benefit/regression in a number of memtable scenarios
Setup command:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000
```
Benchmark command:
```
./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10
```
Worst regression is when a key has many tiny operands:
- Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1
- `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%)
The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact.
Best improvement is when a key has large operands and high concurrency:
- Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32
- `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%).
Reviewed By: cbi42
Differential Revision: D38336578
Pulled By: ajkr
fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258
Summary:
If a db_bench process gets hung or runaway on a machine, that
could prevent regression_test.sh from ever making progress. To fix that,
regression_test.sh will now kill any db_bench process that is >12 hours
old. Also made this more reliable by not using string matching (grep) to
get db_bench process IDs.
I also had to make some other updates to get local runs working
reliably:
* Fix some quoting hell and other dubious complexity with db_bench_cmd
* Only save a DB for re-use when building it passes
* Report failed command in more cases
* Add safeguards against "rm -rf ."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10441
Test Plan:
manual (local and remote), with temporary changes e.g. to have
a manageable age threshold etc.
Reviewed By: riversand963
Differential Revision: D38285537
Pulled By: pdillinger
fbshipit-source-id: 4d598876aedc38ac4bd9d8ddf32c5995d8e44db8
Summary:
The secondary cache is randomly disabled or enabled with CompressedSecondaryCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10442
Test Plan: - To test that the CompressedSecondaryCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test `
Reviewed By: anand1976
Differential Revision: D38290796
Pulled By: gitbw95
fbshipit-source-id: bb7027b39e0ed9c0c62835abe09e759898130ec8
Summary:
deleterandom tests are too fast to get good signal, e.g.
--deletes=31250 in 0.170 seconds vs. --reads=1500000 in 288.491
seconds for readrandom. Removing the special handling (unknown
motivation in faa7eb3b99) should suffice.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10437
Test Plan: watch continuous results
Reviewed By: ltamasi
Differential Revision: D38261185
Pulled By: pdillinger
fbshipit-source-id: 0f1b1b19efccda5689027d36cc2f01307f36031d
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309
Reviewed By: ltamasi
Differential Revision: D38211655
Pulled By: gangliao
fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
Summary:
In this PR we bring ClockCache closer to production quality. We implement the following changes:
1. Fixed a few bugs in ClockCache.
2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where either the LRU policy is required, and a test that dynamically increases the table capacity.
4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.
As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418
Test Plan:
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
Reviewed By: pdillinger
Differential Revision: D38170673
Pulled By: guidotag
fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
Summary:
If user runs `db_bench` with `-perf_level=2` or higher, db_bench should
print perf context after each of all benchmarks.
Or make `-perf_level` a per-benchmark switch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10396
Test Plan: ./db_bench -benchmarks=fillseq,readseq -perf_level=2
Reviewed By: ajkr
Differential Revision: D38016324
Pulled By: riversand963
fbshipit-source-id: d83ea4abc34d40ffea394ca6abf0814bc5c0a2e0
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321
Reviewed By: ltamasi
Differential Revision: D37913590
Pulled By: gangliao
fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
Summary:
After branch 7.5.fb branch is cut, following release process, upgrade version number to 7.6 and add 7.5.fb to format compatibility check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10376
Test Plan: Watch CI
Reviewed By: ajkr
Differential Revision: D37927694
fbshipit-source-id: 71b37ae55ebb7c95a1bcc0d7eee643d6ba5f8461
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.
The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.
A new version of the NewClockCache function was created for our internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351
Test Plan: ``make -j24 check`` and re-run the pre-release tests.
Reviewed By: pdillinger
Differential Revision: D37802685
Pulled By: guidotag
fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
Summary:
as title.
Test plan
- make check
- CI on PR
- TEST_TMPDIR=/dev/shm make crash_test_with_multiops_wp_txn (tested with successful run)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10350
Reviewed By: ajkr
Differential Revision: D37792872
Pulled By: riversand963
fbshipit-source-id: ff064093b7f715d0acf387af2e3ae87b1278b52b
Summary:
Previously the version was displayed as $major.$minor
This changes it to $major.$minor.$path
This also adds the git hash for the time from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
who consume/graph these results are OK with it.
Example output:
ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id githash
609488 244.1 1GB 0.0GB, 1.4 0.7 93.3 39 38 0 0 1.6 1.0 4 15 26 5365 15 0.0 0 0.1 0.0 0.5 fillseq.wal_disabled.v400 2022-06-29T13:36:05 7.5.0 6115254416
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277
Test Plan: Run it
Reviewed By: jay-zhuang
Differential Revision: D37532418
Pulled By: mdcallag
fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
Summary:
This adds --undefok to support use of this script with BlobDB for db_bench versions prior
to 7.5 when the options land in a release.
While there is a limit to how far back this script can go WRT backwards compatiblity,
this is an easy change to support early 7.x releases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276
Test Plan: Run it with versions of db_bench that do not and then do support these options
Reviewed By: gangliao
Differential Revision: D37529299
Pulled By: mdcallag
fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
Summary:
Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter.
This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of
the presence of timestamps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259
Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts
Reviewed By: ajkr
Differential Revision: D37459692
Pulled By: riversand963
fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef
Summary:
Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
Reviewed By: riversand963
Differential Revision: D37432948
Pulled By: ajkr
fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
Summary:
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202
Reviewed By: ltamasi
Differential Revision: D37325739
Pulled By: gangliao
fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.
Fixes https://github.com/facebook/rocksdb/issues/10186
I intend to pick this into the 7.4.0 release candidate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220
Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.
Reviewed By: anand1976
Differential Revision: D37315647
Pulled By: pdillinger
fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
Summary:
In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles.
Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37296770
Pulled By: guidotag
fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.
Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
Reviewed By: ltamasi
Differential Revision: D37294735
Pulled By: gangliao
fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
- Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
Reviewed By: ajkr, anand1976
Differential Revision: D37135172
Pulled By: hx235
fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
Test Plan:
- Manual
- Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
- Enabled in MyShadow for 1+ week
- Automated
- Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
- Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
Reviewed By: cbi42
Differential Revision: D36614569
Pulled By: ajkr
fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).
This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).
Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB
Reviewed By: riversand963
Differential Revision: D37212647
Pulled By: pdillinger
fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
Summary:
Fix existing usage of non-ASCII and add a check to prevent
future use. Added `-n` option to greps to provide line numbers.
Alternative to https://github.com/facebook/rocksdb/issues/10147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164
Test Plan:
used new checker to find & fix cases, manually check
db_bench output is preserved
Reviewed By: akankshamahajan15
Differential Revision: D37148792
Pulled By: pdillinger
fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7
Summary:
There is `Options::allow_data_in_errors` that controls whether RocksDB
is allowed to log data, e.g. key, value, etc in LOG files. It is false
by default. However, in db_bench and db_stress, it is often ok to log
data because there is no concern about privacy.
This PR allows db_stress and db_bench to set this option on the command
line, while it remains false by default. Furthermore, make
crash/recovery test driven by db_crashtest.py to opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171
Test Plan: Stress test and db_bench
Reviewed By: hx235
Differential Revision: D37163787
Pulled By: riversand963
fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8
Summary:
We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.
Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154
Test Plan: `make -j24 check`
Reviewed By: pdillinger
Differential Revision: D37124451
Pulled By: guidotag
fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
Summary:
See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move
isn't done for universal when compaction is from L0 into L0. So a too small value for
num_levels with db_bench means there will be fewer trivial moves with universal and
that means that write-amp will increase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158
Test Plan: run it
Reviewed By: siying
Differential Revision: D37122519
Pulled By: mdcallag
fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.
This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```
If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.
Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();
// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
std::shared_ptr<TransactionNotifier> notifier,
TxnTimestamp ts,
std::shared_ptr<const Snapshot>* ret);
// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```
The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```
We also provide two additional APIs for stats collection and reporting purposes.
```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
TxnTimestamp ts_lb,
TxnTimestamp ts_ub,
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```
To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```
Before shutdown, RocksDB will release all timestamped snapshots.
Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.
Timestamped snapshots can also be used with user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```
Reviewed By: siying
Differential Revision: D35783919
Pulled By: riversand963
fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
Summary:
A recent diff add a few more fields to one of the db_bench output lines that gets parsed.
This diff updates tools/benchmark.sh to handle that.
overwrite : 7.939 micros/op 125963 ops/sec; 50.5 MB/s
overwrite : 7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations; 51.0 MB/s
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124
Test Plan: Run it
Reviewed By: jay-zhuang
Differential Revision: D36945137
Pulled By: mdcallag
fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa
Summary:
This PR updates secondary instance testing in stress test by default.
A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary.
Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan
to read some random keys with only very basic verification to make sure no assertion failure is triggered.
Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled.
Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic
catch-up.
In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not
work well when the underlying file is renamed by primary, e.g. SstFileManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush
```
Reviewed By: ajkr
Differential Revision: D36939458
Pulled By: riversand963
fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e
Summary:
The patch attempts to fix three bugs in `verify_random_db.sh`:
1) https://github.com/facebook/rocksdb/pull/9937 changed the default for
`--try_load_options` to true in the script's use case, so we have to
explicitly set it to false if the corresponding argument of the script
is 0. This should fix the issue we've been seeing with our forward
compatibility tests where 7.3 is unable to open a database created by
the version on main after adding a new configuration option.
2) The script seems to support two "extra parameters"; however,
in practice, if the second one was set, only that one was passed on to
`ldb`. Now both get forwarded.
3) When running the `diff` command, the base DB directory was passed as
the second argument instead of the file containing the `ldb` output
(this actually seems to work, probably accidentally though).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10112
Reviewed By: pdillinger
Differential Revision: D36911363
Pulled By: ltamasi
fbshipit-source-id: fe29db4e28d373cee51a12322c59050fc50e926d
Summary:
db_bench can now run with FastLRUCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10096
Test Plan:
- Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 db_bench``. Then test the appropriate code is used by running ``./db_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache.
- Verify that FastLRUCache (currently a clone of LRUCache) produces similar benchmark data than LRUCache, by comparing the outputs of ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=fast_lru_cache`` and ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=lru_cache``.
Reviewed By: gitbw95
Differential Revision: D36898774
Pulled By: guidotag
fbshipit-source-id: f9f6b6f6da124f88b21b3c8dee742fbb04eff773
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
Reviewed By: ltamasi
Differential Revision: D36884156
Pulled By: gangliao
fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
Reviewed By: ajkr, hx235
Differential Revision: D36722135
Pulled By: littlepig2013
fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
Summary:
Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache.
Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081
Test Plan:
- To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`.
- To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`.
Reviewed By: anand1976
Differential Revision: D36839908
Pulled By: guidotag
fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725
Summary:
Thanks to https://github.com/facebook/rocksdb/issues/9919 and https://github.com/facebook/rocksdb/issues/10051 the known bugs in file ingestion (besides mmap read + file checksum) are fixed. Now we can try again to enable file ingestion in crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9357
Test Plan: stress file ingestion heavily for an hour: `$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=1000000 --ingest_external_file_one_in=100 --duration=3600 --interval=20 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152`
Reviewed By: riversand963
Differential Revision: D33410746
Pulled By: ajkr
fbshipit-source-id: d276431390995a67f68390d61c06a40945fdd280
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10053
Need to exit if ldb command fails, to avoid running db_bench on
empty/bad DB and considering the results valid.
Reviewed By: jay-zhuang
Differential Revision: D36673200
fbshipit-source-id: e0d78a0d397e0e335d82d9349bfd612d38ffb552
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
- Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
- Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
- backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
- restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
# Restore
#./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```
Reviewed By: hx235
Differential Revision: D36327418
Pulled By: cbi42
fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
Summary:
Essentially refactored the RangeMayExist implementation in
FullFilterBlockReader to FilterBlockReaderCommon so that it applies to
partitioned filters as well. (The function is not called for the
block-based filter case.) RangeMayExist is essentially a series of checks
around a possible PrefixMayExist, and I'm confident those checks should
be the same for partitioned as for full filters. (I think it's likely
that bugs remain in those checks, but this change is overall a simplifying
one.)
Added auto_prefix_mode support to db_bench
Other small fixes as well
Fixes https://github.com/facebook/rocksdb/issues/10003
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10012
Test Plan:
Expanded unit test that uses statistics to check for filter
optimization, fails without the production code changes here
Performance: populate two DBs with
```
TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters
```
Observe no measurable change in non-partitioned performance
```
TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
```
Before: seekrandom [AVG 15 runs] : 11798 (± 331) ops/sec
After: seekrandom [AVG 15 runs] : 11724 (± 315) ops/sec
Observe big improvement with partitioned (also supported by bloom use statistics)
```
TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
```
Before: seekrandom [AVG 12 runs] : 2942 (± 57) ops/sec
After: seekrandom [AVG 12 runs] : 7489 (± 184) ops/sec
Reviewed By: siying
Differential Revision: D36469796
Pulled By: pdillinger
fbshipit-source-id: bcf1e2a68d347b32adb2b27384f945434e7a266d
Summary:
Start tracking SST unique id in MANIFEST, which is used to verify with
SST properties to make sure the SST file is not overwritten or
misplaced. A DB option `try_verify_sst_unique_id` is introduced to
enable/disable the verification, if enabled, it opens all SST files
during DB-open to read the unique_id from table properties (default is
false), so it's recommended to use it with `max_open_files = -1` to
pre-open the files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990
Test Plan: unittests, format-compatible test, mini-crash
Reviewed By: anand1976
Differential Revision: D36381863
Pulled By: jay-zhuang
fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
Summary:
**Context:**
Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.
Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.
**Summary:**
- Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that
- Added missing db bench/stress test for some memory charging features
- Renamed related test suite to indicate they are under the same theme of memory charging
- Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
- Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926
Test Plan:
- New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
- New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
- CI
- db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**
- db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal
Reviewed By: ajkr
Differential Revision: D36054712
Pulled By: hx235
fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
Summary:
This PR
- since we are testing with disable_wal = true and best_efforts_recovery, we should set column family count to 1, due to the requirement of `ExpectedState` tracking and replaying logic.
- during backup and checkpoint restore, disable best-efforts-recovery. This does not matter now because db_crashtest.py always disables wal when testing best-efforts-recovery. In the future, if we enable wal, then not setting `restore_opitions.best_efforts_recovery` will cause backup db not to recover the WALs, and differ from db (that enables WAL).
- during verification of backup and checkpoint restore, print the key where inconsistency exists between expected state and db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9986
Test Plan: TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
Reviewed By: siying
Differential Revision: D36353105
Pulled By: riversand963
fbshipit-source-id: a484da161273e6216a1f7e245bac15a349693917
Summary:
https://github.com/facebook/rocksdb/issues/9961 broke format_compatible check because of `make clean`
referencing TEST_TMPDIR. The Makefile behavior seems reasonable to me,
so here's a fix in check_format_compatible.sh
Apparently I also included removing a redundant part of our CircleCI config.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9970
Test Plan: manual run: SHORT_TEST=1 ./tools/check_format_compatible.sh
Reviewed By: riversand963
Differential Revision: D36258172
Pulled By: pdillinger
fbshipit-source-id: d46507f04614e888b414ff23b88d040ae2b5c294
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode.
Other notable changes include:
- Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint`
- Made `--read_fault_one_in > 0` return an error in release mode
- Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957
Test Plan:
```
$ DEBUG_LEVEL=0 make -j24 db_stress
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox
```
Reviewed By: anand1976
Differential Revision: D36193830
Pulled By: ajkr
fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321
Summary:
`db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`.
Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947
Test Plan:
- regular blackbox, blackbox --simple
- various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16`
Reviewed By: riversand963
Differential Revision: D36152775
Pulled By: ajkr
fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
This is inspired by debugging a regression test that runs for ~0.05 seconds and the short
running time makes it prone to variance. While db_bench ran for ~60 seconds, 59.95 seconds
was spent opening 128 databases (and doing recovery). So it was harder to notice that the
benchmark only ran for 0.05 seconds.
Normally I add output to the end of the line to make life easier for existing tools that parse it
but in this case the output near the end of the line has two optional parts and one of the optional
parts adds an extra newline.
This is for https://github.com/facebook/rocksdb/issues/9856
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9886
Test Plan:
./db_bench --benchmarks=overwrite,readrandom --num=1000000 --threads=4
old output:
DB path: [/tmp/rocksdbtest-2260/dbbench]
overwrite : 14.108 micros/op 283338 ops/sec; 31.3 MB/s
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 7.994 micros/op 496788 ops/sec; 55.0 MB/s (1000000 of 1000000 found)
new output:
DB path: [/tmp/rocksdbtest-2260/dbbench]
overwrite : 14.117 micros/op 282862 ops/sec 14.141 seconds 4000000 operations; 31.3 MB/s
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 8.649 micros/op 458475 ops/sec 8.725 seconds 4000000 operations; 49.8 MB/s (981548 of 1000000 found)
Reviewed By: ajkr
Differential Revision: D36102269
Pulled By: mdcallag
fbshipit-source-id: 5cd8a9e11f5cbe2a46809571afd83335b6b0caa0
Summary:
If the DB path is specified, the user would expect ldb loads the
options from the path, but it's not:
```
$ ldb list_live_files_metadata --db=`pwd`
```
Default `try_load_options` to true in that case. The user can still
disable that by:
```
$ ldb list_live_files_metadata --db=`pwd` --try_load_options=false
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9937
Test Plan:
`ldb list_live_files_metadata --db=`pwd`` is able to work for
a db generated with different options.num_levels.
Reviewed By: ajkr
Differential Revision: D36106708
Pulled By: jay-zhuang
fbshipit-source-id: 2732fdc027a4d172436b2c9b6a9787b56b10c710
Summary:
When compaction filter determines that a key should be removed, it updates the internal key's type
to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction
together with `SingleDelete`, it will cause compaction iterator to return Corruption.
To fix the issue, compaction filter should return more information in addition to the intention of removing
a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
`kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.
In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.
Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
not quite work with `MultiOpsTxnsStressTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929
Test Plan:
make check
make crash_test
make crash_test_with_txn
Reviewed By: anand1976
Differential Revision: D36069678
Pulled By: riversand963
fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
Summary:
Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions.
1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw.
2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true.
3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn
```
Reviewed By: pdillinger
Differential Revision: D35806678
Pulled By: riversand963
fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616
Summary:
This patch completes the second part of the task: "Add blob support to the dump and dump_live_files command"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9896
Reviewed By: ltamasi
Differential Revision: D35852667
Pulled By: jowlyzhang
fbshipit-source-id: a006456c881f468a92da689e895134762e9574e1
Summary:
This patch is the first part of adding blob dump support. It only adds blob dump support to the dump command. A follow up patch will add blob dump support to the dump_live_files command.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9881
Reviewed By: ltamasi
Differential Revision: D35796731
Pulled By: jowlyzhang
fbshipit-source-id: 2cc5973b222d505a331ac7b969edcf992b47c5ee
Summary:
This patch completes the first part of the task: "Extend all three commands so they can decode and print blob references if a new option --decode_blob_index is specified"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9870
Reviewed By: ltamasi
Differential Revision: D35753932
Pulled By: jowlyzhang
fbshipit-source-id: 9d2bbba0eef2ed86b982767eba9de1b4881f35c9
Summary:
`InitializeOptionsGeneral()` was overwriting many options that were already configured by OPTIONS file, potentially with the flag default values. This PR changes that function to only overwrite options in limited scenarios, as described at the top of its definition. Block cache is still a violation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9862
Test Plan: ran under various scenarios (multi-DB, single DB, OPTIONS file, flags) and verified options are set as expected
Reviewed By: jay-zhuang
Differential Revision: D35736960
Pulled By: ajkr
fbshipit-source-id: 75b77740af37e6f5741618f8a8f5685df2417d03
Summary:
Since they operate at distinct abstraction layers, I thought it
was prudent to combine with EncryptedEnv CI test for each PR, for efficiency
in testing. Also added supported compressions to sst_dump --help output
so that CI job can verify no compiled-in compression support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9850
Test Plan: CI, some manual stuff
Reviewed By: riversand963
Differential Revision: D35682346
Pulled By: pdillinger
fbshipit-source-id: be9879c1533fed304ee32c89fd9ba4b07c2b90cc
Summary:
This change only add decode blob index support to dump_live_files command, which is part of a task to add blob support to a few commands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9842
Reviewed By: ltamasi
Differential Revision: D35650167
Pulled By: jowlyzhang
fbshipit-source-id: a78151b98bc38ac6f52c6e01ca6927a3429ddd14
Summary:
Various renaming and fixes to get rid of remaining uses of
"backupable" which is terminology leftover from the original, flawed
design of BackupableDB. Now any DB can be backed up, using BackupEngine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D35334386
Pulled By: pdillinger
fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329
Summary:
There's an existing benchmark, "getmergeoperands", but it is unconventional in that it has multiple phases and hardcoded setup parameters.
This PR adds a different one, "readrandomoperands", that follows the pattern of other benchmarks of having a single phase and taking its configuration from existing flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9785
Test Plan:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -write_buffer_size=1048576 -max_bytes_for_level_base=4194304 -target_file_size_base=1048576 -compression_type=none -disable_auto_compactions=true
$ ./db_bench -use_existing_db=true -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -disable_auto_compactions=true -duration=10
...
readrandomoperands : 542.082 micros/op 1844 ops/sec; 0.2 MB/s (11980 of 18999 found)
```
Reviewed By: jay-zhuang
Differential Revision: D35290412
Pulled By: ajkr
fbshipit-source-id: fb367ca614b128cef844a75f0e5d9dd7c3328d85
Summary:
Fixes a bug introduced by me in https://github.com/facebook/rocksdb/pull/9733
That PR added a counter so that the per-thread seeds in ThreadState would
be unique even when --benchmarks had more than one test. But it incorrectly
used this counter as the value for ThreadState::tid as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9757
Test Plan:
Confirm that unexpectedly good QPS results on the regression tests return
to normal with this fix. I have confirmed that the QPS increase starts with
the PR 9733 diff.
Reviewed By: jay-zhuang
Differential Revision: D35149303
Pulled By: mdcallag
fbshipit-source-id: dee5cc36b7faaba6c3be6d6a253d3c2eaad72864
Summary:
This is for https://github.com/facebook/rocksdb/issues/9737
I have wasted more than a few hours running db_bench benchmarks where --seed was not set
and getting better than expected results because cache hit rates are great because
multiple invocations of db_bench used the same value for --seed or did not set it,
and then all used 0. The result is that all see the same sequence of keys.
Others have done the same. The problem is worse in that it is easy to miss and the result is a benchmark with results that are misleading.
A good way to avoid this is to set it to the equivalent of gettimeofday() when either
--seed is not set or it is set to 0 (the default).
With this change the actual seed is printed when it was 0 at process start:
Set seed to 1647992570365606 because --seed was 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9740
Test Plan:
Perf results:
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000
readrandom : 6.469 micros/op 154583 ops/sec; 17.1 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=0
readrandom : 6.565 micros/op 152321 ops/sec; 16.9 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=1
readrandom : 6.461 micros/op 154777 ops/sec; 17.1 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=2
readrandom : 6.525 micros/op 153244 ops/sec; 17.0 MB/s (4000000 of 4000000 found)
Reviewed By: jay-zhuang
Differential Revision: D35145361
Pulled By: mdcallag
fbshipit-source-id: 2b35b153ccec46b27d7c9405997523555fc51267
Summary:
This adds the --slow_usecs option with a default value of 1M. Operations that
take this much time have a message printed when --histogram=1, --stats_interval=0
and --stats_interval_seconds=0. The current code hardwired this to 20,000 usecs
and for some stress tests that reduced throughput by 20% or more.
This is for https://github.com/facebook/rocksdb/issues/9620
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9732
Test Plan:
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100 --histogram=1
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100000 --histogram=1
Reviewed By: jay-zhuang
Differential Revision: D35121522
Pulled By: mdcallag
fbshipit-source-id: daf27f937efd748980545d6395db332712fc078b
Summary:
db_bench quietly parses and ignores bad values for --compaction_fadvice and --value_size_distribution_type
I prefer that it fail for them as it does for bad option values in most other cases. Otherwise a benchmark
result will be provided for the wrong configuration and the result will be misleading.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9741
Test Plan:
These now fail:
./db_bench --compaction_fadvice=noney
Unknown compaction fadvice:noney
./db_bench --value_size_distribution_type=norma
Cannot parse distribution type 'norma'
While correct values continue to work:
./db_bench --value_size_distribution_type=normal
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
./db_bench --compaction_fadvice=none
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Reviewed By: siying
Differential Revision: D35115973
Pulled By: mdcallag
fbshipit-source-id: c2b10de5c2d1ea7c7539e676f5bd556351f5d370
Summary:
When --benchmarks has more than one test then the threads in one benchmark
will use the same set of seeds as the threads in the previous benchmark.
This diff fixe that.
This fixes https://github.com/facebook/rocksdb/issues/9632
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9733
Test Plan:
For this command line the block cache is 8GB, so it caches at most 1024 8KB blocks. Note that without
this diff the second run of readrandom has a much better response time because seed reuse means the
second run reads the same 1000 blocks as the first run and they are cached at that point. But with
this diff that does not happen.
./db_bench --benchmarks=fillseq,flush,compact0,waitforcompaction,levelstats,readrandom,readrandom --compression_type=zlib --num=10000000 --reads=1000 --block_size=8192
...
```
Level Files Size(MB)
--------------------
0 0 0
1 11 238
2 9 253
3 0 0
4 0 0
5 0 0
6 0 0
```
--- perf results without this diff
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 46.212 micros/op 21618 ops/sec; 2.4 MB/s (1000 of 1000 found)
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 21.963 micros/op 45450 ops/sec; 5.0 MB/s (1000 of 1000 found)
--- perf results with this diff
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 47.213 micros/op 21126 ops/sec; 2.3 MB/s (1000 of 1000 found)
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 42.880 micros/op 23299 ops/sec; 2.6 MB/s (1000 of 1000 found)
Reviewed By: jay-zhuang
Differential Revision: D35089763
Pulled By: mdcallag
fbshipit-source-id: 1b50143a07afe876b8c8e5fa50dd94a8ce57fc6b
Summary:
This changes db_bench to fail at startup for invalid compression types. It had been
changing them to Snappy. For other invalid options it fails at startup.
This is for https://github.com/facebook/rocksdb/issues/9621
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9729
Test Plan:
This continues to work:
./db_bench --benchmarks=fillrandom --compression_type=lz4
This now fails rather than changing the compression type to Snappy
./db_bench --benchmarks=fillrandom --compression_type=lz44
Cannot parse compression type 'lz44'
Reviewed By: jay-zhuang
Differential Revision: D35081323
Pulled By: mdcallag
fbshipit-source-id: 9b38c835abddce11aa7feb235df63f53cf829981
Summary:
…in order
This fixes https://github.com/facebook/rocksdb/issues/9650
For db_bench --benchmarks=fillseq --num_multi_db=X it loads databases in sequence
rather than randomly choosing a database per Put. The benefits are:
1) avoids long delays between flushing memtables
2) avoids flushing memtables for all of them at the same point in time
3) puts same number of keys per database so that query tests will find keys as expected
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9713
Test Plan:
Using db_bench.1 without the change and db_bench.2 with the change:
for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq --num_multi_db=4 --num=10000000; du -hs /data/m/rx ; done
--- without the change
fillseq : 3.188 micros/op 313682 ops/sec; 34.7 MB/s
real 2m7.787s
user 1m52.776s
sys 0m46.549s
2.7G /data/m/rx
--- with the change
fillseq : 3.149 micros/op 317563 ops/sec; 35.1 MB/s
real 2m6.196s
user 1m51.482s
sys 0m46.003s
2.7G /data/m/rx
Also, temporarily added a printf to confirm that the code switches to the next database at the right time
ZZ switch to db 1 at 10000000
ZZ switch to db 2 at 20000000
ZZ switch to db 3 at 30000000
for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq,readrandom --num_multi_db=4 --num=100000; du -hs /data/m/rx ; done
--- without the change, smaller database, note that not all keys are found by readrandom because databases have < and > --num keys
fillseq : 3.176 micros/op 314805 ops/sec; 34.8 MB/s
readrandom : 1.913 micros/op 522616 ops/sec; 57.7 MB/s (99873 of 100000 found)
--- with the change, smaller database, note that all keys are found by readrandom
fillseq : 3.110 micros/op 321566 ops/sec; 35.6 MB/s
readrandom : 1.714 micros/op 583257 ops/sec; 64.5 MB/s (100000 of 100000 found)
Reviewed By: jay-zhuang
Differential Revision: D35030168
Pulled By: mdcallag
fbshipit-source-id: 2a18c4ec571d954cf5a57b00a11802a3608823ee
Summary:
Changes:
* improves monitoring by displaying average size of a Put value and average scan length
* forces the minimum value size to be 10. Before this it was 0 if you didn't set the distribution parameters.
* uses reasonable defaults for the distribution parameters that determine value size and scan length
* includes seeks in "reads ... found" message, before this they were missing
This is for https://github.com/facebook/rocksdb/issues/9672
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9711
Test Plan:
Before this change:
./db_bench --benchmarks=fillseq,mixgraph --mix_get_ratio=50 --mix_put_ratio=25 --mix_seek_ratio=25 --num=100000 --value_k=0.2615 --value_sigma=25.45 --iter_k=2.517 --iter_sigma=14.236
fillseq : 4.289 micros/op 233138 ops/sec; 25.8 MB/s
mixgraph : 18.461 micros/op 54166 ops/sec; 755.0 MB/s ( Gets:50164 Puts:24919 Seek:24917 of 50164 in 75081 found)
After this change:
./db_bench --benchmarks=fillseq,mixgraph --mix_get_ratio=50 --mix_put_ratio=25 --mix_seek_ratio=25 --num=100000 --value_k=0.2615 --value_sigma=25.45 --iter_k=2.517 --iter_sigma=14.236
fillseq : 3.974 micros/op 251553 ops/sec; 27.8 MB/s
mixgraph : 16.722 micros/op 59795 ops/sec; 833.5 MB/s ( Gets:50164 Puts:24919 Seek:24917, reads 75081 in 75081 found, avg size: 36.0 value, 504.9 scan)
Reviewed By: jay-zhuang
Differential Revision: D35030190
Pulled By: mdcallag
fbshipit-source-id: d8f555f28d869f752ddb674a524108884511b151
Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. In part because of
modular code factoring, it doesn't seem easy to do this automagically,
where opening an SST file and observing current Temperature different
from expected would trigger a change in metadata and DB manifest write
(essentially giving the deep read path access to the write path). It is also
difficult to do this while the DB is open because of the limitations of
LogAndApply.
This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).
Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)
Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is used and want existing DB metadata updated
for it.
* It's possible that for some repair scenarios, lighter weight than
full repair, we might want to support UpdateManifestForFilesState() to
modify critical file details like size or checksum using same
algorithm. But let's make sure these are differentiated from modifying
file details in ways that don't suspect corruption (or require extreme
trust).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683
Test Plan: unit tests added
Reviewed By: jay-zhuang
Differential Revision: D34798828
Pulled By: pdillinger
fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".
To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.
Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660
Test Plan:
related unit test majorly updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.
Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.
Reviewed By: ajkr
Differential Revision: D34686968
Pulled By: pdillinger
fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
Summary:
BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454
Test Plan: Run all existing tests.
Reviewed By: ajkr
Differential Revision: D33805827
fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d
Summary:
**Summary:**
RocksDB uses a block cache to reduce IO and make queries more efficient. The block cache is based on the LRU algorithm (LRUCache) and keeps objects containing uncompressed data, such as Block, ParsedFullFilterBlock etc. It allows the user to configure a second level cache (rocksdb::SecondaryCache) to extend the primary block cache by holding items evicted from it. Some of the major RocksDB users, like MyRocks, use direct IO and would like to use a primary block cache for uncompressed data and a secondary cache for compressed data. The latter allows us to mitigate the loss of the Linux page cache due to direct IO.
This PR includes a concrete implementation of rocksdb::SecondaryCache that integrates with compression libraries such as LZ4 and implements an LRU cache to hold compressed blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9518
Test Plan:
In this PR, the lru_secondary_cache_test.cc includes the following tests:
1. The unit tests for the secondary cache with either compression or no compression, such as basic tests, fails tests.
2. The integration tests with both primary cache and this secondary cache .
**Follow Up:**
1. Statistics (e.g. compression ratio) will be added in another PR.
2. Once this implementation is ready, I will do some shadow testing and benchmarking with UDB to measure the impact.
Reviewed By: anand1976
Differential Revision: D34430930
Pulled By: gitbw95
fbshipit-source-id: 218d78b672a2f914856d8a90ff32f2f5b5043ded
Summary:
When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs.
Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556
Reviewed By: anand1976
Differential Revision: D34308216
Pulled By: sidroyc
fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
Opening DB as seconeary instance has been supported in ldb but it is not mentioned in --help. Mention it there. The part of the help message after the modification:
```
commands MUST specify --db=<full_path_to_db_directory> when necessary
commands can optionally specify
--env_uri=<uri_of_environment> or --fs_uri=<uri_of_filesystem> if necessary
--secondary_path=<secondary_path> to open DB as secondary instance. Operations not supported in secondary instance will fail.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9582
Test Plan: Build and run ldb --help
Reviewed By: riversand963
Differential Revision: D34286427
fbshipit-source-id: e56c5290d0548098ab6acc6dde2167f5a64f34f3
Summary:
I did another pass through running CI jobs. It is uncommon now to see
`db_stress` stuck in the setup phase but still happen.
One reason was repeatedly reading/verifying checksum on filter blocks when
`-cache_index_and_filter_blocks=1` and `-cache_size=1048576`. To address
that I increased the cache size.
Another reason was having a WAL with many range tombstones and every
`db_stress` run using `-avoid_flush_during_recovery=1` (in that
scenario, the setup phase spent too much CPU in
`rocksdb::MemTable::NewRangeTombstoneIteratorInternal()`). To address
that I fixed the `-avoid_flush_during_recovery` setting so it is
reevaluated for every `db_stress` run.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9483
Reviewed By: riversand963
Differential Revision: D33922929
Pulled By: ajkr
fbshipit-source-id: 0a298ec7c4df6f6b44620233996047a2dc7ee5f3
Summary:
This change removes the ability to configure the deprecated,
inefficient block-based filter in the public API. Options that would
have enabled it now use "full" (and optionally partitioned) filters.
Existing block-based filters can still be read and used, and a "back
door" way to build them still exists, for testing and in case of trouble.
About the only way this removal would cause an issue for users is if
temporary memory for filter construction greatly increases. In
HISTORY.md we suggest a few possible mitigations: partitioned filters,
smaller SST files, or setting reserve_table_builder_memory=true.
Or users who have customized a FilterPolicy using the
CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade
their code. (It's long past time for people to move to the new
builder/reader customization interface.)
This change also introduces some internal-use-only configuration strings
for testing specific filter implementations while bypassing some
compatibility / intelligence logic. This is intended to hint at a path
toward making FilterPolicy Customizable, but it also gives us a "back
door" way to configure block-based filter.
Aside: updated db_bench so that -readonly implies -use_existing_db
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535
Test Plan:
Unit tests updated. Specifically,
* BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back
door configuration interface and ignoring of `use_block_based_builder`.
* BlockBasedTableTest.TracingGetTest is migrated from testing
block-based filter access pattern to full filter access patter, by
re-ordering some things.
* Options test (pretty self-explanatory)
Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30`
Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB
With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB
No consistent difference with fillrandom
Reviewed By: jay-zhuang
Differential Revision: D34153871
Pulled By: pdillinger
fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788508
Pulled By: akankshamahajan15
fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
Summary:
After https://github.com/facebook/rocksdb/issues/9481, we are using newer default compiler for
build-format-compatible CircleCI nightly job, which fails on building
2.2.fb.branch branch because it tries to use a pre-compiled libsnappy.a
that is checked into the repo (!). This works around that by setting
SNAPPY_LDFLAGS=-lsnappy, which is only understood by such old versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9517
Test Plan:
Run check_format_compatible.sh on Ubuntu 20 AWS machine,
watch nightly run
Reviewed By: hx235
Differential Revision: D34055561
Pulled By: pdillinger
fbshipit-source-id: 45f9d428dd082f026773bfa8d9dd4dad66fc9378
Summary:
The patch does some cleanup in and around `VersionStorageInfo`:
* Renames the method `PrepareApply` to `PrepareAppend` in `Version`
to make it clear that it is to be called before appending the `Version` to
`VersionSet` (via `AppendVersion`), not before applying any `VersionEdit`s.
* Introduces a helper method `VersionStorageInfo::PrepareForVersionAppend`
(called by `Version::PrepareAppend`) that encapsulates the population of the
various derived data structures in `VersionStorageInfo`, and turns the
methods computing the derived structures (`UpdateNumNonEmptyLevels`,
`CalculateBaseBytes` etc.) into private helpers.
* Changes `Version::PrepareAppend` so it only calls `UpdateAccumulatedStats`
if the `update_stats` flag is set. (Earlier, this was checked by the callee.)
Related to this, it also moves the call to `ComputeCompensatedSizes` to
`VersionStorageInfo::PrepareForVersionAppend`.
* Updates and cleans up `version_builder_test`, `version_set_test`, and
`compaction_picker_test` so `PrepareForVersionAppend` is called anytime
a new `VersionStorageInfo` is set up or saved. This cleanup also involves
splitting `VersionStorageInfoTest.MaxBytesForLevelDynamic`
into multiple smaller test cases.
* Fixes up a bunch of comments that were outdated or just plain incorrect.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9494
Test Plan: Ran `make check` and the crash test script for a while.
Reviewed By: riversand963
Differential Revision: D33971666
Pulled By: ltamasi
fbshipit-source-id: fda52faac7783041126e4f8dec0fe01bdcadf65a
Summary:
Added a CountedFileSystem that tracks a number of file operations (opens, closes, deletes, renames, flushes, syncs, fsyncs, reads, writes). This class was based on the ReportFileOpEnv from db_bench.
This is a stepping stone PR to be able to change the SpecialEnv into a SpecialFileSystem, where several of the file varieties wish to do operation counting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9283
Reviewed By: pdillinger
Differential Revision: D33062004
Pulled By: mrambacher
fbshipit-source-id: d0d297a7fb9c48c06cbf685e5fa755c27193b6f5
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum
This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.
Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.
**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
- See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
- Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342
Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
- For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
- FastLocalBloom
- Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
- After change:
- `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
- Standard128Ribbon
- Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
- After change:
- `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.
Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
Summary:
Apparently setting total_order_seek=true for DB::Get was
intended to allow accurate read semantics if the current prefix
extractor doesn't match what was used to generate SST files on
disk. But since prefix_extractor was made a mutable option in 5.14.0, we
have been able to detect this case and provide the correct semantics
regardless of the total_order_seek option. Since that time, the option
has only made Get() slower in a reasonably common case: prefix_extractor
unchanged and whole_key_filtering=false.
So this change primarily removes unnecessary effect of
total_order_seek on Get. Also cleans up some related comments.
Also adds a -total_order_seek option to db_bench and canonicalizes
handling of ReadOptions in db_bench so that command line options have
the expected association with library features. (There is potential
for change in regression test behavior, but the old behavior is likely
indefensible, or some other inconsistency would need to be fixed.)
TODO in follow-up work: there should be no reason for Get() to depend on
current prefix extractor at all.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427
Test Plan:
Unit tests updated.
Performance (using db_bench update)
Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`
Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Before this change, total_order_seek=false: 25188 ops/sec
Before this change, total_order_seek=true: 1222 ops/sec (~20x slower)
After this change, total_order_seek=false: 24570 ops/sec
After this change, total_order_seek=true: 25012 ops/sec (indistinguishable)
Reviewed By: siying
Differential Revision: D33753458
Pulled By: pdillinger
fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
Summary:
Despite attempts to optimize `db_stress` setup phase (i.e.,
pre-`OperateDb()`) latency in https://github.com/facebook/rocksdb/issues/9470 and https://github.com/facebook/rocksdb/issues/9475, it still always took tens
of seconds. Since we still aren't able to setup a 100M key `db_stress`
quickly, we should reduce the number of keys. This PR reduces it 4x
while increasing `value_size_mult` 4x (from its default value of 8) so
that memtables and SST files fill at a similar rate compared to before this PR.
Also disabled bzip2 compression since we'll probably never use it and
I noticed many CI runs spending majority of CPU on bzip2 decompression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9476
Reviewed By: siying
Differential Revision: D33898520
Pulled By: ajkr
fbshipit-source-id: 855021784ad9664f2be5bce21f0339a1cf93230d
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455
Test Plan: Rely on my eyeball and CI
Reviewed By: ajkr
Differential Revision: D33811664
Pulled By: hx235
fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
Summary:
In RocksDB DBOptions::skip_log_error_on_recovery is marked as
"NOT SUPPORTED" for a long time, and setting this option does not have
any effect on the behavior of RocksDB library. Therefore, we are removing it
in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9434
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33763015
Pulled By: akankshamahajan15
fbshipit-source-id: 11f09643298da6c02d3dcdb090b996f4c3cfdd76
Summary:
Even after https://github.com/facebook/rocksdb/issues/9461 could see
```
Error: please specify prefix_size for test_batches_snapshots test!
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9463
Test Plan:
run `make blackbox_crashtest` for a long time. (Unfortunately,
it's taking a long time to reproduce these failures)
Reviewed By: akankshamahajan15
Differential Revision: D33838152
Pulled By: pdillinger
fbshipit-source-id: b9a73c5bbb68df53f14c22b9b52f61d1f7ef38af
Summary:
The API is deprecated long time ago. Clean up the codebase by
removing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462
Test Plan: CI, fake release: D33835220
Reviewed By: riversand963
Differential Revision: D33835103
Pulled By: jay-zhuang
fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
Summary:
Changes in https://github.com/facebook/rocksdb/issues/9453 could trigger
```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9461
Test Plan: run `make blackbox_crashtest` for a long time
Reviewed By: ajkr
Differential Revision: D33830751
Pulled By: pdillinger
fbshipit-source-id: be88377dcaa47e4bb7adb0347762639eff8f1476
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.
In reference to https://github.com/facebook/rocksdb/issues/9389
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33780269
Pulled By: pdillinger
fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.
This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)
Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453
Test Plan:
new unit test added. Added memtable Bloom to crash test.
Performance testing
--------------------
Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```
Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)
Before & after run simultaneously, average over 10 iterations per data point, ops/sec.
MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066
MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990
MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before: 12081715
After: 12117603
MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591
Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.
Reviewed By: ajkr
Differential Revision: D33805025
Pulled By: pdillinger
fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452
Test Plan: Rely on my eyeball and CI
Reviewed By: ajkr
Differential Revision: D33804938
Pulled By: hx235
fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
Summary:
Add an option to set the WAL compression algorithm - wal_compression.
TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432
Reviewed By: pdillinger
Differential Revision: D33797275
Pulled By: sidroyc
fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
Summary:
This PR is one proposal to resolve https://github.com/facebook/rocksdb/issues/9382.
Looking at the code, I can't think of a reason why rdb is an internal component of RocksDB: it does not require
any header files NOT in `include/rocksdb`. It's a better idea to host it somewhere else.
Plus, rdb requires python2 which is not supported any more. No fixes or improvements will be made, even for potential
security bugs (https://www.python.org/doc/sunset-python-2/).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9399
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D33641965
Pulled By: riversand963
fbshipit-source-id: 2a6a74693e5de36834f355e41d6865db206af48b
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.
Java/JNI is not included yet, and needs to be done later if necessary.
The goal is to include this commit in RocksDB 7.0 release.
Reference:
https://github.com/ajkr/dedupfs by ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170
Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.
make check
Reviewed By: ajkr
Differential Revision: D33751662
Pulled By: riversand963
fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.
Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363
Test Plan:
basic coverage in unit tests
Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see
force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s
Reviewed By: mrambacher
Differential Revision: D33636559
Pulled By: pdillinger
fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369
Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all
Reviewed By: pdillinger
Differential Revision: D33517260
Pulled By: riversand963
fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (https://github.com/facebook/rocksdb/issues/8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.
To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.
Depends on https://github.com/facebook/rocksdb/issues/9305.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9338
Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.
Reviewed By: riversand963
Differential Revision: D33345333
Pulled By: ajkr
fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293
Reviewed By: pdillinger, zhichao-cao
Differential Revision: D33181591
Pulled By: mrambacher
fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
Summary:
Several improvements to SimulatedHybridFileSystem:
(1) Allow a mode where all I/Os to all files simulate HDD. This can be enabled in db_bench using -simulate_hdd
(2) Latency calculation is slightly more accurate
(3) Allow to simulate more than one HDD spindles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9301
Test Plan: Run db_bench and observe the results are reasonable.
Reviewed By: jay-zhuang
Differential Revision: D33141662
fbshipit-source-id: b736e58c4ba910d06899cc9ccec79b628275f4fa
Summary:
db_crashtest.py uses multiple CFs only when run without flag `--simple`.
The previous config set `-test_batches_snapshots=1` in that case for
blackbox mode. But `-test_batches_snapshots=1` cannot verify recovery
correctness, so it should not always be set for multi-CF blackbox tests.
We can instead randomly toggle it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9303
Reviewed By: riversand963
Differential Revision: D33155229
Pulled By: ajkr
fbshipit-source-id: 4a6fdc4eddccc8ece664063baf6393ce1c5de6b7
Summary:
SimulatedHybridFileSystem now takes a more thorough simualtion of an HDD:
1. cover writes too, not just read
2. Latency and throughput is now simulated as seek + read time, using a rate limiter
This implementation can be modified to simulate full HDD behavior, which is not yet done.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9259
Test Plan: Run db_bench and observe the desired behavior.
Reviewed By: jay-zhuang
Differential Revision: D32903039
fbshipit-source-id: a83f5d72143e114d5e75edf39d647bf0b71978e1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
When table_options.prepopulate_block_cache is set to
BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and
table_options.partition_filters is also set true, then there is
segmentation failure when top level filter is fetched because its
entered with wrong type in cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263
Test Plan:
Updated unit tests;
Ran db_stress: make crash_test -j32
Reviewed By: pdillinger
Differential Revision: D32936566
Pulled By: akankshamahajan15
fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426
Summary:
Fix a bug that causes file temperature not preserved after DB is restarted, or options.max_manifest_file_size is hit.
Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242
Test Plan: Add a unit test that would fail without the fix.
Reviewed By: jay-zhuang
Differential Revision: D32818150
fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec
Summary:
Disable the QPS verification in test temporally, which causes the test failure due to different system delays.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9190
Test Plan: make check
Reviewed By: siying
Differential Revision: D32576289
Pulled By: zhichao-cao
fbshipit-source-id: 1df972e77dd82eed5af3462e5db5e141aadf8fae
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
Fix the analyzer test failure caused by inaccurate timing wait. The wait time at different system might be different or cause the delay, now we do not accurately count the lines. Only in a very rare extreme case, test will ignore the part exceed the timing of 1 second.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9181
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D32511319
Pulled By: zhichao-cao
fbshipit-source-id: e694c8cb465c750cfa5a43dab3eff6707b9a11c8
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.
This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.
1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056
Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.
Reviewed By: anand1976
Differential Revision: D31773640
Pulled By: akankshamahajan15
fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
Reviewed By: riversand963
Differential Revision: D32007222
Pulled By: anand1976
fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
Summary:
It is useful to add options.manual_wal_flush to db_bench
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9132
Test Plan: Run the benchamrk with the option.
Reviewed By: ltamasi
Differential Revision: D32188060
fbshipit-source-id: a70835d3cad0f30095218dfda1daff0a432892e5
Summary:
Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles.
When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed.
In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749
Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.
Reviewed By: jay-zhuang
Differential Revision: D30735261
fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.
Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument
This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.
DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.
Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.
### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
As you can see, especially once warmed up, xxh3 is fastest.
### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
Test
for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
Results (ops/sec)
for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3
Reviewed By: mrambacher
Differential Revision: D31905249
Pulled By: pdillinger
fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
Summary:
This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty.
In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control.
Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655
Test Plan: Add unit tests to the functionality.
Reviewed By: ajkr
Differential Revision: D31787034
fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
Summary:
https://github.com/facebook/rocksdb/issues/8031 broke internal tests. This should fix but also preserve
the intended capability of getting git commit id when hg not used
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9047
Test Plan: already broken ¯\\_(ツ)_/¯
Reviewed By: zhichao-cao
Differential Revision: D31732198
Pulled By: pdillinger
fbshipit-source-id: 7dba8531ddca55a6de5e04978a1a1601aae4cee9
Summary:
Set the perf_level in ```tools/regression_test.sh``` in order to exercise ```PerfContext``` counters in regression tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8031
Test Plan: Manually run the script
Reviewed By: ajkr
Differential Revision: D31508269
Pulled By: anand1976
fbshipit-source-id: 20ddfd1cbca37f1439eed2870086a86d90653b44
Summary:
The patch adds a new BlobDB benchmarking script called `run_blob_bench.sh`.
It is a thin wrapper around `benchmark.sh` (similarly to `run_flash_bench.sh`):
it actually calls `benchmark.sh` a number of times, cycling through six workloads,
two write-only ones (bulk load and overwrite), two read/write ones (point lookups
while writing, range scans while writing), and two read-only ones (point lookups
and range scans).
Note: this is a simpler/cleaned up/reworked version of the script used to produce the
benchmark results in http://rocksdb.org/blog/2021/05/26/integrated-blob-db.html .
The new version takes advantage of several recent `benchmark.sh` improvements
like the ability to pass in arbitrary `db_bench` options or the possibility of using a
job ID.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9015
Test Plan: Ran the script manually with different parameter combinations.
Reviewed By: riversand963
Differential Revision: D31555277
Pulled By: ltamasi
fbshipit-source-id: 0e151b2f7b2cf6f66ed7f95455571492ad7ea87f
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.
In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)
This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
Reviewed By: riversand963
Differential Revision: D31489850
Pulled By: ltamasi
fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.
The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995
Test Plan:
- Verified it fixes the following failure:
```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```
- `make check -j48`
Reviewed By: ltamasi
Differential Revision: D31495388
Pulled By: ajkr
fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
Summary:
Enable SingleDelete with user defined timestamp in db_bench,
db_stress and crash test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8971
Test Plan:
1. For db_stress, ran the command for full duration: i) python3 -u tools/db_crashtest.py
--enable_ts whitebox --nooverwritepercent=100
ii) make crash_test_with_ts
2. For db_bench, ran: ./db_bench -benchmarks=randomreplacekeys
-user_timestamp_size=8 -use_single_deletes=true
Reviewed By: riversand963
Differential Revision: D31246558
Pulled By: akankshamahajan15
fbshipit-source-id: 29cd8740c9921341e52f09242fca3c44d75a12b7
Summary:
The ldb list_live_files_metadata command does not print any information about blob files currently. We would like to add this functionality. Note that list_live_files_metadata has two different modes of operation: the one shown above, which shows the LSM tree structure, and another one, which can be enabled using the flag --sort_by_filename and simply lists the files in numerical order regardless of level. We would like to show blob files in both modes.
Changes:
1. Using GetAllColumnFamilyMetaData API instead of GetLiveFilesMetaData API for fetching live files data.
Testing:
1. Created a sample rocksdb instance using dbbench command (this creates both SST and blob files)
2. Checked if the blob files are listed or not by using ldb commands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8976
Reviewed By: ltamasi
Differential Revision: D31316061
Pulled By: pradeepambati
fbshipit-source-id: d15cdea192febf7a45f28deee2ba40615d3d84ab
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
Summary:
This is a precursor refactoring to enable an upcoming feature: persistence failure correctness testing.
- Changed `--expected_values_path` to `--expected_values_dir` and migrated "db_crashtest.py" to use the new flag. For persistence failure correctness testing there are multiple possible correct states since unsynced data is allowed to be dropped. Making it possible to restore all these possible correct states will eventually involve files containing snapshots of expected values and DB trace files.
- The expected values directory is managed by an `ExpectedStateManager` instance. Managing expected state files is separated out of `SharedState` to prevent `SharedState` from becoming too complex when the new files and features (snapshotting, tracing, and restoring) are introduced.
- Migrated expected values file access/management out of `SharedState` into a separate class called `ExpectedState`. This is not exposed directly to the test but rather the `ExpectedState` for the latest values file is accessed via a pass-through API on `ExpectedStateManager`. This forces the test to always access the single latest `ExpectedState`.
- Changed the initialization of the latest expected values file to use a tempfile followed by rename, and also add cleanup logic for possible stranded tempfiles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8913
Test Plan:
run in several ways; try to make sure it's not obviously broken.
- crashtest blackbox without TEST_TMPDIR
```
$ python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest blackbox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest whitebox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py whitebox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none --random_kill_odd=88887
```
- db_stress without expected_values_dir
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true
```
- db_stress with expected_values_dir and manual corruption
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true --expected_values_dir=./
// modify one byte in "./LATEST.state"
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=false --expected_values_dir=./
...
Verification failed for column family 0 key 0000000000000000 (0): Value not found: NotFound:
...
```
Reviewed By: riversand963
Differential Revision: D30921951
Pulled By: ajkr
fbshipit-source-id: babfe218062e55d018c9b046536c0289fb78f41c
Summary:
For now, disable it since the below command indicates it can cause a
failure. Running that command with `-experimental_mempurge_threshold=0`
has been running successfully for several minutes, whereas before it
failed in seconds.
```
$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8958
Reviewed By: ltamasi
Differential Revision: D31187059
Pulled By: ajkr
fbshipit-source-id: 04d5bfb4fcc4f5b66233e691427dfd940c67037f
Summary:
Several improvements to MultiRead:
1. Fix a bug in stress test which causes false positive when both MultiRead() return and individual read request have failure injected.
2. Add two more types of fault that should be handled: empty read results and checksum mismatch
3. Add a message indicating which type of fault is injected
4. Increase the failure rate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8937
Reviewed By: anand1976
Differential Revision: D31085930
fbshipit-source-id: 3a04994a3cadebf9a64d25e1fe12b14b7a272fba
Summary:
In FileChecksumTestHelper::VerifyEachFileChecksum(), we query the file list, and then for each file in the list verify the checksum. However, compaction can delete those files in the mean time and cause failures. To prevent it from happening, disable file deletion during the validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8942
Test Plan: Run exsiting test and see it doesn't fail.
Reviewed By: pdillinger
Differential Revision: D31086488
fbshipit-source-id: 554608f36d2dd3bf0a20dfc4039c68bd8533d7f8
Summary:
In case of IO uring bugs, we need to provide a way for users to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8931
Test Plan: Manually run db_bench with/without the option and verify the behavior
Reviewed By: pdillinger
Differential Revision: D31040252
Pulled By: anand1976
fbshipit-source-id: 56f2537d6ac8488c9e126296d8190ad9e0158f70
Summary:
As title. The reason is that after loading customized options, the env is not set back to the correct one. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8929
Test Plan: Manually validate in an environment where the command failed.
Reviewed By: riversand963
Differential Revision: D31026931
fbshipit-source-id: c25dc788bf80ed5bf4b24922c442781943bcd65b
Summary:
Because even 32-bit systems can have large files
This is a "change" that I don't want intermingled with an upcoming refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8926
Test Plan: CI
Reviewed By: zhichao-cao
Differential Revision: D31020974
Pulled By: pdillinger
fbshipit-source-id: ca9eb4510697df6f1f55e37b37730b88b1809a92
Summary:
* Started on some proper usage text to document the options
* Added a `JOB_ID` parameter, so that we can trace jobs and relate them to other assets
* Now generates a correct TSV file of the summary
* Summary has new additional fields:
* RocksDB Version
* Date
* Job ID
* db_bench log files now also include the Job ID
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8730
Reviewed By: mrambacher
Differential Revision: D30747344
Pulled By: jay-zhuang
fbshipit-source-id: 87eb78d20959b6d95804aebf129606fa9c71f407
Summary:
These tests would frequently fail to find SST files due to race
condition in running ldb (read-only) on an open DB which might do automatic
compaction. But only sometimes would that failure translate into test
failure because the implementation of ldb file_checksum_dump would
swallow many errors. Now,
* DB closed while running ldb to avoid unnecessary race condition
* Detect and report/propagate more failures in `ldb file_checksum_dump`
* Use --hex so that random binary data is not printed to console
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8898
Test Plan: ./ldb_cmd_test --gtest_filter=*Checksum* --gtest_repeat=100
Reviewed By: zhichao-cao
Differential Revision: D30848738
Pulled By: pdillinger
fbshipit-source-id: 20290b517eeceba99bb538bb5a17088f7e878405
Summary:
This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership. Previously, many classes implemented their own ownership solutions. Fixes https://github.com/facebook/rocksdb/issues/8606
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8618
Reviewed By: pdillinger
Differential Revision: D30136064
Pulled By: mrambacher
fbshipit-source-id: d0bf471df8818dbb1770a86335fe98f761cca193
Summary:
Make the Statistics object into a Customizable object. Statistics can now be stored and created to/from the Options file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8637
Reviewed By: zhichao-cao
Differential Revision: D30530550
Pulled By: mrambacher
fbshipit-source-id: 5fc7d01d8431f37b2c205bbbd8342c9f697023bd
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
Summary:
* Don't hardcode namespace rocksdb (use ROCKSDB_NAMESPACE)
* Don't #include <rocksdb/...> (use double quotes)
* Support putting NOCOMMIT (any case) in source code that should not be
committed/pushed in current state.
These will be run with `make check` and in GitHub actions
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8821
Test Plan: existing tests, manually try out new checks
Reviewed By: zhichao-cao
Differential Revision: D30791726
Pulled By: pdillinger
fbshipit-source-id: 399c883f312be24d9e55c58951d4013e18429d92
Summary:
Regression test is broken and not running:
1. failed test is not reporting, fix it by add `set -e`
2. internal regression test is not run inside github, removing that
3. fix a few minor issues to pass the test
4. delete unused binary size build, and regression test is reporting binary size now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8753
Test Plan: CI: https://www.internalfb.com/intern/sandcastle/job/13510799359573861
Reviewed By: ltamasi
Differential Revision: D30754380
Pulled By: jay-zhuang
fbshipit-source-id: 0cfa008327fff31bc61118a3fe642924090d28e1
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
Test Plan: existing tests
Reviewed By: mrambacher
Differential Revision: D30700039
Pulled By: pdillinger
fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
Summary:
Current implementation does not support user-defined timestamp when
block-based filter is used. Will implement the support in the future, or
wait to see if block-based filter can be deprecated and removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8703
Test Plan: make whitebox_crash_test_with_ts
Reviewed By: pdillinger
Differential Revision: D30528931
Pulled By: riversand963
fbshipit-source-id: 60dd74ee0a6194e69072069d8c4bd876f249f38d
Summary:
Handler functions now use a common output function to output to stdout/files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8697
Test Plan: `trace_analyzer_test` can pass.
Reviewed By: zhichao-cao
Differential Revision: D30527696
Pulled By: autopear
fbshipit-source-id: c626cf4d53a39665a9c4bcf0cb019c448434abe4
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.
So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)
I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.
C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.
BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
Test Plan: new + updated tests, including crash test
Reviewed By: jay-zhuang
Differential Revision: D30445797
Pulled By: pdillinger
fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.
New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".
`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
Reviewed By: ajkr
Differential Revision: D30290216
Pulled By: autopear
fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
The last few releases overlooked adding to this test. This
change fixes that.
This change also fixes the problem of older branches not understanding
ROCKSDB_NO_FBCODE and referencing compilers no longer supported.
During the test, build_detect_platform is patched to force no FBCODE
compiler usage. (We should not need to update old branches perpetually.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8651
Test Plan: local run reproduces regression described in https://github.com/facebook/rocksdb/issues/8650
Reviewed By: jay-zhuang, zhichao-cao
Differential Revision: D30261872
Pulled By: pdillinger
fbshipit-source-id: 02b447d224d7e0eb8613c63185437ded146713bc
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
Reviewed By: pdillinger
Differential Revision: D30149315
Pulled By: bjlemaire
fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
Summary:
`GenericRateLimiter` slow path handles requests that cannot be satisfied
immediately. Such requests enter a queue, and their thread stays in `Request()`
until they are granted or the rate limiter is stopped. These threads are
responsible for unblocking themselves. The work to do so is split into two main
duties.
(1) Waiting for the next refill time.
(2) Refilling the bytes and granting requests.
Prior to this PR, the slow path logic involved a leader election algorithm to
pick one thread to perform (1) followed by (2). It elected the thread whose
request was at the front of the highest priority non-empty queue since that
request was most likely to be granted. This algorithm was efficient in terms of
reducing intermediate wakeups, which is a thread waking up only to resume
waiting after finding its request is not granted. However, the conceptual
complexity of this algorithm was too high. It took me a long time to draw a
timeline to understand how it works for just one edge case yet there were so
many.
This PR drops the leader election to reduce conceptual complexity. Now, the two
duties can be performed by whichever thread acquires the lock first. The risk
of this change is increasing the number of intermediate wakeups, however, we
took steps to mitigate that.
- `wait_until_refill_pending_` flag ensures only one thread performs (1). This\
prevents the thundering herd problem at the next refill time. The remaining\
threads wait on their condition variable with an unbounded duration -- thus we\
must remember to notify them to ensure forward progress.
- (1) is typically done by a thread at the front of a queue. This is trivial\
when the queues are initially empty as the first choice that arrives must be\
the only entry in its queue. When queues are initially non-empty, we achieve\
this by having (2) notify a thread at the front of a queue (preferring higher\
priority) to perform the next duty.
- We do not require any additional wakeup for (2). Typically it will just be\
done by the thread that finished (1).
Combined, the second and third bullet points above suggest the refill/granting
will typically be done by a request at the front of its queue. This is
important because one wakeup is saved when a granted request happens to be in an
already running thread.
Note there are a few cases that still lead to intermediate wakeup, however. The
first two are existing issues that also apply to the old algorithm, however, the
third (including both subpoints) is new.
- No request may be granted (only possible when rate limit dynamically\
decreases).
- Requests from a different queue may be granted.
- (2) may be run by a non-front request thread causing it to not be granted even\
if some requests in that same queue are granted. It can happen for a couple\
(unlikely) reasons.
- A new request may sneak in and grab the lock at the refill time, before the\
thread finishing (1) can wake up and grab it.
- A new request may sneak in and grab the lock and execute (1) before (2)'s\
chosen candidate can wake up and grab the lock. Then that non-front request\
thread performing (1) can carry over to perform (2).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602
Test Plan:
- Use existing tests. The edge cases listed in the comment are all performance\
related; I could not really think of any related to correctness. The logic\
looks the same whether a thread wakes up/finishes its work early/on-time/late,\
or whether the thread is chosen vs. "steals" the work.
- Verified write throughput and CPU overhead are basically the same with and\
without this change, even in a rate limiter heavy workload:
Test command:
```
$ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
```
Results before this PR:
```
fillrandom : 108.463 micros/op 9219 ops/sec; 9.0 MB/s
7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
```
Results after this PR:
```
fillrandom : 108.108 micros/op 9250 ops/sec; 9.0 MB/s
7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
```
Reviewed By: hx235
Differential Revision: D30048013
Pulled By: ajkr
fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310
Test Plan: Add several unit tests.
Reviewed By: ajkr
Differential Revision: D28493792
fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501