Summary:
Start tracking SST unique id in MANIFEST, which is used to verify with
SST properties to make sure the SST file is not overwritten or
misplaced. A DB option `try_verify_sst_unique_id` is introduced to
enable/disable the verification, if enabled, it opens all SST files
during DB-open to read the unique_id from table properties (default is
false), so it's recommended to use it with `max_open_files = -1` to
pre-open the files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990
Test Plan: unittests, format-compatible test, mini-crash
Reviewed By: anand1976
Differential Revision: D36381863
Pulled By: jay-zhuang
fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
Summary:
**Context:**
Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.
Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.
**Summary:**
- Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that
- Added missing db bench/stress test for some memory charging features
- Renamed related test suite to indicate they are under the same theme of memory charging
- Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
- Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926
Test Plan:
- New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
- New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
- CI
- db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**
- db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal
Reviewed By: ajkr
Differential Revision: D36054712
Pulled By: hx235
fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
Summary:
This PR
- since we are testing with disable_wal = true and best_efforts_recovery, we should set column family count to 1, due to the requirement of `ExpectedState` tracking and replaying logic.
- during backup and checkpoint restore, disable best-efforts-recovery. This does not matter now because db_crashtest.py always disables wal when testing best-efforts-recovery. In the future, if we enable wal, then not setting `restore_opitions.best_efforts_recovery` will cause backup db not to recover the WALs, and differ from db (that enables WAL).
- during verification of backup and checkpoint restore, print the key where inconsistency exists between expected state and db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9986
Test Plan: TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
Reviewed By: siying
Differential Revision: D36353105
Pulled By: riversand963
fbshipit-source-id: a484da161273e6216a1f7e245bac15a349693917
Summary:
https://github.com/facebook/rocksdb/issues/9961 broke format_compatible check because of `make clean`
referencing TEST_TMPDIR. The Makefile behavior seems reasonable to me,
so here's a fix in check_format_compatible.sh
Apparently I also included removing a redundant part of our CircleCI config.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9970
Test Plan: manual run: SHORT_TEST=1 ./tools/check_format_compatible.sh
Reviewed By: riversand963
Differential Revision: D36258172
Pulled By: pdillinger
fbshipit-source-id: d46507f04614e888b414ff23b88d040ae2b5c294
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode.
Other notable changes include:
- Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint`
- Made `--read_fault_one_in > 0` return an error in release mode
- Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957
Test Plan:
```
$ DEBUG_LEVEL=0 make -j24 db_stress
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox
```
Reviewed By: anand1976
Differential Revision: D36193830
Pulled By: ajkr
fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321
Summary:
`db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`.
Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947
Test Plan:
- regular blackbox, blackbox --simple
- various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16`
Reviewed By: riversand963
Differential Revision: D36152775
Pulled By: ajkr
fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
This is inspired by debugging a regression test that runs for ~0.05 seconds and the short
running time makes it prone to variance. While db_bench ran for ~60 seconds, 59.95 seconds
was spent opening 128 databases (and doing recovery). So it was harder to notice that the
benchmark only ran for 0.05 seconds.
Normally I add output to the end of the line to make life easier for existing tools that parse it
but in this case the output near the end of the line has two optional parts and one of the optional
parts adds an extra newline.
This is for https://github.com/facebook/rocksdb/issues/9856
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9886
Test Plan:
./db_bench --benchmarks=overwrite,readrandom --num=1000000 --threads=4
old output:
DB path: [/tmp/rocksdbtest-2260/dbbench]
overwrite : 14.108 micros/op 283338 ops/sec; 31.3 MB/s
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 7.994 micros/op 496788 ops/sec; 55.0 MB/s (1000000 of 1000000 found)
new output:
DB path: [/tmp/rocksdbtest-2260/dbbench]
overwrite : 14.117 micros/op 282862 ops/sec 14.141 seconds 4000000 operations; 31.3 MB/s
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 8.649 micros/op 458475 ops/sec 8.725 seconds 4000000 operations; 49.8 MB/s (981548 of 1000000 found)
Reviewed By: ajkr
Differential Revision: D36102269
Pulled By: mdcallag
fbshipit-source-id: 5cd8a9e11f5cbe2a46809571afd83335b6b0caa0
Summary:
If the DB path is specified, the user would expect ldb loads the
options from the path, but it's not:
```
$ ldb list_live_files_metadata --db=`pwd`
```
Default `try_load_options` to true in that case. The user can still
disable that by:
```
$ ldb list_live_files_metadata --db=`pwd` --try_load_options=false
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9937
Test Plan:
`ldb list_live_files_metadata --db=`pwd`` is able to work for
a db generated with different options.num_levels.
Reviewed By: ajkr
Differential Revision: D36106708
Pulled By: jay-zhuang
fbshipit-source-id: 2732fdc027a4d172436b2c9b6a9787b56b10c710
Summary:
When compaction filter determines that a key should be removed, it updates the internal key's type
to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction
together with `SingleDelete`, it will cause compaction iterator to return Corruption.
To fix the issue, compaction filter should return more information in addition to the intention of removing
a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
`kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.
In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.
Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
not quite work with `MultiOpsTxnsStressTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929
Test Plan:
make check
make crash_test
make crash_test_with_txn
Reviewed By: anand1976
Differential Revision: D36069678
Pulled By: riversand963
fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
Summary:
Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions.
1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw.
2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true.
3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn
```
Reviewed By: pdillinger
Differential Revision: D35806678
Pulled By: riversand963
fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616
Summary:
This patch completes the second part of the task: "Add blob support to the dump and dump_live_files command"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9896
Reviewed By: ltamasi
Differential Revision: D35852667
Pulled By: jowlyzhang
fbshipit-source-id: a006456c881f468a92da689e895134762e9574e1
Summary:
This patch is the first part of adding blob dump support. It only adds blob dump support to the dump command. A follow up patch will add blob dump support to the dump_live_files command.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9881
Reviewed By: ltamasi
Differential Revision: D35796731
Pulled By: jowlyzhang
fbshipit-source-id: 2cc5973b222d505a331ac7b969edcf992b47c5ee
Summary:
This patch completes the first part of the task: "Extend all three commands so they can decode and print blob references if a new option --decode_blob_index is specified"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9870
Reviewed By: ltamasi
Differential Revision: D35753932
Pulled By: jowlyzhang
fbshipit-source-id: 9d2bbba0eef2ed86b982767eba9de1b4881f35c9
Summary:
`InitializeOptionsGeneral()` was overwriting many options that were already configured by OPTIONS file, potentially with the flag default values. This PR changes that function to only overwrite options in limited scenarios, as described at the top of its definition. Block cache is still a violation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9862
Test Plan: ran under various scenarios (multi-DB, single DB, OPTIONS file, flags) and verified options are set as expected
Reviewed By: jay-zhuang
Differential Revision: D35736960
Pulled By: ajkr
fbshipit-source-id: 75b77740af37e6f5741618f8a8f5685df2417d03
Summary:
Since they operate at distinct abstraction layers, I thought it
was prudent to combine with EncryptedEnv CI test for each PR, for efficiency
in testing. Also added supported compressions to sst_dump --help output
so that CI job can verify no compiled-in compression support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9850
Test Plan: CI, some manual stuff
Reviewed By: riversand963
Differential Revision: D35682346
Pulled By: pdillinger
fbshipit-source-id: be9879c1533fed304ee32c89fd9ba4b07c2b90cc
Summary:
This change only add decode blob index support to dump_live_files command, which is part of a task to add blob support to a few commands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9842
Reviewed By: ltamasi
Differential Revision: D35650167
Pulled By: jowlyzhang
fbshipit-source-id: a78151b98bc38ac6f52c6e01ca6927a3429ddd14
Summary:
Various renaming and fixes to get rid of remaining uses of
"backupable" which is terminology leftover from the original, flawed
design of BackupableDB. Now any DB can be backed up, using BackupEngine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D35334386
Pulled By: pdillinger
fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329
Summary:
There's an existing benchmark, "getmergeoperands", but it is unconventional in that it has multiple phases and hardcoded setup parameters.
This PR adds a different one, "readrandomoperands", that follows the pattern of other benchmarks of having a single phase and taking its configuration from existing flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9785
Test Plan:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -write_buffer_size=1048576 -max_bytes_for_level_base=4194304 -target_file_size_base=1048576 -compression_type=none -disable_auto_compactions=true
$ ./db_bench -use_existing_db=true -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -disable_auto_compactions=true -duration=10
...
readrandomoperands : 542.082 micros/op 1844 ops/sec; 0.2 MB/s (11980 of 18999 found)
```
Reviewed By: jay-zhuang
Differential Revision: D35290412
Pulled By: ajkr
fbshipit-source-id: fb367ca614b128cef844a75f0e5d9dd7c3328d85
Summary:
Fixes a bug introduced by me in https://github.com/facebook/rocksdb/pull/9733
That PR added a counter so that the per-thread seeds in ThreadState would
be unique even when --benchmarks had more than one test. But it incorrectly
used this counter as the value for ThreadState::tid as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9757
Test Plan:
Confirm that unexpectedly good QPS results on the regression tests return
to normal with this fix. I have confirmed that the QPS increase starts with
the PR 9733 diff.
Reviewed By: jay-zhuang
Differential Revision: D35149303
Pulled By: mdcallag
fbshipit-source-id: dee5cc36b7faaba6c3be6d6a253d3c2eaad72864
Summary:
This is for https://github.com/facebook/rocksdb/issues/9737
I have wasted more than a few hours running db_bench benchmarks where --seed was not set
and getting better than expected results because cache hit rates are great because
multiple invocations of db_bench used the same value for --seed or did not set it,
and then all used 0. The result is that all see the same sequence of keys.
Others have done the same. The problem is worse in that it is easy to miss and the result is a benchmark with results that are misleading.
A good way to avoid this is to set it to the equivalent of gettimeofday() when either
--seed is not set or it is set to 0 (the default).
With this change the actual seed is printed when it was 0 at process start:
Set seed to 1647992570365606 because --seed was 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9740
Test Plan:
Perf results:
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000
readrandom : 6.469 micros/op 154583 ops/sec; 17.1 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=0
readrandom : 6.565 micros/op 152321 ops/sec; 16.9 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=1
readrandom : 6.461 micros/op 154777 ops/sec; 17.1 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=2
readrandom : 6.525 micros/op 153244 ops/sec; 17.0 MB/s (4000000 of 4000000 found)
Reviewed By: jay-zhuang
Differential Revision: D35145361
Pulled By: mdcallag
fbshipit-source-id: 2b35b153ccec46b27d7c9405997523555fc51267
Summary:
This adds the --slow_usecs option with a default value of 1M. Operations that
take this much time have a message printed when --histogram=1, --stats_interval=0
and --stats_interval_seconds=0. The current code hardwired this to 20,000 usecs
and for some stress tests that reduced throughput by 20% or more.
This is for https://github.com/facebook/rocksdb/issues/9620
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9732
Test Plan:
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100 --histogram=1
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100000 --histogram=1
Reviewed By: jay-zhuang
Differential Revision: D35121522
Pulled By: mdcallag
fbshipit-source-id: daf27f937efd748980545d6395db332712fc078b
Summary:
db_bench quietly parses and ignores bad values for --compaction_fadvice and --value_size_distribution_type
I prefer that it fail for them as it does for bad option values in most other cases. Otherwise a benchmark
result will be provided for the wrong configuration and the result will be misleading.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9741
Test Plan:
These now fail:
./db_bench --compaction_fadvice=noney
Unknown compaction fadvice:noney
./db_bench --value_size_distribution_type=norma
Cannot parse distribution type 'norma'
While correct values continue to work:
./db_bench --value_size_distribution_type=normal
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
./db_bench --compaction_fadvice=none
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Reviewed By: siying
Differential Revision: D35115973
Pulled By: mdcallag
fbshipit-source-id: c2b10de5c2d1ea7c7539e676f5bd556351f5d370
Summary:
When --benchmarks has more than one test then the threads in one benchmark
will use the same set of seeds as the threads in the previous benchmark.
This diff fixe that.
This fixes https://github.com/facebook/rocksdb/issues/9632
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9733
Test Plan:
For this command line the block cache is 8GB, so it caches at most 1024 8KB blocks. Note that without
this diff the second run of readrandom has a much better response time because seed reuse means the
second run reads the same 1000 blocks as the first run and they are cached at that point. But with
this diff that does not happen.
./db_bench --benchmarks=fillseq,flush,compact0,waitforcompaction,levelstats,readrandom,readrandom --compression_type=zlib --num=10000000 --reads=1000 --block_size=8192
...
```
Level Files Size(MB)
--------------------
0 0 0
1 11 238
2 9 253
3 0 0
4 0 0
5 0 0
6 0 0
```
--- perf results without this diff
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 46.212 micros/op 21618 ops/sec; 2.4 MB/s (1000 of 1000 found)
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 21.963 micros/op 45450 ops/sec; 5.0 MB/s (1000 of 1000 found)
--- perf results with this diff
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 47.213 micros/op 21126 ops/sec; 2.3 MB/s (1000 of 1000 found)
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 42.880 micros/op 23299 ops/sec; 2.6 MB/s (1000 of 1000 found)
Reviewed By: jay-zhuang
Differential Revision: D35089763
Pulled By: mdcallag
fbshipit-source-id: 1b50143a07afe876b8c8e5fa50dd94a8ce57fc6b
Summary:
This changes db_bench to fail at startup for invalid compression types. It had been
changing them to Snappy. For other invalid options it fails at startup.
This is for https://github.com/facebook/rocksdb/issues/9621
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9729
Test Plan:
This continues to work:
./db_bench --benchmarks=fillrandom --compression_type=lz4
This now fails rather than changing the compression type to Snappy
./db_bench --benchmarks=fillrandom --compression_type=lz44
Cannot parse compression type 'lz44'
Reviewed By: jay-zhuang
Differential Revision: D35081323
Pulled By: mdcallag
fbshipit-source-id: 9b38c835abddce11aa7feb235df63f53cf829981
Summary:
…in order
This fixes https://github.com/facebook/rocksdb/issues/9650
For db_bench --benchmarks=fillseq --num_multi_db=X it loads databases in sequence
rather than randomly choosing a database per Put. The benefits are:
1) avoids long delays between flushing memtables
2) avoids flushing memtables for all of them at the same point in time
3) puts same number of keys per database so that query tests will find keys as expected
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9713
Test Plan:
Using db_bench.1 without the change and db_bench.2 with the change:
for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq --num_multi_db=4 --num=10000000; du -hs /data/m/rx ; done
--- without the change
fillseq : 3.188 micros/op 313682 ops/sec; 34.7 MB/s
real 2m7.787s
user 1m52.776s
sys 0m46.549s
2.7G /data/m/rx
--- with the change
fillseq : 3.149 micros/op 317563 ops/sec; 35.1 MB/s
real 2m6.196s
user 1m51.482s
sys 0m46.003s
2.7G /data/m/rx
Also, temporarily added a printf to confirm that the code switches to the next database at the right time
ZZ switch to db 1 at 10000000
ZZ switch to db 2 at 20000000
ZZ switch to db 3 at 30000000
for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq,readrandom --num_multi_db=4 --num=100000; du -hs /data/m/rx ; done
--- without the change, smaller database, note that not all keys are found by readrandom because databases have < and > --num keys
fillseq : 3.176 micros/op 314805 ops/sec; 34.8 MB/s
readrandom : 1.913 micros/op 522616 ops/sec; 57.7 MB/s (99873 of 100000 found)
--- with the change, smaller database, note that all keys are found by readrandom
fillseq : 3.110 micros/op 321566 ops/sec; 35.6 MB/s
readrandom : 1.714 micros/op 583257 ops/sec; 64.5 MB/s (100000 of 100000 found)
Reviewed By: jay-zhuang
Differential Revision: D35030168
Pulled By: mdcallag
fbshipit-source-id: 2a18c4ec571d954cf5a57b00a11802a3608823ee
Summary:
Changes:
* improves monitoring by displaying average size of a Put value and average scan length
* forces the minimum value size to be 10. Before this it was 0 if you didn't set the distribution parameters.
* uses reasonable defaults for the distribution parameters that determine value size and scan length
* includes seeks in "reads ... found" message, before this they were missing
This is for https://github.com/facebook/rocksdb/issues/9672
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9711
Test Plan:
Before this change:
./db_bench --benchmarks=fillseq,mixgraph --mix_get_ratio=50 --mix_put_ratio=25 --mix_seek_ratio=25 --num=100000 --value_k=0.2615 --value_sigma=25.45 --iter_k=2.517 --iter_sigma=14.236
fillseq : 4.289 micros/op 233138 ops/sec; 25.8 MB/s
mixgraph : 18.461 micros/op 54166 ops/sec; 755.0 MB/s ( Gets:50164 Puts:24919 Seek:24917 of 50164 in 75081 found)
After this change:
./db_bench --benchmarks=fillseq,mixgraph --mix_get_ratio=50 --mix_put_ratio=25 --mix_seek_ratio=25 --num=100000 --value_k=0.2615 --value_sigma=25.45 --iter_k=2.517 --iter_sigma=14.236
fillseq : 3.974 micros/op 251553 ops/sec; 27.8 MB/s
mixgraph : 16.722 micros/op 59795 ops/sec; 833.5 MB/s ( Gets:50164 Puts:24919 Seek:24917, reads 75081 in 75081 found, avg size: 36.0 value, 504.9 scan)
Reviewed By: jay-zhuang
Differential Revision: D35030190
Pulled By: mdcallag
fbshipit-source-id: d8f555f28d869f752ddb674a524108884511b151
Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. In part because of
modular code factoring, it doesn't seem easy to do this automagically,
where opening an SST file and observing current Temperature different
from expected would trigger a change in metadata and DB manifest write
(essentially giving the deep read path access to the write path). It is also
difficult to do this while the DB is open because of the limitations of
LogAndApply.
This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).
Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)
Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is used and want existing DB metadata updated
for it.
* It's possible that for some repair scenarios, lighter weight than
full repair, we might want to support UpdateManifestForFilesState() to
modify critical file details like size or checksum using same
algorithm. But let's make sure these are differentiated from modifying
file details in ways that don't suspect corruption (or require extreme
trust).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683
Test Plan: unit tests added
Reviewed By: jay-zhuang
Differential Revision: D34798828
Pulled By: pdillinger
fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".
To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.
Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660
Test Plan:
related unit test majorly updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.
Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.
Reviewed By: ajkr
Differential Revision: D34686968
Pulled By: pdillinger
fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
Summary:
BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454
Test Plan: Run all existing tests.
Reviewed By: ajkr
Differential Revision: D33805827
fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d
Summary:
**Summary:**
RocksDB uses a block cache to reduce IO and make queries more efficient. The block cache is based on the LRU algorithm (LRUCache) and keeps objects containing uncompressed data, such as Block, ParsedFullFilterBlock etc. It allows the user to configure a second level cache (rocksdb::SecondaryCache) to extend the primary block cache by holding items evicted from it. Some of the major RocksDB users, like MyRocks, use direct IO and would like to use a primary block cache for uncompressed data and a secondary cache for compressed data. The latter allows us to mitigate the loss of the Linux page cache due to direct IO.
This PR includes a concrete implementation of rocksdb::SecondaryCache that integrates with compression libraries such as LZ4 and implements an LRU cache to hold compressed blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9518
Test Plan:
In this PR, the lru_secondary_cache_test.cc includes the following tests:
1. The unit tests for the secondary cache with either compression or no compression, such as basic tests, fails tests.
2. The integration tests with both primary cache and this secondary cache .
**Follow Up:**
1. Statistics (e.g. compression ratio) will be added in another PR.
2. Once this implementation is ready, I will do some shadow testing and benchmarking with UDB to measure the impact.
Reviewed By: anand1976
Differential Revision: D34430930
Pulled By: gitbw95
fbshipit-source-id: 218d78b672a2f914856d8a90ff32f2f5b5043ded
Summary:
When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs.
Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556
Reviewed By: anand1976
Differential Revision: D34308216
Pulled By: sidroyc
fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
Opening DB as seconeary instance has been supported in ldb but it is not mentioned in --help. Mention it there. The part of the help message after the modification:
```
commands MUST specify --db=<full_path_to_db_directory> when necessary
commands can optionally specify
--env_uri=<uri_of_environment> or --fs_uri=<uri_of_filesystem> if necessary
--secondary_path=<secondary_path> to open DB as secondary instance. Operations not supported in secondary instance will fail.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9582
Test Plan: Build and run ldb --help
Reviewed By: riversand963
Differential Revision: D34286427
fbshipit-source-id: e56c5290d0548098ab6acc6dde2167f5a64f34f3
Summary:
I did another pass through running CI jobs. It is uncommon now to see
`db_stress` stuck in the setup phase but still happen.
One reason was repeatedly reading/verifying checksum on filter blocks when
`-cache_index_and_filter_blocks=1` and `-cache_size=1048576`. To address
that I increased the cache size.
Another reason was having a WAL with many range tombstones and every
`db_stress` run using `-avoid_flush_during_recovery=1` (in that
scenario, the setup phase spent too much CPU in
`rocksdb::MemTable::NewRangeTombstoneIteratorInternal()`). To address
that I fixed the `-avoid_flush_during_recovery` setting so it is
reevaluated for every `db_stress` run.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9483
Reviewed By: riversand963
Differential Revision: D33922929
Pulled By: ajkr
fbshipit-source-id: 0a298ec7c4df6f6b44620233996047a2dc7ee5f3
Summary:
This change removes the ability to configure the deprecated,
inefficient block-based filter in the public API. Options that would
have enabled it now use "full" (and optionally partitioned) filters.
Existing block-based filters can still be read and used, and a "back
door" way to build them still exists, for testing and in case of trouble.
About the only way this removal would cause an issue for users is if
temporary memory for filter construction greatly increases. In
HISTORY.md we suggest a few possible mitigations: partitioned filters,
smaller SST files, or setting reserve_table_builder_memory=true.
Or users who have customized a FilterPolicy using the
CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade
their code. (It's long past time for people to move to the new
builder/reader customization interface.)
This change also introduces some internal-use-only configuration strings
for testing specific filter implementations while bypassing some
compatibility / intelligence logic. This is intended to hint at a path
toward making FilterPolicy Customizable, but it also gives us a "back
door" way to configure block-based filter.
Aside: updated db_bench so that -readonly implies -use_existing_db
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535
Test Plan:
Unit tests updated. Specifically,
* BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back
door configuration interface and ignoring of `use_block_based_builder`.
* BlockBasedTableTest.TracingGetTest is migrated from testing
block-based filter access pattern to full filter access patter, by
re-ordering some things.
* Options test (pretty self-explanatory)
Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30`
Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB
With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB
No consistent difference with fillrandom
Reviewed By: jay-zhuang
Differential Revision: D34153871
Pulled By: pdillinger
fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788508
Pulled By: akankshamahajan15
fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
Summary:
After https://github.com/facebook/rocksdb/issues/9481, we are using newer default compiler for
build-format-compatible CircleCI nightly job, which fails on building
2.2.fb.branch branch because it tries to use a pre-compiled libsnappy.a
that is checked into the repo (!). This works around that by setting
SNAPPY_LDFLAGS=-lsnappy, which is only understood by such old versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9517
Test Plan:
Run check_format_compatible.sh on Ubuntu 20 AWS machine,
watch nightly run
Reviewed By: hx235
Differential Revision: D34055561
Pulled By: pdillinger
fbshipit-source-id: 45f9d428dd082f026773bfa8d9dd4dad66fc9378
Summary:
The patch does some cleanup in and around `VersionStorageInfo`:
* Renames the method `PrepareApply` to `PrepareAppend` in `Version`
to make it clear that it is to be called before appending the `Version` to
`VersionSet` (via `AppendVersion`), not before applying any `VersionEdit`s.
* Introduces a helper method `VersionStorageInfo::PrepareForVersionAppend`
(called by `Version::PrepareAppend`) that encapsulates the population of the
various derived data structures in `VersionStorageInfo`, and turns the
methods computing the derived structures (`UpdateNumNonEmptyLevels`,
`CalculateBaseBytes` etc.) into private helpers.
* Changes `Version::PrepareAppend` so it only calls `UpdateAccumulatedStats`
if the `update_stats` flag is set. (Earlier, this was checked by the callee.)
Related to this, it also moves the call to `ComputeCompensatedSizes` to
`VersionStorageInfo::PrepareForVersionAppend`.
* Updates and cleans up `version_builder_test`, `version_set_test`, and
`compaction_picker_test` so `PrepareForVersionAppend` is called anytime
a new `VersionStorageInfo` is set up or saved. This cleanup also involves
splitting `VersionStorageInfoTest.MaxBytesForLevelDynamic`
into multiple smaller test cases.
* Fixes up a bunch of comments that were outdated or just plain incorrect.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9494
Test Plan: Ran `make check` and the crash test script for a while.
Reviewed By: riversand963
Differential Revision: D33971666
Pulled By: ltamasi
fbshipit-source-id: fda52faac7783041126e4f8dec0fe01bdcadf65a
Summary:
Added a CountedFileSystem that tracks a number of file operations (opens, closes, deletes, renames, flushes, syncs, fsyncs, reads, writes). This class was based on the ReportFileOpEnv from db_bench.
This is a stepping stone PR to be able to change the SpecialEnv into a SpecialFileSystem, where several of the file varieties wish to do operation counting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9283
Reviewed By: pdillinger
Differential Revision: D33062004
Pulled By: mrambacher
fbshipit-source-id: d0d297a7fb9c48c06cbf685e5fa755c27193b6f5
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum
This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.
Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.
**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
- See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
- Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342
Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
- For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
- FastLocalBloom
- Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
- After change:
- `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
- Standard128Ribbon
- Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
- After change:
- `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.
Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
Summary:
Apparently setting total_order_seek=true for DB::Get was
intended to allow accurate read semantics if the current prefix
extractor doesn't match what was used to generate SST files on
disk. But since prefix_extractor was made a mutable option in 5.14.0, we
have been able to detect this case and provide the correct semantics
regardless of the total_order_seek option. Since that time, the option
has only made Get() slower in a reasonably common case: prefix_extractor
unchanged and whole_key_filtering=false.
So this change primarily removes unnecessary effect of
total_order_seek on Get. Also cleans up some related comments.
Also adds a -total_order_seek option to db_bench and canonicalizes
handling of ReadOptions in db_bench so that command line options have
the expected association with library features. (There is potential
for change in regression test behavior, but the old behavior is likely
indefensible, or some other inconsistency would need to be fixed.)
TODO in follow-up work: there should be no reason for Get() to depend on
current prefix extractor at all.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427
Test Plan:
Unit tests updated.
Performance (using db_bench update)
Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`
Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Before this change, total_order_seek=false: 25188 ops/sec
Before this change, total_order_seek=true: 1222 ops/sec (~20x slower)
After this change, total_order_seek=false: 24570 ops/sec
After this change, total_order_seek=true: 25012 ops/sec (indistinguishable)
Reviewed By: siying
Differential Revision: D33753458
Pulled By: pdillinger
fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
Summary:
Despite attempts to optimize `db_stress` setup phase (i.e.,
pre-`OperateDb()`) latency in https://github.com/facebook/rocksdb/issues/9470 and https://github.com/facebook/rocksdb/issues/9475, it still always took tens
of seconds. Since we still aren't able to setup a 100M key `db_stress`
quickly, we should reduce the number of keys. This PR reduces it 4x
while increasing `value_size_mult` 4x (from its default value of 8) so
that memtables and SST files fill at a similar rate compared to before this PR.
Also disabled bzip2 compression since we'll probably never use it and
I noticed many CI runs spending majority of CPU on bzip2 decompression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9476
Reviewed By: siying
Differential Revision: D33898520
Pulled By: ajkr
fbshipit-source-id: 855021784ad9664f2be5bce21f0339a1cf93230d
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455
Test Plan: Rely on my eyeball and CI
Reviewed By: ajkr
Differential Revision: D33811664
Pulled By: hx235
fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
Summary:
In RocksDB DBOptions::skip_log_error_on_recovery is marked as
"NOT SUPPORTED" for a long time, and setting this option does not have
any effect on the behavior of RocksDB library. Therefore, we are removing it
in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9434
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33763015
Pulled By: akankshamahajan15
fbshipit-source-id: 11f09643298da6c02d3dcdb090b996f4c3cfdd76
Summary:
Even after https://github.com/facebook/rocksdb/issues/9461 could see
```
Error: please specify prefix_size for test_batches_snapshots test!
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9463
Test Plan:
run `make blackbox_crashtest` for a long time. (Unfortunately,
it's taking a long time to reproduce these failures)
Reviewed By: akankshamahajan15
Differential Revision: D33838152
Pulled By: pdillinger
fbshipit-source-id: b9a73c5bbb68df53f14c22b9b52f61d1f7ef38af
Summary:
The API is deprecated long time ago. Clean up the codebase by
removing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462
Test Plan: CI, fake release: D33835220
Reviewed By: riversand963
Differential Revision: D33835103
Pulled By: jay-zhuang
fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
Summary:
Changes in https://github.com/facebook/rocksdb/issues/9453 could trigger
```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9461
Test Plan: run `make blackbox_crashtest` for a long time
Reviewed By: ajkr
Differential Revision: D33830751
Pulled By: pdillinger
fbshipit-source-id: be88377dcaa47e4bb7adb0347762639eff8f1476
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.
In reference to https://github.com/facebook/rocksdb/issues/9389
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33780269
Pulled By: pdillinger
fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.
This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)
Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453
Test Plan:
new unit test added. Added memtable Bloom to crash test.
Performance testing
--------------------
Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```
Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)
Before & after run simultaneously, average over 10 iterations per data point, ops/sec.
MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066
MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990
MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before: 12081715
After: 12117603
MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591
Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.
Reviewed By: ajkr
Differential Revision: D33805025
Pulled By: pdillinger
fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452
Test Plan: Rely on my eyeball and CI
Reviewed By: ajkr
Differential Revision: D33804938
Pulled By: hx235
fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
Summary:
Add an option to set the WAL compression algorithm - wal_compression.
TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432
Reviewed By: pdillinger
Differential Revision: D33797275
Pulled By: sidroyc
fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
Summary:
This PR is one proposal to resolve https://github.com/facebook/rocksdb/issues/9382.
Looking at the code, I can't think of a reason why rdb is an internal component of RocksDB: it does not require
any header files NOT in `include/rocksdb`. It's a better idea to host it somewhere else.
Plus, rdb requires python2 which is not supported any more. No fixes or improvements will be made, even for potential
security bugs (https://www.python.org/doc/sunset-python-2/).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9399
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D33641965
Pulled By: riversand963
fbshipit-source-id: 2a6a74693e5de36834f355e41d6865db206af48b
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.
Java/JNI is not included yet, and needs to be done later if necessary.
The goal is to include this commit in RocksDB 7.0 release.
Reference:
https://github.com/ajkr/dedupfs by ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170
Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.
make check
Reviewed By: ajkr
Differential Revision: D33751662
Pulled By: riversand963
fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.
Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363
Test Plan:
basic coverage in unit tests
Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see
force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s
Reviewed By: mrambacher
Differential Revision: D33636559
Pulled By: pdillinger
fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369
Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all
Reviewed By: pdillinger
Differential Revision: D33517260
Pulled By: riversand963
fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (https://github.com/facebook/rocksdb/issues/8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.
To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.
Depends on https://github.com/facebook/rocksdb/issues/9305.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9338
Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.
Reviewed By: riversand963
Differential Revision: D33345333
Pulled By: ajkr
fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293
Reviewed By: pdillinger, zhichao-cao
Differential Revision: D33181591
Pulled By: mrambacher
fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
Summary:
Several improvements to SimulatedHybridFileSystem:
(1) Allow a mode where all I/Os to all files simulate HDD. This can be enabled in db_bench using -simulate_hdd
(2) Latency calculation is slightly more accurate
(3) Allow to simulate more than one HDD spindles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9301
Test Plan: Run db_bench and observe the results are reasonable.
Reviewed By: jay-zhuang
Differential Revision: D33141662
fbshipit-source-id: b736e58c4ba910d06899cc9ccec79b628275f4fa
Summary:
db_crashtest.py uses multiple CFs only when run without flag `--simple`.
The previous config set `-test_batches_snapshots=1` in that case for
blackbox mode. But `-test_batches_snapshots=1` cannot verify recovery
correctness, so it should not always be set for multi-CF blackbox tests.
We can instead randomly toggle it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9303
Reviewed By: riversand963
Differential Revision: D33155229
Pulled By: ajkr
fbshipit-source-id: 4a6fdc4eddccc8ece664063baf6393ce1c5de6b7
Summary:
SimulatedHybridFileSystem now takes a more thorough simualtion of an HDD:
1. cover writes too, not just read
2. Latency and throughput is now simulated as seek + read time, using a rate limiter
This implementation can be modified to simulate full HDD behavior, which is not yet done.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9259
Test Plan: Run db_bench and observe the desired behavior.
Reviewed By: jay-zhuang
Differential Revision: D32903039
fbshipit-source-id: a83f5d72143e114d5e75edf39d647bf0b71978e1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
When table_options.prepopulate_block_cache is set to
BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and
table_options.partition_filters is also set true, then there is
segmentation failure when top level filter is fetched because its
entered with wrong type in cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263
Test Plan:
Updated unit tests;
Ran db_stress: make crash_test -j32
Reviewed By: pdillinger
Differential Revision: D32936566
Pulled By: akankshamahajan15
fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426
Summary:
Fix a bug that causes file temperature not preserved after DB is restarted, or options.max_manifest_file_size is hit.
Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242
Test Plan: Add a unit test that would fail without the fix.
Reviewed By: jay-zhuang
Differential Revision: D32818150
fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec
Summary:
Disable the QPS verification in test temporally, which causes the test failure due to different system delays.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9190
Test Plan: make check
Reviewed By: siying
Differential Revision: D32576289
Pulled By: zhichao-cao
fbshipit-source-id: 1df972e77dd82eed5af3462e5db5e141aadf8fae
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
Fix the analyzer test failure caused by inaccurate timing wait. The wait time at different system might be different or cause the delay, now we do not accurately count the lines. Only in a very rare extreme case, test will ignore the part exceed the timing of 1 second.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9181
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D32511319
Pulled By: zhichao-cao
fbshipit-source-id: e694c8cb465c750cfa5a43dab3eff6707b9a11c8
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.
This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.
1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056
Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.
Reviewed By: anand1976
Differential Revision: D31773640
Pulled By: akankshamahajan15
fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
Reviewed By: riversand963
Differential Revision: D32007222
Pulled By: anand1976
fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792