Commit graph

393 commits

Author SHA1 Message Date
anand76 3362a730dc Avoid usage of ReopenWritableFile in db_stress (#9649)
Summary:
The UniqueIdVerifier constructor currently calls ReopenWritableFile on
the FileSystem, which might not be supported. Instead of relying on
reopening the unique IDs file for writing, create a new file and copy
the original contents.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9649

Test Plan: Run db_stress

Reviewed By: pdillinger

Differential Revision: D34572307

Pulled By: anand1976

fbshipit-source-id: 3a777908582d79dae57488d4278bad126774f698
2022-03-04 10:30:10 -08:00
Jay Zhuang d3a2f284d9 Add Temperature info in NewSequentialFile() (#9499)
Summary:
Add Temperature hints information from RocksDB in API
`NewSequentialFile()`. backup and checkpoint operations need to open the
source files with `NewSequentialFile()`, which will have the temperature
hints. Other operations are not covered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499

Test Plan: Added unittest

Reviewed By: pdillinger

Differential Revision: D34006115

Pulled By: jay-zhuang

fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f
2022-02-18 18:23:07 -08:00
Andrew Kryczka babe56ddba Add rate limiter priority to ReadOptions (#9424)
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.

There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424

Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
  - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
  - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`

Reviewed By: hx235

Differential Revision: D33747386

Pulled By: ajkr

fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
2022-02-16 23:18:14 -08:00
Yanqin Jin 685044dff2 Remove timestamp from key in expected state (#9525)
Summary:
The keys as part of write batch read from trace file can contain trailing timestamps.
This PR removes them before calling `ExpectedState`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9525

Test Plan:
make check
make crash_test_with_ts

Reviewed By: ajkr

Differential Revision: D34082358

Pulled By: riversand963

fbshipit-source-id: 78c925659e2a19e4a8278fb4a8ddf5070e265c04
2022-02-09 09:50:54 -08:00
Akanksha Mahajan 9745c68eb1 Remove deprecated option new_table_reader_for_compaction_inputs (#9443)
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443

Test Plan: CircleCI

Reviewed By: ajkr

Differential Revision: D33788508

Pulled By: akankshamahajan15

fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
2022-02-08 19:31:28 -08:00
satyajanga 036bbab6f7 Use the comparator from the sst file table properties in sst_dump_tool (#9491)
Summary:
We introduced a new Comparator for timestamp in user keys. In the sst_dump_tool by default we use BytewiseComparator to read sst files. This change allows us to read comparator_name from table properties in meta data block and use it to read.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9491

Test Plan:
added unittests for new functionality.
make check
![image](https://user-images.githubusercontent.com/4923556/152915444-28b88a1f-7b4e-47d0-815f-7011552bd9a2.png)
![image](https://user-images.githubusercontent.com/4923556/152916196-bea3d2a1-a3d5-4362-b911-036131b83e8d.png)

Reviewed By: riversand963

Differential Revision: D33993614

Pulled By: satyajanga

fbshipit-source-id: 4b5cf938e6d2cb3931d763bef5baccc900b8c536
2022-02-08 12:15:35 -08:00
Yanqin Jin 629e3e1d77 Fix spelling in public API (#9490)
Summary:
I feel it would be nice if we can fix this spelling error.

In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490

Test Plan: make check

Reviewed By: hx235

Differential Revision: D33949862

Pulled By: riversand963

fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466
2022-02-03 15:15:23 -08:00
Yanqin Jin 3122cb4358 Revise APIs related to user-defined timestamp (#8946)
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".

According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).

For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.

Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.

The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.

Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946

Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0

Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```

Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.831 micros/op 546235 ops/sec;   60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.820 micros/op 549404 ops/sec;   60.8 MB/s
```

Reviewed By: ltamasi

Differential Revision: D33721359

Pulled By: riversand963

fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
2022-02-01 22:19:01 -08:00
Hui Xiao 920386f2b7 Detect (new) Bloom/Ribbon Filter construction corruption (#9342)
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.

Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
   - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
   -  Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
   - For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true  -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
   -  FastLocalBloom
       - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
       - After change:
          -  `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
    -  Standard128Ribbon
       - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
       - After change:
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.

Reviewed By: pdillinger

Differential Revision: D33746928

Pulled By: hx235

fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
2022-02-01 17:42:35 -08:00
Levi Tamasi 7cd5763274 Fix a copy-paste bug related to background threads in db_stress (#9485)
Summary:
Fixes a typo introduced in https://github.com/facebook/rocksdb/pull/9466.

Fixes https://github.com/facebook/rocksdb/issues/9482

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9485

Test Plan:
```
COMPILE_WITH_TSAN=1 make db_stress -j24
./db_stress --ops_per_thread=1000 --reopen=5
```

Reviewed By: ajkr

Differential Revision: D33928601

Pulled By: ltamasi

fbshipit-source-id: 3e01a0ca5fffb56c268c811cbe045413b225059a
2022-02-01 15:56:17 -08:00
Andrew Kryczka ed75dddc35 Optimize db_stress setup phase (#9475)
Summary:
It is too slow that our `db_crashtest.py` often kills `db_stress` before
the setup phase completes. Profiled it and found a few ways to optimize.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9475

Test Plan:
Measured setup phase time reduced 22% (36 -> 28 seconds) for first run, and
36% (38 -> 24 seconds) for non-first run on empty-ish DB.

- first run benchmark command: `rm -rf /dev/shm/dbstress*/ && mkdir -p /dev/shm/dbstress_expected/ && ./db_stress -max_key=100000000 -destroy_db_initially=1 -expected_values_dir=/dev/shm/dbstress_expected/ -db=/dev/shm/dbstress/ --clear_column_family_one_in=0 --reopen=0 --nooverwritepercent=1`

output before this PR:

```
2022/01/31-11:14:05  Initializing db_stress
...
2022/01/31-11:14:41  Starting database operations
```

output after this PR:

```
...
2022/01/31-11:12:23  Initializing db_stress
...
2022/01/31-11:12:51  Starting database operations
```

- non-first run benchmark command: `./db_stress -max_key=100000000 -destroy_db_initially=0 -expected_values_dir=/dev/shm/dbstress_expected/ -db=/dev/shm/dbstress/ --clear_column_family_one_in=0 --reopen=0 --nooverwritepercent=1`

output before this PR:

```
2022/01/31-11:20:45  Initializing db_stress
...
2022/01/31-11:21:23  Starting database operations
```

output after this PR:

```
2022/01/31-11:22:02  Initializing db_stress
...
2022/01/31-11:22:26  Starting database operations
```

- ran minified crash test a while: `DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --interval=10 --max_key=1000000 --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --value_size_mult=33`

Reviewed By: anand1976

Differential Revision: D33897793

Pulled By: ajkr

fbshipit-source-id: 0d7b2c93e1e2a9f8a878e87632c2455406313087
2022-02-01 11:47:28 -08:00
Andrew Kryczka c7ce03dce1 db_stress begin tracking expected state after verification (#9470)
Summary:
Previously we enabled tracking expected state changes during
`FinishInitDb()`, as soon as the DB was opened. This meant tracing was
enabled during `VerifyDb()`. This cost extra CPU by requiring
`DBImpl::trace_mutex_` to be acquired on each read operation. It was
unnecessary since we know there are no expected state changes during the
`VerifyDb()` phase. So, this PR delays tracking expected state changes
until after the `VerifyDb()` phase has completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9470

Test Plan:
Measured this PR reduced `VerifyDb()` 76% (387 -> 92 seconds) with
`-disable_wal=1` (i.e., expected state tracking enabled).

- benchmark command: `./db_stress -max_key=100000000 -ops_per_thread=1 -destroy_db_initially=1 -expected_values_dir=/dev/shm/dbstress_expected/ -db=/dev/shm/dbstress/ --clear_column_family_one_in=0 --disable_wal=1 --reopen=0`
- without this PR, `VerifyDb()` takes 387 seconds:

```
2022/01/30-21:43:04  Initializing worker threads
Crash-recovery verification passed :)
2022/01/30-21:49:31  Starting database operations
```

- with this PR, `VerifyDb()` takes 92 seconds

```
2022/01/30-21:59:06  Initializing worker threads
Crash-recovery verification passed :)
2022/01/30-22:00:38  Starting database operations
```

Reviewed By: riversand963

Differential Revision: D33884596

Pulled By: ajkr

fbshipit-source-id: 5f259de8087de5b0531f088e11297f37ed2f7685
2022-01-31 13:35:32 -08:00
Levi Tamasi f07c56928f Set the number of threads up front in db_stress (#9466)
Summary:
With the code on main, `RunStressTest` increments the number of threads
one by one as the threads are created and started. This results in a
data race with `NonBatchedOpsStressTest::VerifyDb`, which reads this
value without synchronization, and is also not correct in the sense
that `VerifyDb` assumes that the number of threads already has its final
value set (e.g. it's checking whether the current thread is the last
one). The patch fixes this by setting the number of threads before
creating/starting any threads. This also eliminates the need for locking
the mutex during thread startup.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9466

Test Plan: Ran the blackbox crash test under TSAN for a while.

Reviewed By: ajkr

Differential Revision: D33858856

Pulled By: ltamasi

fbshipit-source-id: 8a6515a83fd1808b8b8dca61978777c4404f04cc
2022-01-29 10:45:41 -08:00
Peter Dillinger 78aee6fedc Remove obsolete backupable_db.h, utility_db.h (#9438)
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.

In reference to https://github.com/facebook/rocksdb/issues/9389

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D33780269

Pulled By: pdillinger

fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
2022-01-27 15:45:30 -08:00
Hui Xiao 1e0e883ca5 Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33804938

Pulled By: hx235

fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
2022-01-27 13:01:09 -08:00
Aravind Ramesh 2eac6bb120 db_stress: db_stress fails on custom filesystems. (#9352)
Summary:
db_stress listener service always uses default filesystem to operate,
causing it to not recognize custom filesystem (like ZenFS plugin FS).
Pass the env to db_stress listener with the correct filesystem
information, so it can open the user intended filesystem.

Signed-off-by: Aravind Ramesh <Aravind.Ramesh@wdc.com>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9352

Reviewed By: riversand963

Differential Revision: D33776762

Pulled By: pdillinger

fbshipit-source-id: e79bb9a544384f80ae9dd0108241ab9c83223954
2022-01-25 16:22:58 -08:00
Yanqin Jin 50135c1bf3 Move HDFS support to separate repo (#9170)
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.

Java/JNI is not included yet, and needs to be done later if necessary.

The goal is to include this commit in RocksDB 7.0 release.

Reference:
https://github.com/ajkr/dedupfs by ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170

Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.

make check

Reviewed By: ajkr

Differential Revision: D33751662

Pulled By: riversand963

fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
2022-01-24 20:23:54 -08:00
Andrew Kryczka 6892f19b11 Test correctness with WAL disabled in non-txn blackbox crash tests (#9338)
Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (https://github.com/facebook/rocksdb/issues/8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.

To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.

Depends on https://github.com/facebook/rocksdb/issues/9305.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9338

Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.

Reviewed By: riversand963

Differential Revision: D33345333

Pulled By: ajkr

fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
2022-01-05 16:23:37 -08:00
mrambacher fe31dc53ca Make the Env class Customizable (#9293)
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293

Reviewed By: pdillinger, zhichao-cao

Differential Revision: D33181591

Pulled By: mrambacher

fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
2022-01-04 16:45:49 -08:00
Andrew Kryczka aa2b3bf675 Added TraceOptions::preserve_write_order (#9334)
Summary:
This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB.

By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334

Test Plan:
- See if below unsynced data loss crash test can run  for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering.

```
DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400
```

Reviewed By: zhichao-cao

Differential Revision: D33301990

Pulled By: ajkr

fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3
2021-12-28 15:04:26 -08:00
Andrew Kryczka 2ee20a669d Extend trace filtering to more operation types (#9335)
Summary:
- Extended trace filtering to cover `MultiGet()`, `Seek()`, and `SeekForPrev()`. Now all user ops that can be traced support filtering.
- Enabled the new filter masks in `db_stress` since it only cares to trace writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9335

Test Plan:
- trace-heavy `db_stress` command reduced 30% elapsed time  (79.21 -> 55.47 seconds)

Benchmark command:
```
$ /usr/bin/time ./db_stress -ops_per_thread=100000 -sync_fault_injection=1 --db=/dev/shm/rocksdb_stress_db/ --expected_values_dir=/dev/shm/rocksdb_stress_expected/ --clear_column_family_one_in=0
```

- replay-heavy `db_stress` command reduced 12.4% elapsed time (23.69 -> 20.75 seconds)

Setup command:
```
$  ./db_stress -ops_per_thread=100000000 -sync_fault_injection=1 -db=/dev/shm/rocksdb_stress_db/ -expected_values_dir=/dev/shm/rocksdb_stress_expected --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
```

Benchmark command:
```
$ /usr/bin/time ./db_stress -ops_per_thread=1 -reopen=0 -expected_values_dir=/dev/shm/rocksdb_stress_expected/ -db=/dev/shm/rocksdb_stress_db/ --clear_column_family_one_in=0 --destroy_db_initially=0
```

Reviewed By: zhichao-cao

Differential Revision: D33304580

Pulled By: ajkr

fbshipit-source-id: 0df10f87c1fc506e9484b6b42cea2ef96c7ecd65
2021-12-28 11:46:30 -08:00
Andrew Kryczka dfff1cecff Filter Get()s from db_stress traces (#9315)
Summary:
`db_stress` traces are used for tracking unsynced changes. For that purpose, we
only need to track writes and not reads. Currently `TraceOptions` only
supports excluding `Get()`s from the trace, so this PR only excludes
`Get()`s. In the future it would be good to exclude `MultiGet()`s and
iterator operations too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9315

Test Plan:
- trace-heavy `db_stress` command elapsed time reduced 37%

Benchmark:
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_stress -ops_per_thread=100000 -sync_fault_injection=1 -expected_values_dir=/dev/shm/dbstress_expected --clear_column_family_one_in=0
```

- replay-heavy `db_stress` command elapsed time reduced 38%

Setup:
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_stress -ops_per_thread=100000000 -sync_fault_injection=1 -expected_values_dir=/dev/shm/dbstress_expected --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
```
Benchmark:
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_stress -ops_per_thread=1 -reopen=0 -expected_values_dir=/dev/shm/dbstress_expected --clear_column_family_one_in=0 --destroy_db_initially=0
```

Reviewed By: zhichao-cao

Differential Revision: D33229900

Pulled By: ajkr

fbshipit-source-id: 0e4251c674d236ddbc4548e9bbfdd608bf3cdc93
2021-12-22 14:17:45 -08:00
Andrew Kryczka 82670fb17b db_stress print hex key for MultiGet() inconsistency (#9324)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9324

Reviewed By: riversand963

Differential Revision: D33248178

Pulled By: ajkr

fbshipit-source-id: c8a7382ed613f9ac3a0a2e3fa7d3c6fe9c95ef85
2021-12-20 23:29:43 -08:00
Andrew Kryczka b448b71222 db_stress tolerate incomplete tail records in trace file (#9316)
Summary:
I saw the following error when running crash test for a while with
unsynced data loss:

```
Error restoring historical expected values: Corruption: Corrupted trace file.
```

The trace file turned out to have an incomplete tail record. This is
normal considering blackbox kills `db_stress` while trace can be
ongoing.

In the case where the trace file is not otherwise corrupted, there
should be enough records already seen to sync up the expected state with
the recovered DB. This PR ignores any `Status::Corruption` the
`Replayer` returns when that happens.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9316

Reviewed By: jay-zhuang

Differential Revision: D33230579

Pulled By: ajkr

fbshipit-source-id: 9814af4e39e57f00d85be7404363211762f9b41b
2021-12-20 13:08:49 -08:00
Andrew Kryczka 791723c1ec Fix race condition in db_stress thread setup (#9314)
Summary:
We need to grab `SharedState`'s mutex while calling `IncThreads()` or `IncBgThreads()`. Otherwise the newly launched threads can simultaneously access the thread counters to check if every thread has finished initializing.

Repro command:

```
$ rm -rf /dev/shm/rocksdb/rocksdb_crashtest_{whitebox,expected}/ && mkdir -p /dev/shm/rocksdb/rocksdb_crashtest_{whitebox,expected}/ && ./db_stress --acquire_snapshot_one_in=10000 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=131.8094496796033 --bottommost_compression_type=zlib --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_style=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --fail_if_options_file_error=1 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=22 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=20000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --subcompactions=2 --sync=0 --sync_fault_injection=False --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --test_cf_consistency=1 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_block_based_filter=1 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

TSAN error:

```
WARNING: ThreadSanitizer: data race (pid=2750142)
  Read of size 4 at 0x7ffc21d7f58c by thread T39 (mutexes: write M670895590377780496):
    #0 rocksdb::SharedState::AllInitialized() const db_stress_tool/db_stress_shared_state.h:204 (db_stress+0x4fd307)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::ThreadBody(void*) db_stress_tool/db_stress_driver.cc:26 (db_stress+0x4fd307)
    https://github.com/facebook/rocksdb/issues/2 StartThreadWrapper env/env_posix.cc:454 (db_stress+0x84472f)

  Previous write of size 4 at 0x7ffc21d7f58c by main thread:
    #0 rocksdb::SharedState::IncThreads() db_stress_tool/db_stress_shared_state.h:194 (db_stress+0x4fd779)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::RunStressTest(rocksdb::StressTest*) db_stress_tool/db_stress_driver.cc:78 (db_stress+0x4fd779)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::db_stress_tool(int, char**) db_stress_tool/db_stress_tool.cc:348 (db_stress+0x4b97dc)
    https://github.com/facebook/rocksdb/issues/3 main db_stress_tool/db_stress.cc:21 (db_stress+0x47a351)

  Location is stack of main thread.

  Location is global '<null>' at 0x000000000000 ([stack]+0x00000001d58c)

  Mutex M670895590377780496 is already destroyed.

  Thread T39 (tid=2750211, running) created by main thread at:
    #0 pthread_create /home/engshare/third-party2/gcc/9.x/src/gcc-10.x/libsanitizer/tsan/tsan_interceptors.cc:964 (libtsan.so.0+0x613c3)
    https://github.com/facebook/rocksdb/issues/1 StartThread env/env_posix.cc:464 (db_stress+0x8463c2)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::CompositeEnvWrapper::StartThread(void (*)(void*), void*) env/composite_env_wrapper.h:288 (db_stress+0x4bcd20)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::EnvWrapper::StartThread(void (*)(void*), void*) include/rocksdb/env.h:1475 (db_stress+0x4bb950)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::RunStressTest(rocksdb::StressTest*) db_stress_tool/db_stress_driver.cc:80 (db_stress+0x4fd9d2)
    https://github.com/facebook/rocksdb/issues/5 rocksdb::db_stress_tool(int, char**) db_stress_tool/db_stress_tool.cc:348 (db_stress+0x4b97dc)
    https://github.com/facebook/rocksdb/issues/6 main db_stress_tool/db_stress.cc:21 (db_stress+0x47a351)

 ThreadSanitizer: data race db_stress_tool/db_stress_shared_state.h:204 in rocksdb::SharedState::AllInitialized() const
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9314

Test Plan: verified repro command works after this PR.

Reviewed By: jay-zhuang

Differential Revision: D33217698

Pulled By: ajkr

fbshipit-source-id: 79358fe5adb779fc9dcf80643cc102d4b467fc38
2021-12-20 13:05:23 -08:00
Andrew Kryczka 863c78d2c9 Fix unsynced data loss correctness test with mixed -test_batches_snapshots (#9302)
Summary:
This fixes two bugs in the recently committed DB verification following
crash-recovery with unsynced data loss (https://github.com/facebook/rocksdb/issues/8966):

The first bug was in crash test runs involving mixed values for
`-test_batches_snapshots`. The problem was we were neither restoring
expected values nor enabling tracing when `-test_batches_snapshots=1`.
This caused a future `-test_batches_snapshots=0` run to not find enough
trace data to restore expected values. The fix is to restore expected
values at the start of `-test_batches_snapshots=1` runs, but still leave
tracing disabled as we do not need to track those KVs.

The second bug was in `db_stress` runs that restore the expected values
file and use compaction filter. The compaction filter was initialized to use
the pre-restore expected values, which would be `munmap()`'d during
`FileExpectedStateManager::Restore()`. Then compaction filter would run
into a segfault. The fix is just to reorder compaction filter init after expected
values restore.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9302

Test Plan:
- To verify the first problem, the below sequence used to fail; now it passes.

```
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=1
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0
```

- The second problem occurred rarely in the form of a SIGSEGV on a file that was `munmap()`d. I have not seen it after this PR though this doesn't prove much.

Reviewed By: jay-zhuang

Differential Revision: D33155283

Pulled By: ajkr

fbshipit-source-id: 66fd0f0edf34015a010c30015f14f104734e964e
2021-12-17 22:05:29 -08:00
Andrew Kryczka 84228e21e8 Fix shutdown in db_stress with -test_batches_snapshots=1 (#9313)
Summary:
The `SharedState` constructor had an early return in case of
`-test_batches_snapshots=1`. This early return caused `num_bg_threads_`
to never be incremented. Consequently, the driver thread could cleanup
objects like the `SharedState` while BG threads were still running and
accessing it, leading to crash.

The fix is to move the logic for counting threads (both FG and BG) to
the place they are launched. That way we can be sure the counts are
consistent, at least for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9313

Test Plan:
below command used to fail, now it passes.

```
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=1
```

Reviewed By: jay-zhuang

Differential Revision: D33198670

Pulled By: ajkr

fbshipit-source-id: 126592dc1eb31998bc8f82ffbf5a0d4eb8dec317
2021-12-17 17:31:40 -08:00
mrambacher 423538a816 Make MemoryAllocator into a Customizable class (#8980)
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980

Reviewed By: zhichao-cao

Differential Revision: D32990403

Pulled By: mrambacher

fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
2021-12-17 04:20:47 -08:00
Andrew Kryczka c9818b3325 db_stress verify with lost unsynced operations (#8966)
Summary:
When a previous run left behind historical state/trace files (implying it was run with --sync_fault_injection set), this PR uses them to restore the expected state according to the DB's recovered sequence number. That way, a tail of latest unsynced operations are permitted to be dropped, as is the case when data in page cache or certain `Env`s is lost. The point of the verification in this scenario is just to ensure there is no hole in the recovered data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8966

Test Plan:
- ran it a while, made sure it is restoring expected values using the historical state/trace files:
```
$ rm -rf ./tmp-db/ ./exp/ && mkdir -p ./tmp-db/ ./exp/ && while ./db_stress -compression_type=none -clear_column_family_one_in=0 -expected_values_dir=./exp -sync_fault_injection=1 -destroy_db_initially=0 -db=./tmp-db -max_key=1000000 -ops_per_thread=10000 -reopen=0 -threads=32 ; do : ; done
```

Reviewed By: pdillinger

Differential Revision: D31219445

Pulled By: ajkr

fbshipit-source-id: f0e1d51fe5b35465b00565c33331190ea38ba0ad
2021-12-15 12:54:44 -08:00
Yanqin Jin e05c2bb549 Stress test for RocksDB transactions (#8936)
Summary:
Current db_stress does not cover complex read-write transactions. Therefore, this PR adds
coverage for emulated MyRocks-style transactions in `MultiOpsTxnsStressTest`. To achieve this, we need:

- Add a new operation type 'customops' so that we can add new complex groups of operations, e.g. transactions involving multiple read-write operations.
- Implement three read-write transactions and two read-only ones to emulate MyRocks-style transactions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8936

Test Plan:
```
make check
./db_stress -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=100000
```

Next step is to add more configurability and refine input generation and result reporting, which will done in separate follow-up PRs.

Reviewed By: zhichao-cao

Differential Revision: D31071795

Pulled By: riversand963

fbshipit-source-id: 50d7c828346ec643311336b904848a1588a37006
2021-12-14 13:34:43 -08:00
Akanksha Mahajan 9e4d56f2c9 Fix segmentation fault in table_options.prepopulate_block_cache when used with partition_filters (#9263)
Summary:
When table_options.prepopulate_block_cache is set to
BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and
table_options.partition_filters is also set true, then there is
segmentation failure when top level filter is fetched because its
entered with wrong type in cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263

Test Plan:
Updated unit tests;
Ran db_stress: make crash_test -j32

Reviewed By: pdillinger

Differential Revision: D32936566

Pulled By: akankshamahajan15

fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426
2021-12-08 12:44:38 -08:00
Hui Xiao 66b31c5098 Fix -Werror=maybe-uninitialized in db_stress_tool (#9265)
Summary:
Context/Summary:
Uninitialized variable `SequenceNumber old_saved_seqno` causes asan related compilation error/warning below:

```
db_stress_tool/expected_state.cc:308:55: error: ‘old_saved_seqno’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  308 |   if (s.ok() && old_saved_seqno != kMaxSequenceNumber &&
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```

Fix it by initializing to 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9265

Test Plan:
- make clean && COMPILE_WITH_ASAN=1 make -j48 db_stress_tool/expected_state.o
- monitor if same error happens again after merging

Reviewed By: ajkr

Differential Revision: D32939630

Pulled By: hx235

fbshipit-source-id: 41697515fd11ada8427f606b5dceb4e58d12cb80
2021-12-07 22:42:30 -08:00
Andrew Kryczka ce42ae6ffd Fix Statistics in db_stress (#9260)
Summary:
The `Statistics` objects are meant to be shared across translation
units, but this was prevented by declaring them static. We need to
ensure they are defined once in the program. The effect is now
`StressTest::PrintStatistics()` can actually print statistics since it
now sees non-null values when `--statistics=1`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9260

Reviewed By: zhichao-cao

Differential Revision: D32910162

Pulled By: ajkr

fbshipit-source-id: c926d6f556177987bee5fa3cbc87597803b230ee
2021-12-07 16:24:22 -08:00
Andrew Kryczka a6a6aad74e db_stress support tracking historical values (#8960)
Summary:
When `--sync_fault_injection` is set, this PR takes a snapshot of the expected values and starts an operation trace when the DB is opened. These files are stored in `--expected_values_dir`. They will be used for recovering the expected state of the DB following a crash where a suffix of unsynced operations are allowed to be lost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8960

Test Plan: injected crashed at various points in `FileExpectedStateManager` and verified the next run recovers the state/trace file with highest seqno and removes all older/temporary files. Note we don't use sync_fault_injection in CI crash tests yet.

Reviewed By: pdillinger

Differential Revision: D31194941

Pulled By: ajkr

fbshipit-source-id: b0f935a529a0186c5a9c7709fcaa8829de8a84cf
2021-12-07 13:41:48 -08:00
Aravind Ramesh 9c932816cf db_stress: db_stress segmentation fault (#9219)
Summary:
db_stress asserts/seg-faults with below command (on debug and release builds)

```
"rm -rf /tmp/rocksdbtest*; db_stress --ops_per_thread=1000 --reopen=5"
=======================================
Error opening unique id file for append: IO error: No such file or directory:
While open a file for appending: /tmp/rocksdbtest-0/dbstress/.unique_ids:
No such file or directory
Choosing random keys with no overwrite
Creating 2621440 locks
Starting continuous_verification_thread
2021/11/15-08:46:49  Initializing worker threads
2021/11/15-08:46:49  Starting database operations
2021/11/15-08:46:49  Reopening database for the 1th time
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
DB path: [/tmp/rocksdbtest-0/dbstress]
Segmentation fault
=======================================

```
StressTest() constructor deletes the directory "dbstress" because
the option --destroy_db_initially is true by default in db_stress.

This Seg fault happens on a new database, UniqueIdVerifier's constructor
tries to read the ".unique_ids" file, if the file is not present,
ReopenWritableFile() tries to create .unique_ids file, but fails
as the directory db_stress is not available. The data_file_writer_
is set as an invalid(null) pointer and in subsequent calls (~UniqueIdVerifier()
and UniqueIdVerifier::Verify()) it accesses this null pointer and crashes.

This patch creates db_stress directory if it is missing, so the .unique_ids file
is created.

Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9219

Reviewed By: ajkr

Differential Revision: D32730151

Pulled By: pdillinger

fbshipit-source-id: f47baba56b380d93c3ba5608904756e86bbf14f5
2021-11-30 14:56:13 -08:00
Yanqin Jin 42fef0224f Fix build for msvc (#9230)
Summary:
Test plan

With Visual Studio 2017.
```
cd rocksdb
mkdir build && cd build
cmake -G "Visual Studio 15 Win64" -DWITH_GFLAGS=1 ..
MSBuild rocksdb.sln /m /TARGET:cache_bench /TARGET:db_bench /TARGET:db_stress
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9230

Reviewed By: akankshamahajan15

Differential Revision: D32705095

Pulled By: riversand963

fbshipit-source-id: 101e3533f5178b24c0535ddc47a39347ccfcf92c
2021-11-29 14:27:48 -08:00
Levi Tamasi dc5de45af8 Support readahead during compaction for blob files (#9187)
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187

Test Plan: Ran `make check` and the stress/crash test.

Reviewed By: riversand963

Differential Revision: D32565512

Pulled By: ltamasi

fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
2021-11-19 17:53:47 -08:00
anand76 78556c14dd Secondary cache error injection (#9002)
Summary:
Implement secondary cache error injection in db_stress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9002

Reviewed By: zhichao-cao

Differential Revision: D31874896

Pulled By: anand1976

fbshipit-source-id: 8cf04c061a4a44efa0fe88423d05cade67b85f73
2021-11-08 10:27:27 -08:00
Yanqin Jin d263505417 Avoid div-by-zero error in db_stress (#9086)
Summary:
If a column family has 0 levels, then existing `TestCompactFiles(...)` may hit
divide-by-zero. To fix, return early if the cf is empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9086

Test Plan: TBD

Reviewed By: ajkr

Differential Revision: D31986799

Pulled By: riversand963

fbshipit-source-id: 48f7dfb2b2b47cfc1315cb71ca80eb230d947f17
2021-10-31 22:16:03 -07:00
Peter Dillinger 0534393fc8 Fix stress/crash test handling of SST unique IDs (#9054)
Summary:
Was not handling the case of OnTableFileCreated invoked for
table file NOT created.

Also improved error reporting and caught a missing status check.

Also strengthened the db_stress listener to require file_size > 0 when
status.ok(). We would be violating the API contract if status is OK and
we didn't create a valid SST file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9054

Test Plan: make blackbox_crash_test for a while

Reviewed By: zhichao-cao

Differential Revision: D31765200

Pulled By: pdillinger

fbshipit-source-id: 7c527f5531bc239a5efd7a7b018545d480f926e2
2021-10-19 11:52:07 -07:00
Peter Dillinger ad5325a736 Experimental support for SST unique IDs (#8990)
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).

Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990

Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D31582865

Pulled By: pdillinger

fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
2021-10-18 23:32:01 -07:00
Levi Tamasi 3e1bf771a3 Make it possible to force the garbage collection of the oldest blob files (#8994)
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.

In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).

These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)

This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994

Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.

Reviewed By: riversand963

Differential Revision: D31489850

Pulled By: ltamasi

fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
2021-10-11 18:03:01 -07:00
Andrew Kryczka a282eff3d1 Protect existing files in FaultInjectionTest{Env,FS}::ReopenWritableFile() (#8995)
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.

The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995

Test Plan:
- Verified it fixes the following failure:

```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```

- `make check -j48`

Reviewed By: ltamasi

Differential Revision: D31495388

Pulled By: ajkr

fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
2021-10-11 16:23:18 -07:00
Akanksha Mahajan 84d71f30c4 Enable SingleDelete with user defined ts in db_bench and crash tests (#8971)
Summary:
Enable SingleDelete with user defined timestamp in db_bench,
db_stress and crash test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8971

Test Plan:
1. For db_stress, ran the command for full duration: i) python3 -u tools/db_crashtest.py
--enable_ts whitebox --nooverwritepercent=100
ii) make crash_test_with_ts

2. For db_bench, ran:  ./db_bench -benchmarks=randomreplacekeys
-user_timestamp_size=8 -use_single_deletes=true

Reviewed By: riversand963

Differential Revision: D31246558

Pulled By: akankshamahajan15

fbshipit-source-id: 29cd8740c9921341e52f09242fca3c44d75a12b7
2021-10-01 16:48:01 -07:00
mrambacher 13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
Andrew Kryczka 559943cdc0 Refactor expected state in stress/crash test (#8913)
Summary:
This is a precursor refactoring to enable an upcoming feature: persistence failure correctness testing.

- Changed `--expected_values_path` to `--expected_values_dir` and migrated "db_crashtest.py" to use the new flag. For persistence failure correctness testing there are multiple possible correct states since unsynced data is allowed to be dropped. Making it possible to restore all these possible correct states will eventually involve files containing snapshots of expected values and DB trace files.
- The expected values directory is managed by an `ExpectedStateManager` instance. Managing expected state files is separated out of `SharedState` to prevent `SharedState` from becoming too complex when the new files and features (snapshotting, tracing, and restoring) are introduced.
- Migrated expected values file access/management out of `SharedState` into a separate class called `ExpectedState`. This is not exposed directly to the test but rather the `ExpectedState` for the latest values file is accessed via a pass-through API on `ExpectedStateManager`. This forces the test to always access the single latest `ExpectedState`.
- Changed the initialization of the latest expected values file to use a tempfile followed by rename, and also add cleanup logic for possible stranded tempfiles.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8913

Test Plan:
run in several ways; try to make sure it's not obviously broken.

- crashtest blackbox without TEST_TMPDIR
```
$ python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest blackbox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest whitebox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py whitebox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none --random_kill_odd=88887
```
- db_stress without expected_values_dir
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true
```
- db_stress with expected_values_dir and manual corruption
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true --expected_values_dir=./
// modify one byte in "./LATEST.state"
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=false --expected_values_dir=./
...
Verification failed for column family 0 key 0000000000000000 (0): Value not found: NotFound:
...
```

Reviewed By: riversand963

Differential Revision: D30921951

Pulled By: ajkr

fbshipit-source-id: babfe218062e55d018c9b046536c0289fb78f41c
2021-09-28 14:13:33 -07:00
Andrew Kryczka 791bff5b4e Prevent deadlock in db_stress with DbStressCompactionFilter (#8956)
Summary:
The cyclic dependency was:

- `StressTest::OperateDb()` locks the mutex for key 'k'
- `StressTest::OperateDb()` calls a function like `PauseBackgroundWork()`, which waits for pending compaction to complete.
- The pending compaction reaches key `k` and `DbStressCompactionFilter::FilterV2()` calls `Lock()` on that key's mutex, which hangs forever.

The cycle can be broken by using a new function, `port::Mutex::TryLock()`, which returns immediately upon failure to acquire a lock. In that case `DbStressCompactionFilter::FilterV2()` can just decide to keep the key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8956

Reviewed By: riversand963

Differential Revision: D31183718

Pulled By: ajkr

fbshipit-source-id: 329e4a31ce43085af174cf367ef560b5a04399c5
2021-09-24 16:54:02 -07:00
sdong 9320067703 Improve fault injection to MultiRead (#8937)
Summary:
Several improvements to MultiRead:
1. Fix a bug in stress test which causes false positive when both MultiRead() return and individual read request have failure injected.
2. Add two more types of fault that should be handled: empty read results and checksum mismatch
3. Add a message indicating which type of fault is injected
4. Increase the failure rate

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8937

Reviewed By: anand1976

Differential Revision: D31085930

fbshipit-source-id: 3a04994a3cadebf9a64d25e1fe12b14b7a272fba
2021-09-21 14:48:15 -07:00
Andrew Kryczka 5c92aa38ea Avoid overwriting first non-OK Status in db_stress setup (#8907)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8907

Reviewed By: zhichao-cao

Differential Revision: D30922081

Pulled By: ajkr

fbshipit-source-id: ad7a32c21d0049342fd20c9b7f555e93674c3671
2021-09-15 14:28:09 -07:00
Yanqin Jin 2a2b3e03a5 Allow WriteBatch to have keys with different timestamp sizes (#8725)
Summary:
In the past, we unnecessarily requires all keys in the same write batch
to be from column families whose timestamps' formats are the same for
simplicity. Specifically, we cannot use the same write batch to write to
two column families, one of which enables timestamp while the other
disables it.

The limitation is due to the member `timestamp_size_` that used to exist
in each `WriteBatch` object. We pass a timestamp_size to the constructor
of `WriteBatch`. Therefore, users can simply use the old
`WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while
the internal implementation of `WriteBatch` will take care of memory
allocation for timestamps.

The above is not necessary.
One the one hand, users can set up a memory buffer to store user key and
then contiguously append the timestamp to the user key. Then the user
can pass this buffer to the `WriteBatch::Put(Slice&)` API.
On the other hand, users can set up a SliceParts object which is an
array of Slices and let the last Slice to point to the memory buffer
storing timestamp. Then the user can pass the SliceParts object to the
`WriteBatch::Put(SliceParts&)` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30654499

Pulled By: riversand963

fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2
2021-09-12 15:34:26 -07:00
Peter Dillinger bda8d93ba9 Fix and detect headers with missing dependencies (#8893)
Summary:
It's always annoying to find a header does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.

rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)

Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.

This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893

Test Plan: existing tests and resolving issues detected by new check

Reviewed By: mrambacher

Differential Revision: D30823300

Pulled By: pdillinger

fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572
2021-09-10 10:00:26 -07:00
Peter Dillinger cb5b851ff8 Add (& fix) some simple source code checks (#8821)
Summary:
* Don't hardcode namespace rocksdb (use ROCKSDB_NAMESPACE)
* Don't #include <rocksdb/...> (use double quotes)
* Support putting NOCOMMIT (any case) in source code that should not be
committed/pushed in current state.

These will be run with `make check` and in GitHub actions

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8821

Test Plan: existing tests, manually try out new checks

Reviewed By: zhichao-cao

Differential Revision: D30791726

Pulled By: pdillinger

fbshipit-source-id: 399c883f312be24d9e55c58951d4013e18429d92
2021-09-07 21:19:27 -07:00
Peter Dillinger c9cd5d25a8 Remove some unneeded code (#8736)
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736

Test Plan: existing tests

Reviewed By: mrambacher

Differential Revision: D30700039

Pulled By: pdillinger

fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
2021-09-01 14:28:58 -07:00
Peter Dillinger 2a383f21f4 Add Bloom/Ribbon hybrid API support (#8679)
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.

So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)

I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.

C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.

BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679

Test Plan: new + updated tests, including crash test

Reviewed By: jay-zhuang

Differential Revision: D30445797

Pulled By: pdillinger

fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
2021-08-20 18:00:16 -07:00
Jay Zhuang 0729b287e9 Exclude property kLiveSstFilesSizeAtTemperature from stress_test (#8668)
Summary:
Just like other per_level properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8668

Test Plan: stress_test

Reviewed By: zhichao-cao

Differential Revision: D30360967

Pulled By: jay-zhuang

fbshipit-source-id: 70da2557b95c55e8081b04ebf1a909a0fe69488f
2021-08-17 09:06:01 -07:00
Baptiste Lemaire e3a96c4823 Memtable sampling for mempurge heuristic. (#8628)
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628

Reviewed By: pdillinger

Differential Revision: D30149315

Pulled By: bjlemaire

fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
2021-08-10 18:09:03 -07:00
Akanksha Mahajan 052c24a668 Fix db_stress failure (#8632)
Summary:
FaultInjectionTestFS injects error in Rename operation. Because
of injected error, info.log fails to be created if rename  returns error and info_log is set to nullptr which leads to this assertion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8632

Test Plan: run the db_stress job locally

Reviewed By: ajkr

Differential Revision: D30167387

Pulled By: akankshamahajan15

fbshipit-source-id: 8d08c4c33e8f0cabd368bbb498d21b9de0660067
2021-08-07 09:21:03 -07:00
Baptiste Lemaire d6006f9c9b Add experimental mempurge policy flag to db_stress. (#8588)
Summary:
Add `experimental_mempurge_policy` flag to `db_stress` and `db_crashtest.py`.
This flag is only read if the `experimental_allow_mempurge` flag is set to `true`. This flag can take the following values: `kAlways`, and `kAlternate` (default).
- `kAlways`: a flush is always redirected to a mempurge. If the mempurge aborts, the a regular flush proceeds.
- `kAlternate`: if one or more of the flush input memtables is an mempurge output memtable, then a flush is performed, else a mempurge is carried out. Similar to kAlways, if a mempurge aborts, the FlushJob proceeds to a regular flush to storage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8588

Reviewed By: pdillinger

Differential Revision: D29934251

Pulled By: bjlemaire

fbshipit-source-id: 90c1debed2029b9915d066914556547507c33dae
2021-07-28 13:27:58 -07:00
mrambacher 3aee4fbd41 Make EventListener into a Customizable Class (#8473)
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
2021-07-27 07:47:02 -07:00
sdong 9b41082d4a Complete the fix of stress open WAL drop fix (#8570)
Summary:
https://github.com/facebook/rocksdb/pull/8548 is not complete. We should instead cover all cases writable files are buffered, not just when failures are ingested. Extend it to any case where failures are ingested in DB open.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8570

Test Plan: Run db_stress and see it doesn't break

Reviewed By: jay-zhuang

Differential Revision: D29830415

fbshipit-source-id: 94449a0468fb2f7eec17423724008c9c63b2445d
2021-07-21 16:08:53 -07:00
Jay Zhuang 66ca5ac427 Cleanup cf handlers before deleting db (#8564)
Summary:
Delete column family handlers before deleting db to avoid `last_ref`
assert.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8564

Test Plan: Inject compaction test in db_stress test

Reviewed By: pdillinger

Differential Revision: D29797375

Pulled By: jay-zhuang

fbshipit-source-id: e8baf4d279f4db5d963db95c9445454156205501
2021-07-20 14:59:40 -07:00
Peter Dillinger d5f3b77f23 Add GetMapProperty to db_stress (#8551)
Summary:
Already has good coverage for GetProperty and GetIntProperty
but this one was missing.

This should add more confidence to https://github.com/facebook/rocksdb/issues/8538

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8551

Test Plan:
brief local run with boosted probability showed no immediate
issues

Reviewed By: siying

Differential Revision: D29746383

Pulled By: pdillinger

fbshipit-source-id: 9f9f525bc1a7607f85e563e33bda1979ef197127
2021-07-19 08:10:29 -07:00
sdong 39a07c9651 DB Stress Reopen write failure to skip WAL (#8548)
Summary:
When DB Stress enables write failure in reopen, WAL files are also created with a wrapper writalbe file which buffers write until fsync. However, crash test currently expects all writes to WAL is persistent. This is at odd with the unsynced bytes dropped. To work it around temporarily, we disable WAL write failure for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8548

Test Plan: Run db_stress. Manual printf to make sure only WAL files are skipped.

Reviewed By: jay-zhuang

Differential Revision: D29745095

fbshipit-source-id: 1879dd2c01abad7879ca243ee94570ec47c347f3
2021-07-16 16:09:33 -07:00
Baptiste Lemaire 0229a88dfe Crashtest mempurge (#8545)
Summary:
Add `experiemental_allow_mempurge` flag support for `db_stress` and `db_crashtest.py`, with a `false` default value.
I succesfully tested locally both `whitebox` and `blackbox` crash tests with `experiemental_allow_mempurge` flag set as true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8545

Reviewed By: akankshamahajan15

Differential Revision: D29734513

Pulled By: bjlemaire

fbshipit-source-id: 24316c0eccf6caf409e95c035f31d822c66714ae
2021-07-16 10:20:22 -07:00
sdong b1a53db327 FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() to recover… (#8501)
Summary:
… small overwritten files.
If a file is overwritten with renamed and the parent path is not synced, FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() will delete the file. However, RocksDB relies on file renaming to be atomic no matter whether the parent directory is synced or not, and the current behavior breaks the assumption and caused some false positive: https://github.com/facebook/rocksdb/pull/8489

Since the atomic renaming is used in CURRENT files, to fix the problem, in FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync(), we recover the state of overwritten file if the file is small.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8501

Test Plan: Run stress test for a while and see it doesn't break.

Reviewed By: anand1976

Differential Revision: D29594384

fbshipit-source-id: 589b5c2f0a9d2aca53752d7bdb0231efa5b3ae92
2021-07-07 16:23:23 -07:00
anand76 fcd8088333 Temporarily disable file deletion after open failure in db_stress (#8489)
Summary:
Write and metadata error injection during DB open was enabled in https://github.com/facebook/rocksdb/issues/8474. This causes crash tests to fail very frequently due to another fault injection feature that deletes files created after the last dir sync during DB open. In real life, a similar failure would happen if the FS returns error on the CURRENT file rename, but the rename actually succeeded and got partially persisted (dir entry for the old CURRENT file got removed, but the entry for the new one is not persisted). Temporarily disable the fault injection feature until we figure out the likelihood of this bug happening and the proper way to fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8489

Test Plan: Stress test can open the DB successfully

Reviewed By: siying

Differential Revision: D29564516

Pulled By: anand1976

fbshipit-source-id: ffd1650715ea3c5bf7131936b0ca6fcf66f4e14e
2021-07-06 14:16:57 -07:00
sdong f33611d5e9 Stress test to inject read failures in DB reopen (#8476)
Summary:
Inject read failures in DB reopen, just as what we do for metadata writes and writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476

Test Plan: Some manual tests and make sure failures are triggered.

Reviewed By: anand1976

Differential Revision: D29507283

fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c
2021-07-06 11:05:27 -07:00
mrambacher 570248aeff Make SecondaryCache Customizable (#8480)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8480

Reviewed By: zhichao-cao

Differential Revision: D29528740

Pulled By: mrambacher

fbshipit-source-id: fd0f70d15f66611c8498257a9973f7e98ca13839
2021-07-06 09:18:08 -07:00
Zhichao Cao a95a776d75 Inject fatal write failures to db_stress when DB is running (#8479)
Summary:
add the injest_error_severity to control if it is a retryable IO Error or a fatal or unrecoverable error. Use a flag to indicate, if fatal error comes, the flag is set and db is stopped (but not corrupted).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8479

Test Plan: run  ./db_stress --reopen=0 --read_fault_one_in=1000 --write_fault_one_in=5 --disable_wal=true --write_buffer_size=3000000 -writepercent=5 -readpercent=50 --injest_error_severity=2 --column_families=1, make check

Reviewed By: anand1976

Differential Revision: D29524271

Pulled By: zhichao-cao

fbshipit-source-id: 1aa9fb9b5655b0adba6f5ad12005ca8c074c795b
2021-07-01 14:16:47 -07:00
anand76 41d32152ce Enable crash test to run using fbcode components (#8471)
Summary:
Add a new test ```fbcode_crash_test``` to rocksdb-lego-determinator. This test allows the crash test to be run on Facebook Sandcastle infra using fbcode components. Also use the default Env in db_stress to access the expected values path as it requires a memory mapped file and may not work with custom Envs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8471

Reviewed By: ajkr

Differential Revision: D29474722

Pulled By: anand1976

fbshipit-source-id: 7d086d82dd7091ae48e08cb4ace763ce3e3b87ef
2021-07-01 12:23:01 -07:00
sdong ba224b75c7 Stress Test to inject write failures in reopen (#8474)
Summary:
Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474

Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected.

Reviewed By: zhichao-cao

Differential Revision: D29503116

fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab
2021-06-30 16:46:41 -07:00
anand76 6f9ed59b1d Allow db_stress to use a secondary cache (#8455)
Summary:
Add a ```-secondary_cache_uri``` to db_stress to allow the user to specify a custom ```SecondaryCache``` object from the object registry. Also allow db_crashtest.py to be run with an alternate db_stress location. Together, these changes will allow us to run db_stress using FB internal components.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8455

Reviewed By: zhichao-cao

Differential Revision: D29371972

Pulled By: anand1976

fbshipit-source-id: dd1b1fd80ebbedc11aa63d9246ea6ae49edb77c4
2021-06-27 23:54:39 -07:00
Akanksha Mahajan be8199cdb9 Run Merge with Integrated BlobDB in stress, crash and db_bench (#8461)
Summary:
Run Merge with Intergrated BlobDB in stress tests, crash tests and db_bench.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8461

Test Plan:
1. python3 -u tools/db_crashtest.py --simple whitebox
---use_merge=1 --enable_blob_files=1
           2.  ./db_bench --benchmarks="readwhilemerging"
--merge_operator=uint64add --enable_blob_files=true

Reviewed By: ltamasi

Differential Revision: D29394824

Pulled By: akankshamahajan15

fbshipit-source-id: 0a8e492b13129673e088fb8af3402ab678bb473a
2021-06-25 10:45:52 -07:00
Peter Dillinger 865a25101d Mark Ribbon filter and optimize_filters_for_memory as production (#8408)
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.

Some small refactoring in db_stress.

Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408

Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:

bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384

ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384

Reviewed By: mrambacher

Differential Revision: D29140805

Pulled By: pdillinger

fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
2021-06-17 12:29:16 -07:00
mrambacher 281ac9c89e Add CreateFrom methods to Env/FileSystem (#8174)
Summary:
- Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load.  This method/signature is a precursor to making these classes extend Customizable.

- Added CreateFromSystem to Env.  This method standardizes creating an Env from the environment variables.  Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI.  Now the code is more command/standardized.

- Added CreateFromFlags to Env.  These method allows Env to be create from string options (such as GFLAGS options) in a more standard way.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174

Reviewed By: zhichao-cao

Differential Revision: D28999603

Pulled By: mrambacher

fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb
2021-06-15 03:43:48 -07:00
sdong 7f3a0f5bc6 db_stress: wait for compaction to finish after open with failure injection (#8270)
Summary:
When injecting in DB open, error can happen in background threads, causing DB open succeed, but DB is soon made read-only and subsequence writes will fail, which is not expected. To prevent it from happening, wait for compaction to finish before serving the traffic. If there is a failure, reopen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8270

Test Plan: Run the test.

Reviewed By: ajkr

Differential Revision: D28230537

fbshipit-source-id: e2e97888904f9b9bb50c35ccf95b88c2319ef5c3
2021-05-05 16:41:45 -07:00
sdong e19908cba6 Refactor kill point (#8241)
Summary:
Refactor kill point to one single class, rather than several extern variables. The intention was to drop unflushed data before killing to simulate some job, and I tried to a pointer to fault ingestion fs to the killing class, but it ended up with harder than I thought. Perhaps we'll need to do this in another way. But I thought the refactoring itself is good so I send it out.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8241

Test Plan: make release and run crash test for a while.

Reviewed By: anand1976

Differential Revision: D28078486

fbshipit-source-id: f9182c1455f52e6851c13f88a21bade63bcec45f
2021-05-05 15:50:29 -07:00
Andrew Kryczka 0f42e50fec Fix GetLiveFiles() returning OPTIONS-000000 (#8268)
Summary:
See release note in HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268

Test Plan: unit test repro

Reviewed By: siying

Differential Revision: D28227901

Pulled By: ajkr

fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339
2021-05-05 12:54:46 -07:00
sdong cde69a7cfd db_stress to add --open_metadata_write_fault_one_in (#8235)
Summary:
DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place.
If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully.
This option is enabled in crash test in half of the time.
Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235

Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory.

Reviewed By: anand1976

Differential Revision: D28010944

fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
2021-04-28 10:58:05 -07:00
Peter Dillinger 95f6add746 Revert Ribbon starting level support from #8198 (#8212)
Summary:
This partially reverts commit 10196d7edc.

The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.

We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212

Test Plan: CI

Reviewed By: jay-zhuang

Differential Revision: D27896468

Pulled By: pdillinger

fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e
2021-04-20 19:46:40 -07:00
Peter Dillinger 10196d7edc Ribbon long-term support, starting level support (#8198)
Summary:
Since the Ribbon filter schema seems good (compatible back to
6.15.0), this change commits to long term support of the SST schema,
even though we expect the API for enabling Ribbon to change (still
called NewExperimentalRibbonFilterPolicy).

This also adds support for "hybrid" configuration in which some levels
use Bloom (higher levels, lower numbered) for speed and the rest use
Ribbon (lower levels, higher numbered) for memory space efficiency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198

Test Plan: unit test added, crash test support

Reviewed By: jay-zhuang

Differential Revision: D27831232

Pulled By: pdillinger

fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab
2021-04-16 15:43:08 -07:00
Akanksha Mahajan 0be89e87fd Enable backup/restore for Integrated BlobDB in stress and crash tests (#8165)
Summary:
Enable backup/restore functionality with Integrated BlobDB in
db_stress and crash test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8165

Test Plan:
Ran python3 -u tools/db_crashtest.py --simple whitebox along
with :
  1. decreased "backup_in_one" value for backups to be more frequent and
  2. manually changed code for "enable_blob_file" to be always true and
     apply blobdb params 100% for testing purpose.

Reviewed By: ltamasi

Differential Revision: D27636025

Pulled By: akankshamahajan15

fbshipit-source-id: 0d0e0d1479ced163f992872dc998e79c581bfc99
2021-04-07 17:57:24 -07:00
Peter Dillinger 879357fdb0 Make backups openable as read-only DBs (#8142)
Summary:
A current limitation of backups is that you don't know the
exact database state of when the backup was taken. With this new
feature, you can at least inspect the backup's DB state without
restoring it by opening it as a read-only DB.

Rather than add something like OpenAsReadOnlyDB to the BackupEngine API,
which would inhibit opening stackable DB implementations read-only
(if/when their APIs support it), we instead provide a DB name and Env
that can be used to open as a read-only DB.

Possible follow-up work:

* Add a version of GetBackupInfo for a single backup.
* Let CreateNewBackup return the BackupID of the newly-created backup.

Implementation details:

Refactored ChrootFileSystem to split off new base class RemapFileSystem,
which allows more general remapping of files. We use this base class to
implement BackupEngineImpl::RemapSharedFileSystem.

To minimize API impact, I decided to just add these fields `name_for_open`
and `env_for_open` to those set by GetBackupInfo when
include_file_details=true. Creating the RemapSharedFileSystem adds a bit
to the memory consumption, perhaps unnecessarily in some cases, but this
has been mitigated by (a) only initialize the RemapSharedFileSystem
lazily when GetBackupInfo with include_file_details=true is called, and
(b) using the existing `shared_ptr<FileInfo>` objects to hold most of the
mapping data.

To enhance API safety, RemapSharedFileSystem is wrapped by new
ReadOnlyFileSystem which rejects any attempts to write. This uncovered a
couple of places in which DB::OpenForReadOnly would write to the
filesystem, so I fixed these. Added a release note because this affects
logging.

Additional minor refactoring in backupable_db.cc to support the new
functionality.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142

Test Plan:
new test (run with ASAN and UBSAN), added to stress test and
ran it for a while with amplified backup_one_in

Reviewed By: ajkr

Differential Revision: D27535408

Pulled By: pdillinger

fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148
2021-04-06 14:37:53 -07:00
Yanqin Jin 08144bc2f5 Add user-defined timestamps to db_stress (#8061)
Summary:
Add some basic test for user-defined timestamp to db_stress. Currently,
read with timestamp always tries to read using the current timestamp.
Due to the per-key timestamp-sequence ordering constraint, we only add timestamp-
related tests to the `NonBatchedOpsStressTest` since this test serializes accesses
to the same key and uses a file to cross-check data correctness.
The timestamp feature is not supported in a number of components, e.g. Merge, SingleDelete,
DeleteRange, CompactionFilter, Readonly instance, secondary instance, SST file ingestion, transaction,
etc. Therefore, db_stress should exit if user enables both timestamp and these features at the same
time. The (currently) incompatible features can be found in
`CheckAndSetOptionsForUserTimestamp`.

This PR also fixes a bug triggered when timestamp is enabled together with
`index_type=kBinarySearchWithFirstKey`. This bug fix will also be in another separate PR
with more unit tests coverage. Fixing it here because I do not want to exclude the index type
from crash test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8061

Test Plan: make crash_test_with_ts

Reviewed By: jay-zhuang

Differential Revision: D27056282

Pulled By: riversand963

fbshipit-source-id: c3e00ad1023fdb9ebbdf9601ec18270c5e2925a9
2021-03-23 05:13:30 -07:00
Levi Tamasi 0d800dadea Adjust the set of potential min_blob_size values in stress/crash tests (#8085)
Summary:
Since our stress/crash tests by default generate values of size 8, 16, or 24,
it does not make much sense to set `min_blob_size` to 256. The patch
updates the set of potential `min_blob_size` values in the crash test
script and in `db_stress` where it might be set dynamically using
`SetOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8085

Test Plan: Ran `make check` and tried the crash test script.

Reviewed By: riversand963

Differential Revision: D27238620

Pulled By: ltamasi

fbshipit-source-id: 4a96f9944b1ed9220d3045c5ab0b34c49009aeee
2021-03-22 14:38:09 -07:00
Peter Dillinger 3bfd3ed2f3 Begin forward compatibility for new backup meta schema (#8069)
Summary:
This does not add any new public APIs or published
functionality, but adds the ability to read and use (and in tests,
write) backups with a new meta file schema, based on the old schema
but not forward-compatible (before this change). The new schema enables
some capabilities not in the old:

* Explicit versioning, so that users get clean error messages the next
time we want to break forward compatibility.
* Ignoring unrecognized fields (with warning), so that new non-critical
features can be added without breaking forward compatibility.
* Rejecting future "non-ignorable" fields, so that new features critical
to some use-cases could potentially be added outside of linear schema
versions, with broken forward compatibility.
* Fields at the end of the meta file, such as for checksum of the meta
file's contents (up to that point)
* New optional 'size' field for each file, which is checked when present
* Optionally omitting 'crc32' field, so that we aren't required to have
a crc32c checksum for files to take a backup. (E.g. to support backup
via hard links and to better support file custom checksums.)

Because we do not have a JSON parser and to share code, the new schema
is simply derived from the old schema.

BackupEngine code is updated to allow missing checksums in some places,
and to make that easier, `has_checksum` and `verify_checksum_after_work`
are eliminated. Empty `checksum_hex` indicates checksum is unknown. I'm
not too afraid of regressing on data integrity, because
(a) we have pretty good test coverage of corruption detection in backups, and
(b) we are increasingly relying on the DB itself for data integrity rather than
it being an exclusive feature of backups.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8069

Test Plan:
new unit tests, added to crash test (some local run with
boosted backup probability)

Reviewed By: ajkr

Differential Revision: D27139824

Pulled By: pdillinger

fbshipit-source-id: 9e0e4decfb42bb84783d64d2d246456d97e8e8c5
2021-03-19 20:15:40 -07:00
mrambacher 3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Peter Dillinger 847ca9f964 Make default share_files_with_checksum=true (#8020)
Summary:
New comment for share_files_with_checksum:
// Only used if share_table_files is set to true. Setting to false is
// DEPRECATED and potentially dangerous because in that case BackupEngine
// can lose data if backing up databases with distinct or divergent
// history, for example if restoring from a backup other than the latest,
// writing to the DB, and creating another backup. Setting to true (default)
// prevents these issues by ensuring that different table files (SSTs) with
// the same number are treated as distinct. See
// share_files_with_checksum_naming and ShareFilesNaming.

I have also removed interim option kFlagMatchInterimNaming, which is no
longer needed and was never needed for correct+compatible operation
(just performance).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020

Test Plan:
tests updated. Backward+forward compatibility verified with
SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup
options, and I manually verified shared_checksum in
/tmp/rocksdb_format_compatible_peterd/bak/current/ after run.

Reviewed By: ajkr

Differential Revision: D26786331

Pulled By: pdillinger

fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130
2021-03-09 16:27:13 -08:00
Yanqin Jin 1f11d07f24 Enable compact filter for blob in dbstress and dbbench (#8011)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8011

Test Plan:
```
./db_bench -enable_blob_files=1 -use_keep_filter=1 -disable_auto_compactions=1
/db_stress -enable_blob_files=1 -enable_compaction_filter=1 -acquire_snapshot_one_in=0 -compact_range_one_in=0 -iterpercent=0 -test_batches_snapshots=0 -readpercent=10 -prefixpercent=20 -writepercent=55 -delpercent=15 -continuous_verification_interval=0
```

Reviewed By: ltamasi

Differential Revision: D26736061

Pulled By: riversand963

fbshipit-source-id: 1c7834903c28431ce23324c4f259ed71255614e2
2021-03-01 17:24:47 -08:00
Andrew Kryczka d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
Levi Tamasi dab4fe5bcd Add checkpoint support to BlobDB (#7959)
Summary:
The patch adds checkpoint support to BlobDB. Blob files are hard linked or
copied, depending on whether the checkpoint directory is on the same filesystem
or not, similarly to table files.

TODO: Add support for blob files to `ExportColumnFamily` and to the checksum
verification logic used by backup/restore.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7959

Test Plan: Ran `make check` and the crash test for a while.

Reviewed By: riversand963

Differential Revision: D26434768

Pulled By: ltamasi

fbshipit-source-id: 994be55a8dc08133028250760fca440d2c7c4dc5
2021-02-17 12:42:36 -08:00
sdong e3183eae77 Stress test to allow memtable whole key filter (#7937)
Summary:
Right now, stress test cannot be configured to use memtable whole key filter without prefix filter. It doesn't appear to be necessary. remove this constraint.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7937

Test Plan: "make crash_test" to be able to run.

Reviewed By: ltamasi

Differential Revision: D26295532

fbshipit-source-id: 30c874a9dc2b672a460603a4ee32368674e0face
2021-02-05 22:50:45 -08:00
Levi Tamasi 0288bdbc53 Add the integrated BlobDB to the stress/crash tests (#7900)
Summary:
The patch adds support for the options related to the new BlobDB implementation
to `db_stress`, including support for dynamically adjusting them using `SetOptions`
when `set_options_one_in` and a new flag `allow_setting_blob_options_dynamically`
are specified. (The latter is used to prevent the options from being enabled when
incompatible features are in use.)

The patch also updates the `db_stress` help messages of the existing stacked BlobDB
related options to clarify that they pertain to the old implementation. In addition, it
adds the new BlobDB to the crash test script. In order to prevent a combinatorial explosion
of jobs and still perform whitebox/blackbox testing (including under ASAN/TSAN/UBSAN),
and to also test BlobDB in conjunction with atomic flush and transactions, the script sets
the BlobDB options in 10% of normal/`cf_consistency`/`txn` crash test runs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7900

Test Plan: Ran `make check` and `db_stress`/`db_crashtest.py` with various options.

Reviewed By: jay-zhuang

Differential Revision: D26094913

Pulled By: ltamasi

fbshipit-source-id: c2ef3391a05e43a9687f24e297df05f4a5584814
2021-02-02 11:41:18 -08:00
Andrew Kryczka 78ee8564ad Integrity protection for live updates to WriteBatch (#7748)
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).

The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.

When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748

Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.

Reviewed By: pdillinger

Differential Revision: D25754492

Pulled By: ajkr

fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
2021-01-29 12:18:58 -08:00
Cheng Chang 30cd38c687 Increase the txn lock timeout in stress test (#7823)
Summary:
We recently encounter two cases of txn lock timeout in stress test. It might be caused due to latencies of resource scheduling in the internal infrastructure. Hopefully increasing the timeout can make the related tests less flaky.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7823

Test Plan: watch internal stress test to pass.

Reviewed By: siying

Differential Revision: D25739233

Pulled By: cheng-chang

fbshipit-source-id: 84a5a8ae820db24dacd0cfc05928b26505fab89d
2020-12-30 20:31:35 -08:00
Zhichao Cao 601585bca4 fix memory leak in db_stress checkpoint test (#7813)
Summary:
fix memory leak in db_stress checkpoint test. If s is not ok, checkpoint is not deleted, may cause memory leak.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7813

Test Plan: make asan_check

Reviewed By: cheng-chang

Differential Revision: D25702999

Pulled By: zhichao-cao

fbshipit-source-id: 08253b0852835acb8cfd412503cdabf720afb678
2020-12-25 13:15:48 -08:00
Zhichao Cao 04b3524ad0 Inject the random write error to stress test (#7653)
Summary:
Inject the random write error to stress test, it requires set reopen=0 and disable_wal=true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7653

Test Plan: pass db_stress and python3 db_crashtest.py blackbox

Reviewed By: ajkr

Differential Revision: D25354132

Pulled By: zhichao-cao

fbshipit-source-id: 44721104eecb416e27f65f854912c40e301dd669
2020-12-17 11:52:28 -08:00
Peter Dillinger 60af964372 Experimental (production candidate) SST schema for Ribbon filter (#7658)
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)

Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.

### Benchmarking

```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 18.7508
  Random filter net ns/op: 258.246
    Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 58.4523
  Random filter net ns/op: 363.717
    Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```

168.166 / 238.488 = 0.705  -> 29.5% space reduction

130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)

### Working around a hashing "flaw"

bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate.  The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant.  Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))

As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)

TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.

### Other related changes

* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658

Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.

Reviewed By: jay-zhuang

Differential Revision: D24899349

Pulled By: pdillinger

fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
2020-11-12 20:46:14 -08:00
Cheng Chang c3911f1a72 Track WAL in MANIFEST: Track deleted WALs in MANIFEST after recovering from the WALs (#7649)
Summary:
After replaying the WALs, the memtables are flushed synchronously to L0 instead of being flushed in background. Currently, we only track WAL obsoletion events in the code path of background flush jobs. This PR tracks these events in RecoverLogFiles.

After this change, we can enable `track_and_verify_wal_in_manifest` in `db_stress`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7649

Test Plan: `python tools/db_crashtest.py whitebox`

Reviewed By: riversand963

Differential Revision: D24824501

Pulled By: cheng-chang

fbshipit-source-id: 207129f7b845c50b333680ce6818a68a2fad54b9
2020-11-09 10:25:43 -08:00
Cheng Chang 5e794b0841 Fix a recovery corner case (#7621)
Summary:
Consider the following sequence of events:

1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.

It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.

After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621

Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.

Reviewed By: riversand963

Differential Revision: D24668144

Pulled By: cheng-chang

fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
2020-11-07 22:23:27 -08:00
Cheng Chang 1e40696dd1 Track WAL in MANIFEST: LogAndApply WAL events to MANIFEST (#7601)
Summary:
When a WAL is synced, an edit is written to MANIFEST.
After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601

Test Plan:
`track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
Watch all tests to pass.

Reviewed By: ltamasi

Differential Revision: D24553957

Pulled By: cheng-chang

fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
2020-11-06 17:22:36 -08:00
Jay Zhuang 4a6840bd00 db_stress prints key in Hex (#7533)
Summary:
db_stress prints key in both id and hex. https://github.com/facebook/rocksdb/issues/7531

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7533

Test Plan: local test

Reviewed By: akankshamahajan15

Differential Revision: D24252725

Pulled By: jay-zhuang

fbshipit-source-id: f0c1409a0568874df36949d5da139316d978fa98
2020-10-13 12:38:59 -07:00
Andrew Kryczka 75d3b6fdf0 Redesign block cache pinning API (#7520)
Summary:
The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520

Test Plan:
- new unit test
- added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0`

Reviewed By: pdillinger

Differential Revision: D24200034

Pulled By: ajkr

fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e
2020-10-11 14:58:24 -07:00
sdong aedcaaef99 Stress test to support paranoid_file_checks (#7473)
Summary:
It's important to make sure no false positive is reported when options.paranoid_file_checks is used. Add it to stress test and a place holder in crash test. It is disabled in crash test as there appears to be a bug causing false positive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7473

Test Plan: Run crash test

Reviewed By: ajkr

Differential Revision: D24026939

fbshipit-source-id: 89102acb45cf041776775ce44a4eef4b0f3a380c
2020-09-30 14:41:33 -07:00
Peter Dillinger b475a83f9d Postponing custom checksum support in BackupEngine (#7411)
Summary:
This change reverts BackupEngine to 6.12 state to accommodate a
higher-priority fix that does not easily merge with this custom checksum
support. We intend to reinstate this support soon, by merging a revert
of this change.

For backupable_db_test, I've removed the tests depending on this
feature.

I've also removed relevant HISTORY.md entry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411

Test Plan: unit tests

Reviewed By: ajkr

Differential Revision: D23793835

Pulled By: pdillinger

fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052
2020-09-18 15:27:03 -07:00
Peter Dillinger 93719fc953 Restore file size in backup table file names (and other cleanup) (#7400)
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.

This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.

We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).

We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.

Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`

Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.

Actual file name patterns and priorities, as regexes:

    kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
      [0-9]+_[0-9]+_[0-9]+[.]sst
    kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
      [0-9]+_[0-9a-fA-F-]+[.]sst
    kUseDbSessionId AND NOT kFlagIncludeFileSize ->
      [0-9]+_s[0-9A-Z]{20}[.]sst
    kUseDbSessionId AND kFlagIncludeFileSize (default) ->
      [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst

We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.

This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400

Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.

Reviewed By: ajkr

Differential Revision: D23759587

Pulled By: pdillinger

fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
2020-09-17 10:24:22 -07:00
Peter Dillinger a0ac71aae1 Disable sst_file_manager in stress testing backup restore (#7384)
Summary:
This is potentially the cause of failures:

    Failure in Destroy restore dir with: IO error: file rmdir: /dev/shm/rocksdb/rocksdb_crashtest_whitebox/.restore13: Directory not empty

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7384

Test Plan: smoke test blackbox_crash_test

Reviewed By: jay-zhuang

Differential Revision: D23685087

Pulled By: pdillinger

fbshipit-source-id: 55f62e9853ce84be1d5ca7d856de867f0f2596ee
2020-09-14 14:21:06 -07:00
Peter Dillinger c4e2066dbd Fix cf_consistency_stress for backup/restore, harmonize (#7373)
Summary:
We can only check key on restored backup if in a stress test
configuration locking the key. (Fixes mismatch seen in backup/restore
with atomic flush.)

TestCheckpoint used a very ugly solution to the same problem: copy-paste
dozens of lines of code with some changes and removals. I removed the
unnecessary implementation and made the existing one simply adaptive,
like TestBackupRestore.

Also made TestBackupRestore clean up dead backup/restore directories on
success.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7373

Test Plan:
blackbox_crash_test_with_atomic_flush for a while,
blackbox_crash_test for a while, with backup and checkpoint 1 in 5k and
only 1k max_keys to stress this area

Reviewed By: ajkr

Differential Revision: D23629057

Pulled By: pdillinger

fbshipit-source-id: d7fe7e2be75aaf3cf974be9540a7c5c5de8b371b
2020-09-10 22:55:06 -07:00
Peter Dillinger e3149358a5 More backup/restore stress test fixes (#7361)
Summary:
(a) Missed a case in updating handling of rand_keys
(b) Only opening restored db with DB::Open so don't (yet)
attempt to open restored BlobDB or TransactionDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7361

Test Plan: better than being broken

Reviewed By: ajkr

Differential Revision: D23592570

Pulled By: pdillinger

fbshipit-source-id: dd1d999bcc0c852ee77cb6041964ec4abc0fd4fd
2020-09-09 11:19:05 -07:00
Peter Dillinger 4e258d3e63 Fix backup/restore in stress/crash test (#7357)
Summary:
(1) Skip check on specific key if restoring an old backup
(small minority of cases) because it can fail in those cases. (2) Remove
an old assertion about number of column families and number of keys
passed in, which is broken by atomic flush (cf_consistency) test. Like
other code (for better or worse) assume a single key and iterate over
column families. (3) Apply mock_direct_io to NewSequentialFile so that
db_stress backup works on /dev/shm.

Also add more context to output in case of backup/restore db_stress
failure.

Also a minor fix to BackupEngine to report first failure status in
creating new backup, and drop another clue about the potential
source of a "Backup failed" status.

Reverts "Disable backup/restore stress test (https://github.com/facebook/rocksdb/issues/7350)"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7357

Test Plan:
Using backup_one_in=10000,
"USE_CLANG=1 make crash_test_with_atomic_flush" for 30+ minutes
"USE_CLANG=1 make blackbox_crash_test" for 30+ minutes
And with use_direct_reads with TEST_TMPDIR=/dev/shm/rocksdb

Reviewed By: riversand963

Differential Revision: D23567244

Pulled By: pdillinger

fbshipit-source-id: e77171c2e8394d173917e36898c02dead1c40b77
2020-09-08 10:50:19 -07:00
Peter Dillinger 06ad5dd293 Add file checksum to stress/crash test (#7343)
Summary:
This change has the crash test randomly select from a few file
checksum implementations, or nullptr, for DB file_checksum_gen_factory.
For compatibility across runs on same DB, each non-null factory can
understand all the other functions, but the default changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7343

Test Plan:
'make blackbox_crash_test' for a while, including with some
debug output to ensure code is being exercised.

Reviewed By: zhichao-cao

Differential Revision: D23494580

Pulled By: pdillinger

fbshipit-source-id: 73bbc7ca32c1adaf619134c0c830f12894880b8a
2020-09-03 23:50:33 -07:00
Peter Dillinger 499c9448d0 Fix, enable, and enhance backup/restore in db_stress (#7348)
Summary:
Although added to db_stress, testing of backup/restore
was never integrated into the crash test, originally concerned about
performance. I've enabled it now and to address the peformance concern,
testing backup/restore is always skipped once the db exceeds a certain
size threshold, default 100MB. This should provide sufficient
opportunity for testing BackupEngine without bogging down everything
else with heavier and heavier operations.

Also fixed backup/restore in db_stress by making sure PurgeOldBackups
can remove manifest files, which are normally kept around for db_stress.

Added more coverage of backup options, and up to three backups being
saved in one backup directory (in some cases).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7348

Test Plan:
ran 'make blackbox_crash_test' for a while, with heightened
probabilitly of taking backups (1/10k). Also confirmed with some debug
output that the code is being covered, TestBackupRestore only takes
a few seconds to complete when triggered, and even at 1/10k and ~50MB
database, there's <,~ 1 thread testing backups at any time.

Reviewed By: ajkr

Differential Revision: D23510835

Pulled By: pdillinger

fbshipit-source-id: b6b8735591808141f81f10773ac31634cf03b6c0
2020-09-03 20:13:15 -07:00
Hans Holmberg 2a0d3c7054 Add a file system parameter: --fs_uri to db_stress and db_bench (#6878)
Summary:
This pull request adds the parameter --fs_uri to db_bench and db_stress, creating a composite env combining the default env with a specified registered rocksdb file system.

This makes it easier to develop and test new RocksDB FileSystems.

The pull request also registers the posix file system for testing purposes.

Examples:
```
$./db_bench --fs_uri=posix:// --benchmarks=fillseq

$./db_stress --fs_uri=zenfs://nullb1
```

zenfs is a RocksDB FileSystem I'm developing to add support for zoned block devices, and in that case the zoned block device is specified in the uri (a zoned null block device in the above example).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6878

Reviewed By: siying

Differential Revision: D23023063

Pulled By: ajkr

fbshipit-source-id: 8b3fe7193ce45e683043b021779b7a4d547af247
2020-08-17 11:55:24 -07:00
Andrew Kryczka 7eebe6d38a Mark files for compaction in stress/crash tests (#7231)
Summary:
The mechanism to mark files for compaction is most commonly used in
delete-triggered compaction. This PR adds an option to exercise the
marking mechanism on random files created by db_stress. This PR also
enables that option in db_crashtest.py on its db_stress runs at random.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7231

Test Plan:
- ran some minified crash tests; verified they succeed and we see `"compaction_reason": "FilesMarkedForCompaction"` regularly in the logs.

```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --duration=600 --interval=30 --max_key=10000000 --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --value_size_mult=33
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py whitebox --duration=600 --interval=30 --max_key=1000000 --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --value_size_mult=33 --random_kill_odd=8887
```

Reviewed By: anand1976

Differential Revision: D23025156

Pulled By: ajkr

fbshipit-source-id: a404c467ebc12afa94dae35956ea9b372f592a96
2020-08-10 16:17:56 -07:00
Jay Zhuang 0c5bb10f06 Remove redundant ROCKSDB_LITE check (#7172)
Summary:
It's already inside of a `#ifdef ROCKSDB_LITE` block.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7172

Reviewed By: gg814

Differential Revision: D22736057

Pulled By: jay-zhuang

fbshipit-source-id: 31f4aa05aba98e2e42fa6f890fa72acf3a0f12f2
2020-07-24 14:14:14 -07:00
Jay Zhuang fc4d5f5065 Add stress test for GetProperty (#7111)
Summary:
Add stress test coverage for `DB::GetProperty()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7111

Test Plan:
```
./db_stress -get_property_one_in=1
make crash_test
```

Reviewed By: ajkr

Differential Revision: D22487906

Pulled By: jay-zhuang

fbshipit-source-id: c118d95cc9b4e2fa669a06e6aa531541fa885dc5
2020-07-14 12:12:36 -07:00
mrambacher c7c7b07f06 More Makefile Cleanup (#7097)
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env

These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.

Tested both release and debug builds via Make and CMake for both static and shared libraries.

More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097

Reviewed By: riversand963

Differential Revision: D22463160

Pulled By: pdillinger

fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
2020-07-09 14:35:17 -07:00
Yanqin Jin 842bd2742a Running ./ldb without any extra arg print usage (#7107)
Summary:
as title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7107

Test Plan: make ldb && ./ldb

Reviewed By: pdillinger

Differential Revision: D22451399

Pulled By: riversand963

fbshipit-source-id: 797645e06473bb9cf139c533877e5161281515e8
2020-07-09 10:20:06 -07:00
Peter Dillinger 90fd6b0cc8 cf_consistency_stress (crash_test_with_atomic_flush) checkpoint clean (#7103)
Summary:
Delicious copy-pasta from https://github.com/facebook/rocksdb/issues/7039

Also fixing DestroyDir to allow files to go missing while it is operating. This seems to fix failures I got with test plan reproducer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7103

Test Plan:
make blackbox_crash_test_with_atomic_flush for a while with
checkpoint_one_in=100

Reviewed By: siying

Differential Revision: D22435315

Pulled By: pdillinger

fbshipit-source-id: 0ec0538402493887aeda43ecc03f32979cb84ced
2020-07-08 13:04:55 -07:00
Jay Zhuang ca7659e2c4 Fix release build caused by #7067 (#7077)
Summary:
The issue is introduced by https://github.com/facebook/rocksdb/issues/7067

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7077

Test Plan: `make release`

Reviewed By: pdillinger

Differential Revision: D22370835

Pulled By: jay-zhuang

fbshipit-source-id: 44326bae07809c4518371b6a7d1f47124e24a4f3
2020-07-02 20:53:08 -07:00
Jay Zhuang 00de699096 Replace reinterpret_cast with static_cast_with_check (#7067)
Summary:
Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067

Reviewed By: siying

Differential Revision: D22361587

Pulled By: jay-zhuang

fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1
2020-07-02 19:25:41 -07:00
mrambacher 80f71b5863 Use Libraries in the RocksDB Makefile Build (#6660)
Summary:
Change the linking of tests/tools to be against a library rather than a list of objects.  This change substantially reduces the size of the objects produced.

peterd clean repo size: 264M
Before this change, with make all: 40G
After this change, with make all: 28G
With make LIB_MODE=shared all: 7.0G

The list of TESTS was changed from being hard-coded to generated from the test sources variable.  Note that there are some test sources that are not built as tests (though the set of tests is identical to the previous version).

Added OBJ_DIR option to Makefile to allow objects to be placed in an alternative location.  By default, OBJ_DIR is the same as before ("./").

This change is a precursor to being able to build/run the tests/tools linked against static libraries.  Additionally, it should be possible to clean up and merge some of the rules for building tests and the like if so desired.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6660

Reviewed By: riversand963

Differential Revision: D22244463

Pulled By: pdillinger

fbshipit-source-id: db9c6341d81ed62c2270374f4ede02fb9604c754
2020-06-30 19:33:31 -07:00
Cheng Chang f045ee6422 Increase transaction timeout and enable deadlock detection in stress test (#7056)
Summary:
There are errors like `Transaction put: Operation timed out: Timeout waiting to lock key
terminate called without an active exception`, based on experiment on devserver, increasing timeouts can resolve the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7056

Test Plan: watch stress test with txn.

Reviewed By: anand1976

Differential Revision: D22317265

Pulled By: cheng-chang

fbshipit-source-id: 2dc3352def5e78d2c39a18d7262a3a65ca98bbba
2020-06-30 14:29:17 -07:00
sdong 2d1d51d385 db_stress: deep clean directory before checkpoint (#7039)
Summary:
We see crash test occassionally fails with "A checkpoint operation failed with: Invalid argument: Directory exists". The suspicious is that the directory fails to be deleted because some trash files. Deep clean the directory after a DestroyDB() call.

Also add more debugging printf in case it fails.
Also, preserve the DB if verification fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7039

Test Plan: Run db_stress with low --checkpoint_one_in value

Reviewed By: riversand963

Differential Revision: D22271694

fbshipit-source-id: 6a9b2abb664fc69a4dc666741df4f6b23703cd6d
2020-06-30 12:01:34 -07:00
Peter Dillinger 5b2bbacb6f Minimize memory internal fragmentation for Bloom filters (#6427)
Summary:
New experimental option BBTO::optimize_filters_for_memory builds
filters that maximize their use of "usable size" from malloc_usable_size,
which is also used to compute block cache charges.

Rather than always "rounding up," we track state in the
BloomFilterPolicy object to mix essentially "rounding down" and
"rounding up" so that the average FP rate of all generated filters is
the same as without the option. (YMMV as heavily accessed filters might
be unluckily lower accuracy.)

Thus, the option near-minimizes what the block cache considers as
"memory used" for a given target Bloom filter false positive rate and
Bloom filter implementation. There are no forward or backward
compatibility issues with this change, though it only works on the
format_version=5 Bloom filter.

With Jemalloc, we see about 10% reduction in memory footprint (and block
cache charge) for Bloom filters, but 1-2% increase in storage footprint,
due to encoding efficiency losses (FP rate is non-linear with bits/key).

Why not weighted random round up/down rather than state tracking? By
only requiring malloc_usable_size, we don't actually know what the next
larger and next smaller usable sizes for the allocator are. We pick a
requested size, accept and use whatever usable size it has, and use the
difference to inform our next choice. This allows us to narrow in on the
right balance without tracking/predicting usable sizes.

Why not weight history of generated filter false positive rates by
number of keys? This could lead to excess skew in small filters after
generating a large filter.

Results from filter_bench with jemalloc (irrelevant details omitted):

    (normal keys/filter, but high variance)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.6278
    Number of filters: 5516
    Total size (MB): 200.046
    Reported total allocated memory (MB): 220.597
    Reported internal fragmentation: 10.2732%
    Bits/key stored: 10.0097
    Average FP rate %: 0.965228
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.5104
    Number of filters: 5464
    Total size (MB): 200.015
    Reported total allocated memory (MB): 200.322
    Reported internal fragmentation: 0.153709%
    Bits/key stored: 10.1011
    Average FP rate %: 0.966313

    (very few keys / filter, optimization not as effective due to ~59 byte
     internal fragmentation in blocked Bloom filter representation)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.5649
    Number of filters: 162950
    Total size (MB): 200.001
    Reported total allocated memory (MB): 224.624
    Reported internal fragmentation: 12.3117%
    Bits/key stored: 10.2951
    Average FP rate %: 0.821534
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 31.8057
    Number of filters: 159849
    Total size (MB): 200
    Reported total allocated memory (MB): 208.846
    Reported internal fragmentation: 4.42297%
    Bits/key stored: 10.4948
    Average FP rate %: 0.811006

    (high keys/filter)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.7017
    Number of filters: 164
    Total size (MB): 200.352
    Reported total allocated memory (MB): 221.5
    Reported internal fragmentation: 10.5552%
    Bits/key stored: 10.0003
    Average FP rate %: 0.969358
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.7131
    Number of filters: 160
    Total size (MB): 200.928
    Reported total allocated memory (MB): 200.938
    Reported internal fragmentation: 0.00448054%
    Bits/key stored: 10.1852
    Average FP rate %: 0.963387

And from db_bench (block cache) with jemalloc:

    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17063835
    $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17430747
    $ #^ 2.1% additional filter storage
    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8440400
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
    rocksdb.bloom.filter.useful COUNT : 4963889
    rocksdb.bloom.filter.full.positive COUNT : 1214081
    rocksdb.bloom.filter.full.true.positive COUNT : 1161999
    $ #^ 1.04 % observed FP rate
    $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8448592
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
    rocksdb.bloom.filter.useful COUNT : 5360933
    rocksdb.bloom.filter.full.positive COUNT : 1321315
    rocksdb.bloom.filter.full.true.positive COUNT : 1262999
    $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters

(Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427

Test Plan: unit test added, 'make check' with gcc, clang and valgrind

Reviewed By: siying

Differential Revision: D22124374

Pulled By: pdillinger

fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e
2020-06-22 13:32:07 -07:00
Andrew Kryczka d76eed4839 minor fixes for stress/crash contruns (#7006)
Summary:
Avoid using `cf_consistency` together with `enable_compaction_filter` as
the former heavily uses snapshots while the latter is incompatible with
snapshots.

Also fix a clang-analyze error for a write to a variable that is never
read.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7006

Reviewed By: zhichao-cao

Differential Revision: D22141679

Pulled By: ajkr

fbshipit-source-id: 1840ae238168818a9ab5973f90fd78c067399447
2020-06-19 16:05:17 -07:00
Peter Dillinger 88b4210701 Remove racially charged terms "whitelist" and "blacklist" (#7008)
Summary:
We don't need them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7008

Test Plan: "make check" and ensure "make crash_test" starts

Reviewed By: ajkr

Differential Revision: D22143838

Pulled By: pdillinger

fbshipit-source-id: 72c8e16603abc59f4954e304466bc4dc1f58f94e
2020-06-19 15:27:32 -07:00
Zhichao Cao a607f3efaa Fix unused variable failure (#7004)
Summary:
pass make check, run db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7004

Reviewed By: ajkr

Differential Revision: D22132617

Pulled By: zhichao-cao

fbshipit-source-id: d65397967e213206ec5efcb767bbdda8a575662a
2020-06-18 22:06:51 -07:00
Andrew Kryczka 775dc623ad add CompactionFilter to stress/crash tests (#6988)
Summary:
Added a `CompactionFilter` that is aware of the stress test's expected state. It only drops key versions that are already covered according to the expected state. It is incompatible with snapshots (same as all `CompactionFilter`s), so disables all snapshot-related features when used in the crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6988

Test Plan:
running a minified blackbox crash test

```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --max_key=1000000 -write_buffer_size=1048576 -max_bytes_for_level_base=4194304 -target_file_size_base=1048576 -value_size_mult=33 --interval=10 --duration=3600
```

Reviewed By: anand1976

Differential Revision: D22072888

Pulled By: ajkr

fbshipit-source-id: 727b9d7a90d5eab18be0ec6cd5a810712ac13320
2020-06-18 09:54:55 -07:00
Yanqin Jin 15d9f28da5 Add stress test for best-efforts recovery (#6819)
Summary:
Add crash test for the case of best-efforts recovery.
After a certain amount of time, we kill the db_stress process, randomly delete some certain table files and restart db_stress. Given the randomness of file deletion, it is difficult to verify against a reference for data correctness. Therefore, we just check that the db can restart successfully.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6819

Test Plan:
```
./db_stress -best_efforts_recovery=true -disable_wal=1 -reopen=0
./db_stress -best_efforts_recovery=true -disable_wal=0 -skip_verifydb=1 -verify_db_one_in=0 -continuous_verification_interval=0
make crash_test_with_best_efforts_recovery
```

Reviewed By: anand1976

Differential Revision: D21436753

Pulled By: riversand963

fbshipit-source-id: 0b3605c922a16c37ed17d5ab6682ca4240e47926
2020-06-12 19:27:46 -07:00
Zhichao Cao 2adb7e3768 Fix potential overflow of unsigned type in for loop (#6902)
Summary:
x.size() -1 or y - 1 can overflow to an extremely large value when x.size() pr y is 0 when they are unsigned type. The end condition of i in the for loop will be extremely large, potentially causes segment fault. Fix them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902

Test Plan: pass make asan_check

Reviewed By: ajkr

Differential Revision: D21843767

Pulled By: zhichao-cao

fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1
2020-06-02 15:05:07 -07:00
Andrew Kryczka b464a85e33 fix transaction rollback in db_stress TestMultiGet (#6873)
Summary:
There were further uses of `txn` after `RollbackTxn(txn)` leading to
stress test errors. Moved the rollback to the end of the function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6873

Test Plan:
found a command from the crash test that previously failed immediately under TSAN; verified now it succeeds.
```
./db_stress --acquire_snapshot_one_in=10000 --allow_concurrent_memtable_write=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=222.913637674 --bottommost_compression_type=none --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_style=1 --compaction_ttl=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_pipelined_write=0 --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=12 --index_type=2 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_levels=1 --open_files=100 --ops_per_thread=200000 --partition_filters=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --read_fault_one_in=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --snapshot_hold_ops=100000 --subcompactions=4 --sync=0 --sync_fault_injection=False --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --txn_write_policy=1 --unordered_write=1 --use_block_based_filter=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=1 --use_multiget=1 --use_txn=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35
```

Reviewed By: pdillinger

Differential Revision: D21708338

Pulled By: ajkr

fbshipit-source-id: dcf55cddee0a14f429a75e7a8a505acf8025f2b1
2020-05-24 15:27:24 -07:00
anand76 eb04bb86c6 Fix a bug in crash_test_with_txn (#6860)
Summary:
In NoBatchedOpsStress::TestMultiGet, call txn->Get() when transactions
are in use.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6860

Test Plan: make crash_test_with_txn

Reviewed By: pdillinger

Differential Revision: D21667249

Pulled By: anand1976

fbshipit-source-id: 194bd7b9630a8efc3ae29d85422a61214e9e200e
2020-05-20 14:47:05 -07:00
anand76 39b24432d4 Strengthen MultiGet correctness verification in NoBatchedOpsStress (#6849)
Summary:
Add MultiGet to VerifyDb and check consistency with Get in TestMultiGet.

Test plan -
make crash_test
ASAN crash test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6849

Reviewed By: pdillinger

Differential Revision: D21635011

Pulled By: anand1976

fbshipit-source-id: deb5a79d08fefd8d8010204f1f20b83adc92310e
2020-05-19 12:47:22 -07:00
Tongliang Liao 07204837ce Mark dependencies as PRIVATE and fix missing dependencies in tools. (#6790)
Summary:
Tools were mistakenly using leaked (PUBLIC) dependencies from main lib.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6790

Reviewed By: riversand963

Differential Revision: D21471551

fbshipit-source-id: ec43b92e231777e0fcf0f865444391af09d6963b
2020-05-12 21:07:55 -07:00
Andrew Kryczka 1f20df2f38 cover single level universal in crash test (#6818)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6818

Test Plan:
fast whitebox test and verify there are some single-level universal and
some multi-level universal runs.

```
$ python ./tools/db_crashtest.py whitebox --simple -max_key=1000000 -value_size_mult=33 -write_buffer_size=524288 -target_file_size_base=524288 -max_bytes_for_level_base=2097152 --duration=120 --interval=10 --ops_per_thread=1000 --random_kill_odd=887
```

Reviewed By: riversand963

Differential Revision: D21432138

Pulled By: ajkr

fbshipit-source-id: 2fc5ba9f3dfa49bb11e81da7dd00a17b476e64d7
2020-05-06 18:08:09 -07:00
Ziyue Yang e619a20e93 Add an option for parallel compression in for db_stress (#6722)
Summary:
This commit adds an `compression_parallel_threads` option in
db_stress. It also fixes the naming of parallel compression
option in db_bench to keep it aligned with others.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6722

Reviewed By: pdillinger

Differential Revision: D21091385

fbshipit-source-id: c9ba8c4e5cc327ff9e6094a6dc6a15fcff70f100
2020-04-30 10:49:07 -07:00
Cheng Chang 0a77617820 Disable O_DIRECT in stress test when db directory does not support direct IO (#6727)
Summary:
In crash test, the db directory might be set to /dev/shm or /tmp, in certain environments such as internal testing infrastructure, neither of these directories support direct IO, so direct IO is never enabled in crash test.

This PR sets up SyncPoints in direct IO related code paths to disable O_DIRECT flag in calls to `open`, so the direct IO code paths will be executed, all direct IO related assertions will be checked, but no real direct IO request will be issued to the file system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6727

Test Plan:
export CRASH_TEST_EXT_ARGS="--use_direct_reads=1 --mmap_read=0"
make -j24 crash_test

Reviewed By: zhichao-cao

Differential Revision: D21139250

Pulled By: cheng-chang

fbshipit-source-id: db9adfe78d91aa4759835b1af91c5db7b27b62ee
2020-04-25 00:01:03 -07:00
anand76 9e7b7e2c08 Silence false alarms in db_stress fault injection (#6741)
Summary:
False alarms are caused by codepaths that intentionally swallow IO
errors.

Tests:
make crash_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6741

Reviewed By: ltamasi

Differential Revision: D21181138

Pulled By: anand1976

fbshipit-source-id: 5ccfbc68eb192033488de6269e59c00f2c65ce00
2020-04-24 13:06:12 -07:00
sdong 73523baeb1 crash_test to cover options.avoid_flush_during_recovery (#6712)
Summary:
Options.avoid_flush_during_recovery is uncovered in crash_test. Add the coverage with a chance of 1/8, as it is a less frequently used options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6712

Test Plan: Run crash_test and see the option can be used or not used by chance.

Reviewed By: ltamasi

Differential Revision: D21056566

fbshipit-source-id: c3b1521517cfc204786e6ef8c6acd7fffda64793
2020-04-16 12:11:45 -07:00
Yueh-Hsuan Chiang 5801af4646 Add env_fault_injection argument to db_stress (#6687)
Summary:
Add env_fault_injection argument to db_stress.  When enabled,
FaultInjectionTestEnv will be used instead.  Currently this
option does not support running with other env setting.

This will allow
us to later manually produce error when running db_crashtest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6687

Test Plan:
make db_stress -j32
./db_stress --env_fault_injection
./db_stress --env_fault_injection --hdfs   // expect error message

Reviewed By: ajkr

Differential Revision: D21014683

Pulled By: yhchiang

fbshipit-source-id: 0724aeac37efd57adb72a37defe6dbd3bfa8106a
2020-04-16 11:13:44 -07:00
anand76 610a09ccff Remove a printf from db_stress that's not useful info (#6705)
Summary:
This was causing db_crashtest.py to wrongly assume an error by parsing the output. Hopefully this will stabilize the crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6705

Test Plan: make blackbox_crash_test

Reviewed By: ltamasi

Differential Revision: D21043335

Pulled By: anand1976

fbshipit-source-id: 5cddd112b124d4e2ebd11724a17d4ef0f50c1cf8
2020-04-15 12:13:35 -07:00
anand76 234e2ed5b6 Fix a couple of bugs in db_stress fault injection (#6700)
Summary:
1. Fix a memory leak in FaultInjectionTestFS in the stack trace related
code
2. Check status of all MultiGet keys before deciding whether an error
was swallowed, instead of assuming an ok status for any key means an
undetected error
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6700

Test Plan: Run db_stress with asan and fault injection

Reviewed By: cheng-chang

Differential Revision: D21021498

Pulled By: anand1976

fbshipit-source-id: 489191efd1ab0fa834923a1e1d57253a7a315465
2020-04-14 11:06:55 -07:00
anand76 5c19a441c4 Fault injection in db_stress (#6538)
Summary:
This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538

Test Plan:
crash_test
make check

Reviewed By: riversand963

Differential Revision: D20714347

Pulled By: anand1976

fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
2020-04-10 17:21:26 -07:00
Levi Tamasi 217ce20021 Remove GetSortedWalFiles/GetCurrentWalFile from the crash test (#6491)
Summary:
Currently, `db_stress` tests a randomly picked one of `GetLiveFiles`,
`GetSortedWalFiles`, and `GetCurrentWalFile` with a 1/N chance when the
command line parameter `get_live_files_and_wal_files_one_in` is specified.
The problem is that `GetSortedWalFiles` and `GetCurrentWalFile` are unreliable
in the sense that they can return errors if another thread removes a WAL file
while they are executing (which is a perfectly plausible and legitimate scenario).
The patch splits this command line parameter into three (one for each API),
and changes the crash test script so that only `GetLiveFiles` is tested during
our continuous crash test runs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6491

Test Plan:
```
make check
python tools/db_crashtest.py whitebox
```

Reviewed By: siying

Differential Revision: D20312200

Pulled By: ltamasi

fbshipit-source-id: e7c3481eddfe3bd3d5349476e34abc9eee5b7dc8
2020-03-18 17:14:15 -07:00
Yanqin Jin 58918d4ccc Use correct Env for DestroyDB in stress test (#6539)
Summary:
When using custom Env, trying to call DestroyDB() with default Options will
fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6539

Test Plan: ./db_stress

Differential Revision: D20476204

Pulled By: riversand963

fbshipit-source-id: 612c6754660cc9b5bb3e9c2dbb2f6ecd7f648797
2020-03-16 16:57:48 -07:00
Andrew Kryczka f52db84650 support SstFileManager in db_stress (#6454)
Summary:
Add some flags for configuring an SstFileManager. An
SstFileManager is only created when one or more of these flags are
set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6454

Test Plan:
- ran it a while:

```
$ python ./tools/db_crashtest.py blackbox --simple -max_key=100000 -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -value_size_mult=33 --interval=10 -max_background_compactions=4 -max_background_flushes=2 -sst_file_manager_bytes_per_sec=1048576
```

- verified with strace the SstFileManager is behaving as configured:

```
$ strace -fp `pidof db_stress` -e ftruncate,unlink
...
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
67423) = 0
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
51039) = 0
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
34655) = 0
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
18271) = 0
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
1887) = 0
[pid 3074805]
unlink("/tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash") = 0
...
```

Differential Revision: D20103315

Pulled By: ajkr

fbshipit-source-id: b3e1092747157459d244b047947a979b85c98f48
2020-02-25 16:45:30 -08:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Maysam Yabandeh 01ab882ba3 Fix release warning for unused bg_canceled
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6357

Differential Revision: D19670931

Pulled By: maysamyabandeh

fbshipit-source-id: d528c4c7f9450f1f38b9d2a36e0d5d0865b39be9
2020-01-31 15:09:10 -08:00
Maysam Yabandeh 2243030bc5 Cancel bg jobs before deleting WritePrepared DB in stress tests (#6355)
Summary:
Background jobs in WritePrepared DB might access the db via a snapshot checker callback. The stress tests therefore should cancel background jobs before deleting the db in ::Reopen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6355

Differential Revision: D19664132

Pulled By: maysamyabandeh

fbshipit-source-id: 6060a830e8aad0015c10448286ad37c8a346ac01
2020-01-31 10:29:11 -08:00
Burton Li c9a5e48762 fix build warnnings on MSVC (#6309)
Summary:
Fix build warnings on MSVC. siying
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6309

Differential Revision: D19455012

Pulled By: ltamasi

fbshipit-source-id: 940739f2c92de60e47cc2bed8dd7f921459545a9
2020-01-30 16:07:26 -08:00
sdong 8f2bee6747 Add ReadOptions.auto_prefix_mode (#6314)
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314

Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.

Differential Revision: D19458717

fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
2020-01-28 14:44:05 -08:00
anand76 687119aeaf Variable key length in db_stress (#6273)
Summary:
Undo https://github.com/facebook/rocksdb/issues/6243 and fix the crash test failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6273

Test Plan: Run make ubsan_crash_test

Differential Revision: D19331472

Pulled By: anand1976

fbshipit-source-id: 30aa4a36c1b0f77a97159d82bbfd1cd767878e28
2020-01-09 21:27:18 -08:00
sdong 1244abef66 Stress Test: relax prefix iterator check condition (#6269)
Summary:
Right now, when validating prefix iterator, if control iterator is invalidate but prefix iterator shows value, we determine it as a test failure. However, this fails to consider the case where a file or memtable containing a tombstone is filtered out by a prefix bloom filter. The fix is to relax the check in this case. If we are out of prefix range, then ignore the check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6269

Test Plan: Run crash_test for a short while and it still passes.

Differential Revision: D19317594

fbshipit-source-id: b964a1cdc1df5efe439d4b32f8023e1fbc8598c1
2020-01-08 13:32:06 -08:00
Maysam Yabandeh f4a378be3e Print out non-ok DB::Open status in db_stress (#6272)
Summary:
The crash test is failing with non-ok status after TransactionDB::Open. This patch adds more debugging information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6272

Differential Revision: D19314527

Pulled By: maysamyabandeh

fbshipit-source-id: d45ecb0f2144e052fb4b5fdd483150440991a3b4
2020-01-08 12:10:55 -08:00
Maysam Yabandeh 83957dc510 Exclude MergeInProgress status from errors in stress tests (#6257)
Summary:
When called on transactions, MultiGet could return a legit MergeInProgress status. The patch excludes this case from errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6257

Differential Revision: D19275787

Pulled By: maysamyabandeh

fbshipit-source-id: f7158229422af015947e592ae066b4273c9fb9a4
2020-01-03 13:07:35 -08:00
Maysam Yabandeh 7c98d71567 Print before AddErrors in stress tests (#6256)
Summary:
Stress tests count number of errors and report them at the end. Not all the cases are accompanied with a log line which makes debugging difficult. The patch adds a log line to the remaining cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6256

Differential Revision: D19268785

Pulled By: maysamyabandeh

fbshipit-source-id: bdabcaa5c5c7edcb4ce4f25e38fd8a3fd9c7700b
2020-01-02 16:46:23 -08:00
Maysam Yabandeh 48a678b7c9 Prevent an incompatible combination of options (#6254)
Summary:
allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254

Differential Revision: D19265819

Pulled By: maysamyabandeh

fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179
2020-01-02 16:15:06 -08:00
Peter Dillinger 95d226d8f5 Fix a clang analyzer report, and 'analyze' make rule (#6244)
Summary:
Clang analyzer was falsely reporting on use of txn=nullptr.
Added a new const variable so that it can properly prune impossible
control flows.

Also, 'make analyze' previously required setting USE_CLANG=1 as an
environment variable, not a make variable, or else compilation errors
like

g++: error: unrecognized command line option ‘-Wshorten-64-to-32’

Now USE_CLANG is not required for 'make analyze' (it's implied) and you
can do an incremental analysis (recompile what has changed) with
'USE_CLANG=1 make analyze_incremental'
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6244

Test Plan: 'make -j24 analyze', 'make crash_test'

Differential Revision: D19225950

Pulled By: pdillinger

fbshipit-source-id: 14f4039aa552228826a2de62b2671450e0fed3cb
2019-12-24 18:46:40 -08:00
Peter Dillinger 37fd2b9694 Revert "Generate variable length keys in db_stress (#6165)" and follow-ups (#6243)
Summary:
This commit is suspected in some crash test failures such as

Verification failed for column family 0 key 78438077: Value not found: NotFound:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6243

Test Plan: 'make check' and start 'make crash_test'

Differential Revision: D19220495

Pulled By: pdillinger

fbshipit-source-id: 6c4709cee80ab4344e06ce360f51e947d79fb3fa
2019-12-23 16:32:57 -08:00
Peter Dillinger 5f559897cf Disable occasionally failing assertion in TestPrefixScan (#6238)
Summary:
Seeing crash test failures like

db_stress: db_stress_tool/no_batched_ops_stress.cc:271: virtual
rocksdb::Status
rocksdb::NonBatchedOpsStressTest::TestPrefixScan(rocksdb::ThreadState*,
const rocksdb::ReadOptions&, const std::vector<int>&, const
std::vector<long int>&): Assertion `count <=
GetPrefixKeyCount(prefix.ToString(), upper_bound)' failed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6238

Differential Revision: D19210312

Pulled By: pdillinger

fbshipit-source-id: 4d2c35c38f418b408e01c7ba22adf6983ae67d44
2019-12-21 21:12:11 -08:00
Peter Dillinger 22fea0ba79 Fix unused variable in release build
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6237

Differential Revision: D19210304

Pulled By: pdillinger

fbshipit-source-id: f6f050e995f4d210f812bb1d2020adbd751e1d5a
2019-12-21 20:58:30 -08:00
anand76 d4da412864 Add Transaction::MultiGet to db_stress (#6227)
Summary:
Call Transaction::MultiGet from TestMultiGet() in db_stress. We add some Puts/Merges/Deletes into the transaction in order to exercise MultiGetFromBatchAndDB. There is no data verification on read, just check status. There is currently no read data verification in db_stress as it requires synchronization with writes. It needs to be tackled separately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6227

Test Plan: make crash_test_with_txn

Differential Revision: D19204611

Pulled By: anand1976

fbshipit-source-id: 770d0e30d002e88626c264c58103f1d709bb060c
2019-12-20 23:12:51 -08:00
sdong e0f9d11a05 db_stress should not keep manifest files under checkpoint directory (#6233)
Summary:
Recently db_stress starts to use a special Env that keeps all manifest files. This should not apply to checkpoint directory and causes test failure like this:

Verification failed: Checkpoint gave inconsistent state. Status is IO error: While mkdir: /dev/shm/rocksdb/rocksdb_crashtest_whitebox/.checkpoint27.tmp: File exists
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6233

Test Plan: Run crash_test with high chance of checkpoint and make sure it doesn't reproduce.

Differential Revision: D19207250

fbshipit-source-id: 12a931379e2e0572bb84aa658b6d03770c8551d4
2019-12-20 22:10:06 -08:00
sdong 9d36c066c6 db_stress: listners to implement all functions (#6197)
Summary:
Listners are one source of bugs because we frequently release some mutex to invoke them, which introduce race conditions. Implement all callback functions in db_stress's listener class, and randomly sleep.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6197

Test Plan: Run crash_test for a while and see no obvious problem.

Differential Revision: D19134015

fbshipit-source-id: b9ea8be9366e4501759119520cd4f204943538f6
2019-12-20 21:47:06 -08:00
sdong 79cc8dc29b db_stress: cover approximate size (#6213)
Summary:
db_stress to execute DB::GetApproximateSizes() with randomized keys and options. Return value is not validated but error will be reported.
Two ways to generate the range keys: (1) two random keys; (2) a small range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6213

Test Plan: (1) run "make crash_test" for a while; (2) hack the code to ingest some errors to see it is reported.

Differential Revision: D19204665

fbshipit-source-id: 652db36f13bcb5a3bd8fe4a10c0aa22a77a0bce2
2019-12-20 21:43:35 -08:00
anand76 3160edfdc7 Generate variable length keys in db_stress (#6165)
Summary:
Currently, db_stress generates fixed length keys of 8 bytes. This patch adds the ability to generate variable length keys. Most of the db_stress code continues to work with a numeric key randomly generated, and the numeric key also acts as an index into the values_ array. The numeric key is mapped to a variable length string key in a deterministic way. Furthermore, the ordering is preserved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6165

Test Plan: run make crash_test

Differential Revision: D19204646

Pulled By: anand1976

fbshipit-source-id: d2d46a96615b4832a8be2a981f5913905f0e1ca7
2019-12-20 21:10:33 -08:00
sdong 338c149b92 crash_test to cover bottommost compression and some other changes (#6215)
Summary:
Several improvements to crash_test/stress_test:
(1) Stress_test to support an parameter of bottommost compression
(2) Rename those FLAGS_* variables that are not gflags to avoid confusion
(3) Crash_test to randomly generate compression type for bottommost compression with half the chance.
(4) Stress_test to sanitize unsupported compression type to snappy, so that crash_test to cover all possible compression types and people don't need to worry about they don't support all comrpession types in their environment.
(5) In crash_test, when generating db_stress command, sort arguments in alphabeta order, so that it is easier to find value for a specific argument.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6215

Test Plan: Run "make crash_test" for a while and see the botommost option shown in LOG files.

Differential Revision: D19171255

fbshipit-source-id: d7001e246c4ff9ee5760776eea0be97738650735
2019-12-20 16:14:52 -08:00
sdong e55c2b3f0b db_stress: improvements in TestIterator (#6166)
Summary:
1. Cover SeekToFirst() and SeekToLast().
2. Try to record the history of iterator operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6166

Test Plan: Do some manual changes in the code to cover the failure cases and see the error printing is correct and SeekToFirst() and SeekToLast() sometimes show up.

Differential Revision: D19047079

fbshipit-source-id: 1ed616f919fe4d32c0a021fc37932a7bd3063bcd
2019-12-20 14:56:15 -08:00
Zhichao Cao f89dea4fec db_stress: Added the verification for GetLiveFiles, GetSortedWalFiles and GetCurrentWalFile (#6224)
Summary:
Add the verification in operateDB to verify GetLiveFiles, GetSortedWalFiles and GetCurrentWalFile. The test will be called every 1 out of N, N is decided by get_live_files_and_wal_files_one_i, whose default is 1000000.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6224

Test Plan: pass db_stress default run.

Differential Revision: D19183358

Pulled By: zhichao-cao

fbshipit-source-id: 20073cf72ede77a3e0d3cf5f28304f1f605d2b1a
2019-12-20 12:07:30 -08:00
Yanqin Jin c4fd9cf461 Remove an unnecessary check before running db_stress (#6231)
Summary:
As title. We can run non-cf-consistency stress tests with verify_db_one_in>0,
thus remove the check added previously.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6231

Test Plan:
```
make crash_test
```

Differential Revision: D19198295

Pulled By: riversand963

fbshipit-source-id: e874c701bb03ab76eaab00f059dd4032bb2f537f
2019-12-20 11:29:06 -08:00
Levi Tamasi 786c3d45ed Support BlobDB in db_stress (#6230)
Summary:
The patch adds support for BlobDB to `db_stress`. Note that BlobDB currently does
not support (amongst other features) Column Families or the `SingleDelete` API,
so for now, those should be disabled on the command line when running `db_stress` in
BlobDB mode (using `-column_families=1` and `-nooverwritepercent=0`,
respectively). Also, some BlobDB features that do not go well with the verification logic
in `db_stress` like TTL and FIFO eviction are not supported currently.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6230

Test Plan:
```
./db_stress -max_key=100000 -use_blob_db -column_families=1 -nooverwritepercent=0 -reopen=1 -blob_db_file_size=1000000 -target_file_size_base=1000000 -blob_db_enable_gc -blob_db_gc_cutoff=0.1 -blob_db_min_blob_size=10 -blob_db_bytes_per_sync=16384
```

Differential Revision: D19191476

Pulled By: ltamasi

fbshipit-source-id: 35840452af8c5e6095249c7fd9a53a119a0985fc
2019-12-20 10:27:56 -08:00
Yanqin Jin 670a916d01 Add more verification to db_stress (#6173)
Summary:
Currently, db_stress performs verification by calling `VerifyDb()` at the end of test and optionally before tests start. In case of corruption or incorrect result, it will be too late. This PR adds more verification in two ways.
1. For cf consistency test, each test thread takes a snapshot and verifies every N ops. N is configurable via `-verify_db_one_in`. This option is not supported in other stress tests.
2. For cf consistency test, we use another background thread in which a secondary instance periodically tails the primary (interval is configurable). We verify the secondary. Once an error is detected, we terminate the test and report. This does not affect other stress tests.

Test plan (devserver)
```
$./db_stress -test_cf_consistency -verify_db_one_in=0 -ops_per_thread=100000 -continuous_verification_interval=100
$./db_stress -test_cf_consistency -verify_db_one_in=1000 -ops_per_thread=10000 -continuous_verification_interval=0
$make crash_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6173

Differential Revision: D19047367

Pulled By: riversand963

fbshipit-source-id: aeed584ad71f9310c111445f34975e5ab47a0615
2019-12-20 08:49:29 -08:00
Peter Dillinger 873331fe49 Refactor pulling out parts of StressTest::OperateDb (#6195)
Summary:
Complete some refactoring called for in https://github.com/facebook/rocksdb/issues/6148. Somehow I got some 'make format' in here for files I didn't change, but that should be OK. (I'm not sure why "hide whitespace changes" doesn't seem to help in review.)

Not addressed in this PR: some operations simply print to stdout rather than failing on discovering a bad status or inconsistency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6195

Differential Revision: D19131067

Pulled By: pdillinger

fbshipit-source-id: 4f416e6b792023989ef119f385fe122426cb825d
2019-12-19 14:04:49 -08:00
anand76 2afea29762 Add VerifyChecksum() to db_stress (#6203)
Summary:
Add an option to db_stress, verify_checksum_one_in, to call DB::VerifyChecksum() once every N ops.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6203

Differential Revision: D19145753

Pulled By: anand1976

fbshipit-source-id: d09edf21f309ad53aa40dd25b7a563d50665fd8b
2019-12-17 20:44:58 -08:00
Maysam Yabandeh 68d5d82d1f Wait for CancelAllBackgroundWork before Close in db stress (#6191)
Summary:
In https://github.com/facebook/rocksdb/issues/6174 we fixed the stress test to respect the CancelAllBackgroundWork + Close order for WritePrepared transactions. The fix missed to take into account that some invocation of CancelAllBackgroundWork are with wait=false parameter which essentially breaks the order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6191

Differential Revision: D19102709

Pulled By: maysamyabandeh

fbshipit-source-id: f4e7b5fdae47ff1c1ac284ba1cf67d5d3f3d03eb
2019-12-16 18:33:09 -08:00
sdong bcc372c0c3 Add some new options to crash_test (#6176)
Summary:
Several options are trivially added to crash test and random values are picked.
Made simple test run non-dynamic level and normal test run dynamic level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6176

Test Plan: Run crash_test and watch the printing

Differential Revision: D19053955

fbshipit-source-id: 958cb43c968541ebd87ed4d91e778bd1d40e7502
2019-12-16 15:43:13 -08:00
sdong 35126dd874 db_stress: preserve all historic manifest files (#6142)
Summary:
compaction history is stored in manifest files. Preserve all of them in db_stress would help debugging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6142

Test Plan: Run db_stress and observe that manifest files are preserved. Run whole crash_test and see how DB directory looks like.

Differential Revision: D19047026

fbshipit-source-id: f83c3e0bb5332b1b4768be5dcee56a24f9b760a9
2019-12-16 14:32:34 -08:00
Zhichao Cao fbda25f57a db_stress: generate the key based on Zipfian distribution (hot key) (#6163)
Summary:
In the current db_stress, all the keys are generated randomly and follows the uniform distribution. In order to test some corner cases that some key are always updated or read, we need to generate the key based on other distributions. In this PR, the key is generated based on Zipfian distribution and the skewness can be controlled by setting hot_key_alpha (0.8 to 1.5 is suggested). The larger hot_key_alpha is, the more skewed will be. Not that, usually, if hot_key_alpha is larger than 2, there might be only 1 or 2 keys that are generated. If hot_key_alpha is 0, it generate the key follows uniform distribution (random key)

Testing plan: pass the db_stress and printed the keys to make sure it follows the distribution.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6163

Differential Revision: D18978480

Pulled By: zhichao-cao

fbshipit-source-id: e123b4865477f7478e83fb581f9576bada334680
2019-12-16 14:01:58 -08:00
Maysam Yabandeh 4b97812da8 Add long-running snapshots to stress tests (#6171)
Summary:
Current implementation holds on to 10% of snapshots for 10x longer, and 1% of snapshots 100x longer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6171

Test Plan:
```
make -j32 crash_test

Differential Revision: D19038399

Pulled By: maysamyabandeh

fbshipit-source-id: 75da2dbb5c47a0b3f37d299b8719e392b73b42c0
2019-12-14 15:22:40 -08:00
Maysam Yabandeh 349bd3ed82 CancelAllBackgroundWork before Close in db stress (#6174)
Summary:
Close asserts that there is no unreleased snapshots. For WritePrepared transaction, this means that the background work that holds on a snapshot must be canceled first. Update the stress tests to respect the sequence.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6174

Test Plan:
```
make -j32 crash_test

Differential Revision: D19057322

Pulled By: maysamyabandeh

fbshipit-source-id: c9e9e24f779bbfb0ab72c2717e34576c01bc6362
2019-12-13 18:22:50 -08:00
Peter Dillinger 58d46d1915 Add useful idioms to Random API (OneInOpt, PercentTrue) (#6154)
Summary:
And clean up related code, especially in stress test.

(More clean up of db_stress_test_base.cc coming after this.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154

Test Plan: make check, make blackbox_crash_test for a bit

Differential Revision: D18938180

Pulled By: pdillinger

fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e
2019-12-13 14:30:14 -08:00
Kefu Chai ac304adf46 cmake: do not build tests for Release build and cleanups (#5916)
Summary:
fixes https://github.com/facebook/rocksdb/issues/2445
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5916

Differential Revision: D19031236

fbshipit-source-id: bc3107b6b25a01958677d7cb411b1f381aae91c6
2019-12-13 12:48:06 -08:00
Maysam Yabandeh fec7302a9d Enable unordered_write in stress tests (#6164)
Summary:
With WritePrepared transactions configured with two_write_queues, unordered_write will offer the same guarantees as vanilla rocksdb and thus can be enabled in stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6164

Test Plan:
```
make -j32 crash_test_with_txn

Differential Revision: D18991899

Pulled By: maysamyabandeh

fbshipit-source-id: eece5e96b4169b67d7931e5c0afca88540a113e1
2019-12-13 10:25:04 -08:00
Maysam Yabandeh 8613ee2e94 Enable all txn write policies in crash test (#6158)
Summary:
Currently the default txn write policy in crash tests is WRITE_PREPARED. The patch randomly picks the write policy at the start of the crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6158

Test Plan:
```
make -j32 crash_test_with_txn
```

Differential Revision: D18946307

Pulled By: maysamyabandeh

fbshipit-source-id: f77d7a94f99a08791ef9626da153d284bf521950
2019-12-12 10:43:49 -08:00
Yanqin Jin 383f5071f0 Add SyncWAL to db_stress (#6149)
Summary:
Add SyncWAL to db_stress. Specify with `-sync_wal_one_in=N` so that it will be
called once every N operations on average.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6149

Test Plan:
```
$make db_stress
$./db_stress -sync_wal_one_in=100 -ops_per_thread=100000
```

Differential Revision: D18922529

Pulled By: riversand963

fbshipit-source-id: 4c0b8cb8fa21852722cffd957deddf688f12ea56
2019-12-10 21:55:25 -08:00
sdong 7a99162a74 db_stress: sometimes call CancelAllBackgroundWork() and Close() before closing DB (#6141)
Summary:
CancelAllBackgroundWork() and Close() are frequently used features but we don't cover it in stress test. Simply execute them before closing the DB with 1/2 chance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6141

Test Plan: Run "db_stress".

Differential Revision: D18900861

fbshipit-source-id: 49b46ccfae120d0f9de3e0543b82fb6d715949d0
2019-12-10 20:04:52 -08:00
Peter Dillinger a653857178 Add PauseBackgroundWork() to db_stress (#6148)
Summary:
Worker thread will occasionally call PauseBackgroundWork(),
briefly sleep (to avoid stalling itself) and then call
ContinueBackgroundWork().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6148

Test Plan:
some running of 'make blackbox_crash_test' with temporary
printf output to confirm code occasionally reached.

Differential Revision: D18913886

Pulled By: pdillinger

fbshipit-source-id: ae9356a803390929f3165dfb6a00194692ba92be
2019-12-10 15:46:48 -08:00
Adam Simpkins 2bb5fc1280 Add an option to the CMake build to disable building shared libraries (#6122)
Summary:
Add an option to explicitly disable building shared versions of the
RocksDB libraries.  The shared libraries cannot be built in cases where
some dependencies are only available as static libraries.  This allows
still building RocksDB in these situations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6122

Differential Revision: D18920740

fbshipit-source-id: d24f66d93c68a1e65635e6e0b663bae62c903bca
2019-12-10 15:20:50 -08:00
sdong 14c38baca0 db_stress: sometimes validate compact range data (#6140)
Summary:
Right now, in db_stress, compact range is simply executed without any immediate data validation. Add a simply validation which compares hash for all keys within the compact range to stay the same against the same snapshot before and after the compaction.

Also, randomly tune most knobs of CompactRangeOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6140

Test Plan: Run db_stress with "--compact_range_one_in=2000 --compact_range_width=100000000" for a while. Manually ingest some hacky code and observe the error path.

Differential Revision: D18900230

fbshipit-source-id: d96e75bc8c38dd5ec702571ffe7cf5f4ea93ee10
2019-12-10 11:41:50 -08:00
Peter Dillinger 6380df5e10 Vary bloom_bits in db_crashtest (#6103)
Summary:
Especially with non-integral bits/key now supported,
db_crashtest should vary the bloom_bits configuration. The probabilities
look like this:

1/2 chance of a uniform int from 0 to 19. This includes overall 1/40
chance of 0 which disables the bloom filter.

1/2 chance of a float from a lognormal distribution with a median of 10.
This always produces positive values but with a decent chance of < 1
(overall ~1/40) or > 100 (overall ~1/40), the enforced/coerced
implementation limits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6103

Test Plan:
start 'make blackbox_crash_test' several times and look at
configuration output

Differential Revision: D18734877

Pulled By: pdillinger

fbshipit-source-id: 4a38cb057d3b3fc1327f93199f65b9a9ffbd7316
2019-12-10 08:39:50 -08:00
sdong a960287dee db_stress: Some code style improvements (#6137)
Summary:
Two changes:
1. Prevent static variables in a header file
2. Add "override" keyword when virtual functions are overridden.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6137

Test Plan: Build db_stress with or without LITE.

Differential Revision: D18892007

fbshipit-source-id: 295356427a34473b23ed36d6ed4ef3ae35a32db0
2019-12-09 14:38:42 -08:00
sdong 7d79b32618 Break db_stress_tool.cc to a list of source files (#6134)
Summary:
db_stress_tool.cc now is a giant file. In order to main it easier to improve and maintain, break it down to multiple source files.
Most classes are turned into their own files. Separate .h and .cc files are created for gflag definiations. Another .h and .cc files are created for some common functions. Some test execution logic that is only loosely related to class StressTest is moved to db_stress_driver.h and db_stress_driver.cc. All the files are located under db_stress_tool/. The directory name is created as such because if we end it with either stress or test, .gitignore will ignore any file under it and makes it prone to issues in developements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6134

Test Plan: Build under GCC7 with and without LITE on using GNU Make. Build with GCC 4.8. Build with cmake with -DWITH_TOOL=1

Differential Revision: D18876064

fbshipit-source-id: b25d0a7451840f31ac0f5ebb0068785f783fdf7d
2019-12-08 23:51:01 -08:00