Summary:
deleterandom tests are too fast to get good signal, e.g.
--deletes=31250 in 0.170 seconds vs. --reads=1500000 in 288.491
seconds for readrandom. Removing the special handling (unknown
motivation in faa7eb3b99) should suffice.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10437
Test Plan: watch continuous results
Reviewed By: ltamasi
Differential Revision: D38261185
Pulled By: pdillinger
fbshipit-source-id: 0f1b1b19efccda5689027d36cc2f01307f36031d
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309
Reviewed By: ltamasi
Differential Revision: D38211655
Pulled By: gangliao
fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
Summary:
In this PR we bring ClockCache closer to production quality. We implement the following changes:
1. Fixed a few bugs in ClockCache.
2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where either the LRU policy is required, and a test that dynamically increases the table capacity.
4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.
As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418
Test Plan:
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
Reviewed By: pdillinger
Differential Revision: D38170673
Pulled By: guidotag
fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
Summary:
If user runs `db_bench` with `-perf_level=2` or higher, db_bench should
print perf context after each of all benchmarks.
Or make `-perf_level` a per-benchmark switch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10396
Test Plan: ./db_bench -benchmarks=fillseq,readseq -perf_level=2
Reviewed By: ajkr
Differential Revision: D38016324
Pulled By: riversand963
fbshipit-source-id: d83ea4abc34d40ffea394ca6abf0814bc5c0a2e0
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321
Reviewed By: ltamasi
Differential Revision: D37913590
Pulled By: gangliao
fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
Summary:
After branch 7.5.fb branch is cut, following release process, upgrade version number to 7.6 and add 7.5.fb to format compatibility check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10376
Test Plan: Watch CI
Reviewed By: ajkr
Differential Revision: D37927694
fbshipit-source-id: 71b37ae55ebb7c95a1bcc0d7eee643d6ba5f8461
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.
The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.
A new version of the NewClockCache function was created for our internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351
Test Plan: ``make -j24 check`` and re-run the pre-release tests.
Reviewed By: pdillinger
Differential Revision: D37802685
Pulled By: guidotag
fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
Summary:
as title.
Test plan
- make check
- CI on PR
- TEST_TMPDIR=/dev/shm make crash_test_with_multiops_wp_txn (tested with successful run)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10350
Reviewed By: ajkr
Differential Revision: D37792872
Pulled By: riversand963
fbshipit-source-id: ff064093b7f715d0acf387af2e3ae87b1278b52b
Summary:
Previously the version was displayed as $major.$minor
This changes it to $major.$minor.$path
This also adds the git hash for the time from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
who consume/graph these results are OK with it.
Example output:
ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id githash
609488 244.1 1GB 0.0GB, 1.4 0.7 93.3 39 38 0 0 1.6 1.0 4 15 26 5365 15 0.0 0 0.1 0.0 0.5 fillseq.wal_disabled.v400 2022-06-29T13:36:05 7.5.0 6115254416
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277
Test Plan: Run it
Reviewed By: jay-zhuang
Differential Revision: D37532418
Pulled By: mdcallag
fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
Summary:
This adds --undefok to support use of this script with BlobDB for db_bench versions prior
to 7.5 when the options land in a release.
While there is a limit to how far back this script can go WRT backwards compatiblity,
this is an easy change to support early 7.x releases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276
Test Plan: Run it with versions of db_bench that do not and then do support these options
Reviewed By: gangliao
Differential Revision: D37529299
Pulled By: mdcallag
fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
Summary:
Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter.
This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of
the presence of timestamps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259
Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts
Reviewed By: ajkr
Differential Revision: D37459692
Pulled By: riversand963
fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef
Summary:
Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
Reviewed By: riversand963
Differential Revision: D37432948
Pulled By: ajkr
fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
Summary:
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202
Reviewed By: ltamasi
Differential Revision: D37325739
Pulled By: gangliao
fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.
Fixes https://github.com/facebook/rocksdb/issues/10186
I intend to pick this into the 7.4.0 release candidate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220
Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.
Reviewed By: anand1976
Differential Revision: D37315647
Pulled By: pdillinger
fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
Summary:
In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles.
Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37296770
Pulled By: guidotag
fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.
Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
Reviewed By: ltamasi
Differential Revision: D37294735
Pulled By: gangliao
fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
- Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
Reviewed By: ajkr, anand1976
Differential Revision: D37135172
Pulled By: hx235
fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
Test Plan:
- Manual
- Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
- Enabled in MyShadow for 1+ week
- Automated
- Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
- Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
Reviewed By: cbi42
Differential Revision: D36614569
Pulled By: ajkr
fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).
This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).
Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB
Reviewed By: riversand963
Differential Revision: D37212647
Pulled By: pdillinger
fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
Summary:
Fix existing usage of non-ASCII and add a check to prevent
future use. Added `-n` option to greps to provide line numbers.
Alternative to https://github.com/facebook/rocksdb/issues/10147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164
Test Plan:
used new checker to find & fix cases, manually check
db_bench output is preserved
Reviewed By: akankshamahajan15
Differential Revision: D37148792
Pulled By: pdillinger
fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7
Summary:
There is `Options::allow_data_in_errors` that controls whether RocksDB
is allowed to log data, e.g. key, value, etc in LOG files. It is false
by default. However, in db_bench and db_stress, it is often ok to log
data because there is no concern about privacy.
This PR allows db_stress and db_bench to set this option on the command
line, while it remains false by default. Furthermore, make
crash/recovery test driven by db_crashtest.py to opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171
Test Plan: Stress test and db_bench
Reviewed By: hx235
Differential Revision: D37163787
Pulled By: riversand963
fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8
Summary:
We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.
Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154
Test Plan: `make -j24 check`
Reviewed By: pdillinger
Differential Revision: D37124451
Pulled By: guidotag
fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
Summary:
See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move
isn't done for universal when compaction is from L0 into L0. So a too small value for
num_levels with db_bench means there will be fewer trivial moves with universal and
that means that write-amp will increase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158
Test Plan: run it
Reviewed By: siying
Differential Revision: D37122519
Pulled By: mdcallag
fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.
This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```
If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.
Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();
// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
std::shared_ptr<TransactionNotifier> notifier,
TxnTimestamp ts,
std::shared_ptr<const Snapshot>* ret);
// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```
The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```
We also provide two additional APIs for stats collection and reporting purposes.
```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
TxnTimestamp ts_lb,
TxnTimestamp ts_ub,
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```
To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```
Before shutdown, RocksDB will release all timestamped snapshots.
Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.
Timestamped snapshots can also be used with user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```
Reviewed By: siying
Differential Revision: D35783919
Pulled By: riversand963
fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
Summary:
A recent diff add a few more fields to one of the db_bench output lines that gets parsed.
This diff updates tools/benchmark.sh to handle that.
overwrite : 7.939 micros/op 125963 ops/sec; 50.5 MB/s
overwrite : 7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations; 51.0 MB/s
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124
Test Plan: Run it
Reviewed By: jay-zhuang
Differential Revision: D36945137
Pulled By: mdcallag
fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa
Summary:
This PR updates secondary instance testing in stress test by default.
A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary.
Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan
to read some random keys with only very basic verification to make sure no assertion failure is triggered.
Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled.
Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic
catch-up.
In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not
work well when the underlying file is renamed by primary, e.g. SstFileManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush
```
Reviewed By: ajkr
Differential Revision: D36939458
Pulled By: riversand963
fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e
Summary:
The patch attempts to fix three bugs in `verify_random_db.sh`:
1) https://github.com/facebook/rocksdb/pull/9937 changed the default for
`--try_load_options` to true in the script's use case, so we have to
explicitly set it to false if the corresponding argument of the script
is 0. This should fix the issue we've been seeing with our forward
compatibility tests where 7.3 is unable to open a database created by
the version on main after adding a new configuration option.
2) The script seems to support two "extra parameters"; however,
in practice, if the second one was set, only that one was passed on to
`ldb`. Now both get forwarded.
3) When running the `diff` command, the base DB directory was passed as
the second argument instead of the file containing the `ldb` output
(this actually seems to work, probably accidentally though).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10112
Reviewed By: pdillinger
Differential Revision: D36911363
Pulled By: ltamasi
fbshipit-source-id: fe29db4e28d373cee51a12322c59050fc50e926d
Summary:
db_bench can now run with FastLRUCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10096
Test Plan:
- Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 db_bench``. Then test the appropriate code is used by running ``./db_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache.
- Verify that FastLRUCache (currently a clone of LRUCache) produces similar benchmark data than LRUCache, by comparing the outputs of ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=fast_lru_cache`` and ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=lru_cache``.
Reviewed By: gitbw95
Differential Revision: D36898774
Pulled By: guidotag
fbshipit-source-id: f9f6b6f6da124f88b21b3c8dee742fbb04eff773
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
Reviewed By: ltamasi
Differential Revision: D36884156
Pulled By: gangliao
fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
Reviewed By: ajkr, hx235
Differential Revision: D36722135
Pulled By: littlepig2013
fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
Summary:
Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache.
Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081
Test Plan:
- To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`.
- To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`.
Reviewed By: anand1976
Differential Revision: D36839908
Pulled By: guidotag
fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725
Summary:
Thanks to https://github.com/facebook/rocksdb/issues/9919 and https://github.com/facebook/rocksdb/issues/10051 the known bugs in file ingestion (besides mmap read + file checksum) are fixed. Now we can try again to enable file ingestion in crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9357
Test Plan: stress file ingestion heavily for an hour: `$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=1000000 --ingest_external_file_one_in=100 --duration=3600 --interval=20 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152`
Reviewed By: riversand963
Differential Revision: D33410746
Pulled By: ajkr
fbshipit-source-id: d276431390995a67f68390d61c06a40945fdd280
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10053
Need to exit if ldb command fails, to avoid running db_bench on
empty/bad DB and considering the results valid.
Reviewed By: jay-zhuang
Differential Revision: D36673200
fbshipit-source-id: e0d78a0d397e0e335d82d9349bfd612d38ffb552
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
- Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
- Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
- backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
- restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
# Restore
#./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```
Reviewed By: hx235
Differential Revision: D36327418
Pulled By: cbi42
fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
Summary:
Essentially refactored the RangeMayExist implementation in
FullFilterBlockReader to FilterBlockReaderCommon so that it applies to
partitioned filters as well. (The function is not called for the
block-based filter case.) RangeMayExist is essentially a series of checks
around a possible PrefixMayExist, and I'm confident those checks should
be the same for partitioned as for full filters. (I think it's likely
that bugs remain in those checks, but this change is overall a simplifying
one.)
Added auto_prefix_mode support to db_bench
Other small fixes as well
Fixes https://github.com/facebook/rocksdb/issues/10003
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10012
Test Plan:
Expanded unit test that uses statistics to check for filter
optimization, fails without the production code changes here
Performance: populate two DBs with
```
TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters
```
Observe no measurable change in non-partitioned performance
```
TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
```
Before: seekrandom [AVG 15 runs] : 11798 (± 331) ops/sec
After: seekrandom [AVG 15 runs] : 11724 (± 315) ops/sec
Observe big improvement with partitioned (also supported by bloom use statistics)
```
TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
```
Before: seekrandom [AVG 12 runs] : 2942 (± 57) ops/sec
After: seekrandom [AVG 12 runs] : 7489 (± 184) ops/sec
Reviewed By: siying
Differential Revision: D36469796
Pulled By: pdillinger
fbshipit-source-id: bcf1e2a68d347b32adb2b27384f945434e7a266d
Summary:
Start tracking SST unique id in MANIFEST, which is used to verify with
SST properties to make sure the SST file is not overwritten or
misplaced. A DB option `try_verify_sst_unique_id` is introduced to
enable/disable the verification, if enabled, it opens all SST files
during DB-open to read the unique_id from table properties (default is
false), so it's recommended to use it with `max_open_files = -1` to
pre-open the files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990
Test Plan: unittests, format-compatible test, mini-crash
Reviewed By: anand1976
Differential Revision: D36381863
Pulled By: jay-zhuang
fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
Summary:
**Context:**
Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.
Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.
**Summary:**
- Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that
- Added missing db bench/stress test for some memory charging features
- Renamed related test suite to indicate they are under the same theme of memory charging
- Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
- Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926
Test Plan:
- New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
- New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
- CI
- db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**
- db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal
Reviewed By: ajkr
Differential Revision: D36054712
Pulled By: hx235
fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
Summary:
This PR
- since we are testing with disable_wal = true and best_efforts_recovery, we should set column family count to 1, due to the requirement of `ExpectedState` tracking and replaying logic.
- during backup and checkpoint restore, disable best-efforts-recovery. This does not matter now because db_crashtest.py always disables wal when testing best-efforts-recovery. In the future, if we enable wal, then not setting `restore_opitions.best_efforts_recovery` will cause backup db not to recover the WALs, and differ from db (that enables WAL).
- during verification of backup and checkpoint restore, print the key where inconsistency exists between expected state and db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9986
Test Plan: TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
Reviewed By: siying
Differential Revision: D36353105
Pulled By: riversand963
fbshipit-source-id: a484da161273e6216a1f7e245bac15a349693917
Summary:
https://github.com/facebook/rocksdb/issues/9961 broke format_compatible check because of `make clean`
referencing TEST_TMPDIR. The Makefile behavior seems reasonable to me,
so here's a fix in check_format_compatible.sh
Apparently I also included removing a redundant part of our CircleCI config.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9970
Test Plan: manual run: SHORT_TEST=1 ./tools/check_format_compatible.sh
Reviewed By: riversand963
Differential Revision: D36258172
Pulled By: pdillinger
fbshipit-source-id: d46507f04614e888b414ff23b88d040ae2b5c294
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode.
Other notable changes include:
- Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint`
- Made `--read_fault_one_in > 0` return an error in release mode
- Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957
Test Plan:
```
$ DEBUG_LEVEL=0 make -j24 db_stress
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox
```
Reviewed By: anand1976
Differential Revision: D36193830
Pulled By: ajkr
fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321
Summary:
`db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`.
Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947
Test Plan:
- regular blackbox, blackbox --simple
- various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16`
Reviewed By: riversand963
Differential Revision: D36152775
Pulled By: ajkr
fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
This is inspired by debugging a regression test that runs for ~0.05 seconds and the short
running time makes it prone to variance. While db_bench ran for ~60 seconds, 59.95 seconds
was spent opening 128 databases (and doing recovery). So it was harder to notice that the
benchmark only ran for 0.05 seconds.
Normally I add output to the end of the line to make life easier for existing tools that parse it
but in this case the output near the end of the line has two optional parts and one of the optional
parts adds an extra newline.
This is for https://github.com/facebook/rocksdb/issues/9856
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9886
Test Plan:
./db_bench --benchmarks=overwrite,readrandom --num=1000000 --threads=4
old output:
DB path: [/tmp/rocksdbtest-2260/dbbench]
overwrite : 14.108 micros/op 283338 ops/sec; 31.3 MB/s
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 7.994 micros/op 496788 ops/sec; 55.0 MB/s (1000000 of 1000000 found)
new output:
DB path: [/tmp/rocksdbtest-2260/dbbench]
overwrite : 14.117 micros/op 282862 ops/sec 14.141 seconds 4000000 operations; 31.3 MB/s
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 8.649 micros/op 458475 ops/sec 8.725 seconds 4000000 operations; 49.8 MB/s (981548 of 1000000 found)
Reviewed By: ajkr
Differential Revision: D36102269
Pulled By: mdcallag
fbshipit-source-id: 5cd8a9e11f5cbe2a46809571afd83335b6b0caa0
Summary:
If the DB path is specified, the user would expect ldb loads the
options from the path, but it's not:
```
$ ldb list_live_files_metadata --db=`pwd`
```
Default `try_load_options` to true in that case. The user can still
disable that by:
```
$ ldb list_live_files_metadata --db=`pwd` --try_load_options=false
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9937
Test Plan:
`ldb list_live_files_metadata --db=`pwd`` is able to work for
a db generated with different options.num_levels.
Reviewed By: ajkr
Differential Revision: D36106708
Pulled By: jay-zhuang
fbshipit-source-id: 2732fdc027a4d172436b2c9b6a9787b56b10c710
Summary:
When compaction filter determines that a key should be removed, it updates the internal key's type
to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction
together with `SingleDelete`, it will cause compaction iterator to return Corruption.
To fix the issue, compaction filter should return more information in addition to the intention of removing
a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
`kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.
In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.
Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
not quite work with `MultiOpsTxnsStressTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929
Test Plan:
make check
make crash_test
make crash_test_with_txn
Reviewed By: anand1976
Differential Revision: D36069678
Pulled By: riversand963
fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
Summary:
Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions.
1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw.
2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true.
3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn
```
Reviewed By: pdillinger
Differential Revision: D35806678
Pulled By: riversand963
fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616
Summary:
This patch completes the second part of the task: "Add blob support to the dump and dump_live_files command"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9896
Reviewed By: ltamasi
Differential Revision: D35852667
Pulled By: jowlyzhang
fbshipit-source-id: a006456c881f468a92da689e895134762e9574e1
Summary:
This patch is the first part of adding blob dump support. It only adds blob dump support to the dump command. A follow up patch will add blob dump support to the dump_live_files command.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9881
Reviewed By: ltamasi
Differential Revision: D35796731
Pulled By: jowlyzhang
fbshipit-source-id: 2cc5973b222d505a331ac7b969edcf992b47c5ee
Summary:
This patch completes the first part of the task: "Extend all three commands so they can decode and print blob references if a new option --decode_blob_index is specified"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9870
Reviewed By: ltamasi
Differential Revision: D35753932
Pulled By: jowlyzhang
fbshipit-source-id: 9d2bbba0eef2ed86b982767eba9de1b4881f35c9
Summary:
`InitializeOptionsGeneral()` was overwriting many options that were already configured by OPTIONS file, potentially with the flag default values. This PR changes that function to only overwrite options in limited scenarios, as described at the top of its definition. Block cache is still a violation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9862
Test Plan: ran under various scenarios (multi-DB, single DB, OPTIONS file, flags) and verified options are set as expected
Reviewed By: jay-zhuang
Differential Revision: D35736960
Pulled By: ajkr
fbshipit-source-id: 75b77740af37e6f5741618f8a8f5685df2417d03
Summary:
Since they operate at distinct abstraction layers, I thought it
was prudent to combine with EncryptedEnv CI test for each PR, for efficiency
in testing. Also added supported compressions to sst_dump --help output
so that CI job can verify no compiled-in compression support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9850
Test Plan: CI, some manual stuff
Reviewed By: riversand963
Differential Revision: D35682346
Pulled By: pdillinger
fbshipit-source-id: be9879c1533fed304ee32c89fd9ba4b07c2b90cc
Summary:
This change only add decode blob index support to dump_live_files command, which is part of a task to add blob support to a few commands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9842
Reviewed By: ltamasi
Differential Revision: D35650167
Pulled By: jowlyzhang
fbshipit-source-id: a78151b98bc38ac6f52c6e01ca6927a3429ddd14
Summary:
Various renaming and fixes to get rid of remaining uses of
"backupable" which is terminology leftover from the original, flawed
design of BackupableDB. Now any DB can be backed up, using BackupEngine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D35334386
Pulled By: pdillinger
fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329
Summary:
There's an existing benchmark, "getmergeoperands", but it is unconventional in that it has multiple phases and hardcoded setup parameters.
This PR adds a different one, "readrandomoperands", that follows the pattern of other benchmarks of having a single phase and taking its configuration from existing flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9785
Test Plan:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -write_buffer_size=1048576 -max_bytes_for_level_base=4194304 -target_file_size_base=1048576 -compression_type=none -disable_auto_compactions=true
$ ./db_bench -use_existing_db=true -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -disable_auto_compactions=true -duration=10
...
readrandomoperands : 542.082 micros/op 1844 ops/sec; 0.2 MB/s (11980 of 18999 found)
```
Reviewed By: jay-zhuang
Differential Revision: D35290412
Pulled By: ajkr
fbshipit-source-id: fb367ca614b128cef844a75f0e5d9dd7c3328d85
Summary:
Fixes a bug introduced by me in https://github.com/facebook/rocksdb/pull/9733
That PR added a counter so that the per-thread seeds in ThreadState would
be unique even when --benchmarks had more than one test. But it incorrectly
used this counter as the value for ThreadState::tid as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9757
Test Plan:
Confirm that unexpectedly good QPS results on the regression tests return
to normal with this fix. I have confirmed that the QPS increase starts with
the PR 9733 diff.
Reviewed By: jay-zhuang
Differential Revision: D35149303
Pulled By: mdcallag
fbshipit-source-id: dee5cc36b7faaba6c3be6d6a253d3c2eaad72864
Summary:
This is for https://github.com/facebook/rocksdb/issues/9737
I have wasted more than a few hours running db_bench benchmarks where --seed was not set
and getting better than expected results because cache hit rates are great because
multiple invocations of db_bench used the same value for --seed or did not set it,
and then all used 0. The result is that all see the same sequence of keys.
Others have done the same. The problem is worse in that it is easy to miss and the result is a benchmark with results that are misleading.
A good way to avoid this is to set it to the equivalent of gettimeofday() when either
--seed is not set or it is set to 0 (the default).
With this change the actual seed is printed when it was 0 at process start:
Set seed to 1647992570365606 because --seed was 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9740
Test Plan:
Perf results:
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000
readrandom : 6.469 micros/op 154583 ops/sec; 17.1 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=0
readrandom : 6.565 micros/op 152321 ops/sec; 16.9 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=1
readrandom : 6.461 micros/op 154777 ops/sec; 17.1 MB/s (4000000 of 4000000 found)
./db_bench --benchmarks=fillseq,readrandom --num=1000000 --reads=4000000 --seed=2
readrandom : 6.525 micros/op 153244 ops/sec; 17.0 MB/s (4000000 of 4000000 found)
Reviewed By: jay-zhuang
Differential Revision: D35145361
Pulled By: mdcallag
fbshipit-source-id: 2b35b153ccec46b27d7c9405997523555fc51267
Summary:
This adds the --slow_usecs option with a default value of 1M. Operations that
take this much time have a message printed when --histogram=1, --stats_interval=0
and --stats_interval_seconds=0. The current code hardwired this to 20,000 usecs
and for some stress tests that reduced throughput by 20% or more.
This is for https://github.com/facebook/rocksdb/issues/9620
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9732
Test Plan:
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100 --histogram=1
./db_bench --benchmarks=fillrandom,readrandom --compression_type=lz4 --slow_usecs=100000 --histogram=1
Reviewed By: jay-zhuang
Differential Revision: D35121522
Pulled By: mdcallag
fbshipit-source-id: daf27f937efd748980545d6395db332712fc078b
Summary:
db_bench quietly parses and ignores bad values for --compaction_fadvice and --value_size_distribution_type
I prefer that it fail for them as it does for bad option values in most other cases. Otherwise a benchmark
result will be provided for the wrong configuration and the result will be misleading.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9741
Test Plan:
These now fail:
./db_bench --compaction_fadvice=noney
Unknown compaction fadvice:noney
./db_bench --value_size_distribution_type=norma
Cannot parse distribution type 'norma'
While correct values continue to work:
./db_bench --value_size_distribution_type=normal
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
./db_bench --compaction_fadvice=none
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Reviewed By: siying
Differential Revision: D35115973
Pulled By: mdcallag
fbshipit-source-id: c2b10de5c2d1ea7c7539e676f5bd556351f5d370
Summary:
When --benchmarks has more than one test then the threads in one benchmark
will use the same set of seeds as the threads in the previous benchmark.
This diff fixe that.
This fixes https://github.com/facebook/rocksdb/issues/9632
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9733
Test Plan:
For this command line the block cache is 8GB, so it caches at most 1024 8KB blocks. Note that without
this diff the second run of readrandom has a much better response time because seed reuse means the
second run reads the same 1000 blocks as the first run and they are cached at that point. But with
this diff that does not happen.
./db_bench --benchmarks=fillseq,flush,compact0,waitforcompaction,levelstats,readrandom,readrandom --compression_type=zlib --num=10000000 --reads=1000 --block_size=8192
...
```
Level Files Size(MB)
--------------------
0 0 0
1 11 238
2 9 253
3 0 0
4 0 0
5 0 0
6 0 0
```
--- perf results without this diff
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 46.212 micros/op 21618 ops/sec; 2.4 MB/s (1000 of 1000 found)
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 21.963 micros/op 45450 ops/sec; 5.0 MB/s (1000 of 1000 found)
--- perf results with this diff
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 47.213 micros/op 21126 ops/sec; 2.3 MB/s (1000 of 1000 found)
DB path: [/tmp/rocksdbtest-2260/dbbench]
readrandom : 42.880 micros/op 23299 ops/sec; 2.6 MB/s (1000 of 1000 found)
Reviewed By: jay-zhuang
Differential Revision: D35089763
Pulled By: mdcallag
fbshipit-source-id: 1b50143a07afe876b8c8e5fa50dd94a8ce57fc6b
Summary:
This changes db_bench to fail at startup for invalid compression types. It had been
changing them to Snappy. For other invalid options it fails at startup.
This is for https://github.com/facebook/rocksdb/issues/9621
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9729
Test Plan:
This continues to work:
./db_bench --benchmarks=fillrandom --compression_type=lz4
This now fails rather than changing the compression type to Snappy
./db_bench --benchmarks=fillrandom --compression_type=lz44
Cannot parse compression type 'lz44'
Reviewed By: jay-zhuang
Differential Revision: D35081323
Pulled By: mdcallag
fbshipit-source-id: 9b38c835abddce11aa7feb235df63f53cf829981
Summary:
…in order
This fixes https://github.com/facebook/rocksdb/issues/9650
For db_bench --benchmarks=fillseq --num_multi_db=X it loads databases in sequence
rather than randomly choosing a database per Put. The benefits are:
1) avoids long delays between flushing memtables
2) avoids flushing memtables for all of them at the same point in time
3) puts same number of keys per database so that query tests will find keys as expected
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9713
Test Plan:
Using db_bench.1 without the change and db_bench.2 with the change:
for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq --num_multi_db=4 --num=10000000; du -hs /data/m/rx ; done
--- without the change
fillseq : 3.188 micros/op 313682 ops/sec; 34.7 MB/s
real 2m7.787s
user 1m52.776s
sys 0m46.549s
2.7G /data/m/rx
--- with the change
fillseq : 3.149 micros/op 317563 ops/sec; 35.1 MB/s
real 2m6.196s
user 1m51.482s
sys 0m46.003s
2.7G /data/m/rx
Also, temporarily added a printf to confirm that the code switches to the next database at the right time
ZZ switch to db 1 at 10000000
ZZ switch to db 2 at 20000000
ZZ switch to db 3 at 30000000
for i in 1 2; do rm -rf /data/m/rx/* ; time ./db_bench.$i --db=/data/m/rx --benchmarks=fillseq,readrandom --num_multi_db=4 --num=100000; du -hs /data/m/rx ; done
--- without the change, smaller database, note that not all keys are found by readrandom because databases have < and > --num keys
fillseq : 3.176 micros/op 314805 ops/sec; 34.8 MB/s
readrandom : 1.913 micros/op 522616 ops/sec; 57.7 MB/s (99873 of 100000 found)
--- with the change, smaller database, note that all keys are found by readrandom
fillseq : 3.110 micros/op 321566 ops/sec; 35.6 MB/s
readrandom : 1.714 micros/op 583257 ops/sec; 64.5 MB/s (100000 of 100000 found)
Reviewed By: jay-zhuang
Differential Revision: D35030168
Pulled By: mdcallag
fbshipit-source-id: 2a18c4ec571d954cf5a57b00a11802a3608823ee
Summary:
Changes:
* improves monitoring by displaying average size of a Put value and average scan length
* forces the minimum value size to be 10. Before this it was 0 if you didn't set the distribution parameters.
* uses reasonable defaults for the distribution parameters that determine value size and scan length
* includes seeks in "reads ... found" message, before this they were missing
This is for https://github.com/facebook/rocksdb/issues/9672
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9711
Test Plan:
Before this change:
./db_bench --benchmarks=fillseq,mixgraph --mix_get_ratio=50 --mix_put_ratio=25 --mix_seek_ratio=25 --num=100000 --value_k=0.2615 --value_sigma=25.45 --iter_k=2.517 --iter_sigma=14.236
fillseq : 4.289 micros/op 233138 ops/sec; 25.8 MB/s
mixgraph : 18.461 micros/op 54166 ops/sec; 755.0 MB/s ( Gets:50164 Puts:24919 Seek:24917 of 50164 in 75081 found)
After this change:
./db_bench --benchmarks=fillseq,mixgraph --mix_get_ratio=50 --mix_put_ratio=25 --mix_seek_ratio=25 --num=100000 --value_k=0.2615 --value_sigma=25.45 --iter_k=2.517 --iter_sigma=14.236
fillseq : 3.974 micros/op 251553 ops/sec; 27.8 MB/s
mixgraph : 16.722 micros/op 59795 ops/sec; 833.5 MB/s ( Gets:50164 Puts:24919 Seek:24917, reads 75081 in 75081 found, avg size: 36.0 value, 504.9 scan)
Reviewed By: jay-zhuang
Differential Revision: D35030190
Pulled By: mdcallag
fbshipit-source-id: d8f555f28d869f752ddb674a524108884511b151
Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. In part because of
modular code factoring, it doesn't seem easy to do this automagically,
where opening an SST file and observing current Temperature different
from expected would trigger a change in metadata and DB manifest write
(essentially giving the deep read path access to the write path). It is also
difficult to do this while the DB is open because of the limitations of
LogAndApply.
This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).
Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)
Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is used and want existing DB metadata updated
for it.
* It's possible that for some repair scenarios, lighter weight than
full repair, we might want to support UpdateManifestForFilesState() to
modify critical file details like size or checksum using same
algorithm. But let's make sure these are differentiated from modifying
file details in ways that don't suspect corruption (or require extreme
trust).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683
Test Plan: unit tests added
Reviewed By: jay-zhuang
Differential Revision: D34798828
Pulled By: pdillinger
fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".
To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.
Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660
Test Plan:
related unit test majorly updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.
Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.
Reviewed By: ajkr
Differential Revision: D34686968
Pulled By: pdillinger
fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
Summary:
BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454
Test Plan: Run all existing tests.
Reviewed By: ajkr
Differential Revision: D33805827
fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d
Summary:
**Summary:**
RocksDB uses a block cache to reduce IO and make queries more efficient. The block cache is based on the LRU algorithm (LRUCache) and keeps objects containing uncompressed data, such as Block, ParsedFullFilterBlock etc. It allows the user to configure a second level cache (rocksdb::SecondaryCache) to extend the primary block cache by holding items evicted from it. Some of the major RocksDB users, like MyRocks, use direct IO and would like to use a primary block cache for uncompressed data and a secondary cache for compressed data. The latter allows us to mitigate the loss of the Linux page cache due to direct IO.
This PR includes a concrete implementation of rocksdb::SecondaryCache that integrates with compression libraries such as LZ4 and implements an LRU cache to hold compressed blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9518
Test Plan:
In this PR, the lru_secondary_cache_test.cc includes the following tests:
1. The unit tests for the secondary cache with either compression or no compression, such as basic tests, fails tests.
2. The integration tests with both primary cache and this secondary cache .
**Follow Up:**
1. Statistics (e.g. compression ratio) will be added in another PR.
2. Once this implementation is ready, I will do some shadow testing and benchmarking with UDB to measure the impact.
Reviewed By: anand1976
Differential Revision: D34430930
Pulled By: gitbw95
fbshipit-source-id: 218d78b672a2f914856d8a90ff32f2f5b5043ded
Summary:
When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs.
Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556
Reviewed By: anand1976
Differential Revision: D34308216
Pulled By: sidroyc
fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
Opening DB as seconeary instance has been supported in ldb but it is not mentioned in --help. Mention it there. The part of the help message after the modification:
```
commands MUST specify --db=<full_path_to_db_directory> when necessary
commands can optionally specify
--env_uri=<uri_of_environment> or --fs_uri=<uri_of_filesystem> if necessary
--secondary_path=<secondary_path> to open DB as secondary instance. Operations not supported in secondary instance will fail.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9582
Test Plan: Build and run ldb --help
Reviewed By: riversand963
Differential Revision: D34286427
fbshipit-source-id: e56c5290d0548098ab6acc6dde2167f5a64f34f3
Summary:
I did another pass through running CI jobs. It is uncommon now to see
`db_stress` stuck in the setup phase but still happen.
One reason was repeatedly reading/verifying checksum on filter blocks when
`-cache_index_and_filter_blocks=1` and `-cache_size=1048576`. To address
that I increased the cache size.
Another reason was having a WAL with many range tombstones and every
`db_stress` run using `-avoid_flush_during_recovery=1` (in that
scenario, the setup phase spent too much CPU in
`rocksdb::MemTable::NewRangeTombstoneIteratorInternal()`). To address
that I fixed the `-avoid_flush_during_recovery` setting so it is
reevaluated for every `db_stress` run.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9483
Reviewed By: riversand963
Differential Revision: D33922929
Pulled By: ajkr
fbshipit-source-id: 0a298ec7c4df6f6b44620233996047a2dc7ee5f3
Summary:
This change removes the ability to configure the deprecated,
inefficient block-based filter in the public API. Options that would
have enabled it now use "full" (and optionally partitioned) filters.
Existing block-based filters can still be read and used, and a "back
door" way to build them still exists, for testing and in case of trouble.
About the only way this removal would cause an issue for users is if
temporary memory for filter construction greatly increases. In
HISTORY.md we suggest a few possible mitigations: partitioned filters,
smaller SST files, or setting reserve_table_builder_memory=true.
Or users who have customized a FilterPolicy using the
CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade
their code. (It's long past time for people to move to the new
builder/reader customization interface.)
This change also introduces some internal-use-only configuration strings
for testing specific filter implementations while bypassing some
compatibility / intelligence logic. This is intended to hint at a path
toward making FilterPolicy Customizable, but it also gives us a "back
door" way to configure block-based filter.
Aside: updated db_bench so that -readonly implies -use_existing_db
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535
Test Plan:
Unit tests updated. Specifically,
* BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back
door configuration interface and ignoring of `use_block_based_builder`.
* BlockBasedTableTest.TracingGetTest is migrated from testing
block-based filter access pattern to full filter access patter, by
re-ordering some things.
* Options test (pretty self-explanatory)
Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30`
Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB
With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB
No consistent difference with fillrandom
Reviewed By: jay-zhuang
Differential Revision: D34153871
Pulled By: pdillinger
fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788508
Pulled By: akankshamahajan15
fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
Summary:
After https://github.com/facebook/rocksdb/issues/9481, we are using newer default compiler for
build-format-compatible CircleCI nightly job, which fails on building
2.2.fb.branch branch because it tries to use a pre-compiled libsnappy.a
that is checked into the repo (!). This works around that by setting
SNAPPY_LDFLAGS=-lsnappy, which is only understood by such old versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9517
Test Plan:
Run check_format_compatible.sh on Ubuntu 20 AWS machine,
watch nightly run
Reviewed By: hx235
Differential Revision: D34055561
Pulled By: pdillinger
fbshipit-source-id: 45f9d428dd082f026773bfa8d9dd4dad66fc9378
Summary:
The patch does some cleanup in and around `VersionStorageInfo`:
* Renames the method `PrepareApply` to `PrepareAppend` in `Version`
to make it clear that it is to be called before appending the `Version` to
`VersionSet` (via `AppendVersion`), not before applying any `VersionEdit`s.
* Introduces a helper method `VersionStorageInfo::PrepareForVersionAppend`
(called by `Version::PrepareAppend`) that encapsulates the population of the
various derived data structures in `VersionStorageInfo`, and turns the
methods computing the derived structures (`UpdateNumNonEmptyLevels`,
`CalculateBaseBytes` etc.) into private helpers.
* Changes `Version::PrepareAppend` so it only calls `UpdateAccumulatedStats`
if the `update_stats` flag is set. (Earlier, this was checked by the callee.)
Related to this, it also moves the call to `ComputeCompensatedSizes` to
`VersionStorageInfo::PrepareForVersionAppend`.
* Updates and cleans up `version_builder_test`, `version_set_test`, and
`compaction_picker_test` so `PrepareForVersionAppend` is called anytime
a new `VersionStorageInfo` is set up or saved. This cleanup also involves
splitting `VersionStorageInfoTest.MaxBytesForLevelDynamic`
into multiple smaller test cases.
* Fixes up a bunch of comments that were outdated or just plain incorrect.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9494
Test Plan: Ran `make check` and the crash test script for a while.
Reviewed By: riversand963
Differential Revision: D33971666
Pulled By: ltamasi
fbshipit-source-id: fda52faac7783041126e4f8dec0fe01bdcadf65a
Summary:
Added a CountedFileSystem that tracks a number of file operations (opens, closes, deletes, renames, flushes, syncs, fsyncs, reads, writes). This class was based on the ReportFileOpEnv from db_bench.
This is a stepping stone PR to be able to change the SpecialEnv into a SpecialFileSystem, where several of the file varieties wish to do operation counting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9283
Reviewed By: pdillinger
Differential Revision: D33062004
Pulled By: mrambacher
fbshipit-source-id: d0d297a7fb9c48c06cbf685e5fa755c27193b6f5
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum
This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.
Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.
**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
- See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
- Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342
Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
- For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
- FastLocalBloom
- Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
- After change:
- `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
- Standard128Ribbon
- Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
- After change:
- `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.
Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
Summary:
Apparently setting total_order_seek=true for DB::Get was
intended to allow accurate read semantics if the current prefix
extractor doesn't match what was used to generate SST files on
disk. But since prefix_extractor was made a mutable option in 5.14.0, we
have been able to detect this case and provide the correct semantics
regardless of the total_order_seek option. Since that time, the option
has only made Get() slower in a reasonably common case: prefix_extractor
unchanged and whole_key_filtering=false.
So this change primarily removes unnecessary effect of
total_order_seek on Get. Also cleans up some related comments.
Also adds a -total_order_seek option to db_bench and canonicalizes
handling of ReadOptions in db_bench so that command line options have
the expected association with library features. (There is potential
for change in regression test behavior, but the old behavior is likely
indefensible, or some other inconsistency would need to be fixed.)
TODO in follow-up work: there should be no reason for Get() to depend on
current prefix extractor at all.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427
Test Plan:
Unit tests updated.
Performance (using db_bench update)
Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`
Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Before this change, total_order_seek=false: 25188 ops/sec
Before this change, total_order_seek=true: 1222 ops/sec (~20x slower)
After this change, total_order_seek=false: 24570 ops/sec
After this change, total_order_seek=true: 25012 ops/sec (indistinguishable)
Reviewed By: siying
Differential Revision: D33753458
Pulled By: pdillinger
fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
Summary:
Despite attempts to optimize `db_stress` setup phase (i.e.,
pre-`OperateDb()`) latency in https://github.com/facebook/rocksdb/issues/9470 and https://github.com/facebook/rocksdb/issues/9475, it still always took tens
of seconds. Since we still aren't able to setup a 100M key `db_stress`
quickly, we should reduce the number of keys. This PR reduces it 4x
while increasing `value_size_mult` 4x (from its default value of 8) so
that memtables and SST files fill at a similar rate compared to before this PR.
Also disabled bzip2 compression since we'll probably never use it and
I noticed many CI runs spending majority of CPU on bzip2 decompression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9476
Reviewed By: siying
Differential Revision: D33898520
Pulled By: ajkr
fbshipit-source-id: 855021784ad9664f2be5bce21f0339a1cf93230d
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455
Test Plan: Rely on my eyeball and CI
Reviewed By: ajkr
Differential Revision: D33811664
Pulled By: hx235
fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
Summary:
In RocksDB DBOptions::skip_log_error_on_recovery is marked as
"NOT SUPPORTED" for a long time, and setting this option does not have
any effect on the behavior of RocksDB library. Therefore, we are removing it
in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9434
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33763015
Pulled By: akankshamahajan15
fbshipit-source-id: 11f09643298da6c02d3dcdb090b996f4c3cfdd76
Summary:
Even after https://github.com/facebook/rocksdb/issues/9461 could see
```
Error: please specify prefix_size for test_batches_snapshots test!
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9463
Test Plan:
run `make blackbox_crashtest` for a long time. (Unfortunately,
it's taking a long time to reproduce these failures)
Reviewed By: akankshamahajan15
Differential Revision: D33838152
Pulled By: pdillinger
fbshipit-source-id: b9a73c5bbb68df53f14c22b9b52f61d1f7ef38af
Summary:
The API is deprecated long time ago. Clean up the codebase by
removing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462
Test Plan: CI, fake release: D33835220
Reviewed By: riversand963
Differential Revision: D33835103
Pulled By: jay-zhuang
fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
Summary:
Changes in https://github.com/facebook/rocksdb/issues/9453 could trigger
```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9461
Test Plan: run `make blackbox_crashtest` for a long time
Reviewed By: ajkr
Differential Revision: D33830751
Pulled By: pdillinger
fbshipit-source-id: be88377dcaa47e4bb7adb0347762639eff8f1476
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.
In reference to https://github.com/facebook/rocksdb/issues/9389
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33780269
Pulled By: pdillinger
fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.
This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)
Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453
Test Plan:
new unit test added. Added memtable Bloom to crash test.
Performance testing
--------------------
Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```
Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)
Before & after run simultaneously, average over 10 iterations per data point, ops/sec.
MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066
MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990
MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before: 12081715
After: 12117603
MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591
Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.
Reviewed By: ajkr
Differential Revision: D33805025
Pulled By: pdillinger
fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452
Test Plan: Rely on my eyeball and CI
Reviewed By: ajkr
Differential Revision: D33804938
Pulled By: hx235
fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
Summary:
Add an option to set the WAL compression algorithm - wal_compression.
TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432
Reviewed By: pdillinger
Differential Revision: D33797275
Pulled By: sidroyc
fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
Summary:
This PR is one proposal to resolve https://github.com/facebook/rocksdb/issues/9382.
Looking at the code, I can't think of a reason why rdb is an internal component of RocksDB: it does not require
any header files NOT in `include/rocksdb`. It's a better idea to host it somewhere else.
Plus, rdb requires python2 which is not supported any more. No fixes or improvements will be made, even for potential
security bugs (https://www.python.org/doc/sunset-python-2/).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9399
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D33641965
Pulled By: riversand963
fbshipit-source-id: 2a6a74693e5de36834f355e41d6865db206af48b
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.
Java/JNI is not included yet, and needs to be done later if necessary.
The goal is to include this commit in RocksDB 7.0 release.
Reference:
https://github.com/ajkr/dedupfs by ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170
Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.
make check
Reviewed By: ajkr
Differential Revision: D33751662
Pulled By: riversand963
fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.
Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363
Test Plan:
basic coverage in unit tests
Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see
force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s
Reviewed By: mrambacher
Differential Revision: D33636559
Pulled By: pdillinger
fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369
Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all
Reviewed By: pdillinger
Differential Revision: D33517260
Pulled By: riversand963
fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (https://github.com/facebook/rocksdb/issues/8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.
To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.
Depends on https://github.com/facebook/rocksdb/issues/9305.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9338
Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.
Reviewed By: riversand963
Differential Revision: D33345333
Pulled By: ajkr
fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293
Reviewed By: pdillinger, zhichao-cao
Differential Revision: D33181591
Pulled By: mrambacher
fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
Summary:
Several improvements to SimulatedHybridFileSystem:
(1) Allow a mode where all I/Os to all files simulate HDD. This can be enabled in db_bench using -simulate_hdd
(2) Latency calculation is slightly more accurate
(3) Allow to simulate more than one HDD spindles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9301
Test Plan: Run db_bench and observe the results are reasonable.
Reviewed By: jay-zhuang
Differential Revision: D33141662
fbshipit-source-id: b736e58c4ba910d06899cc9ccec79b628275f4fa
Summary:
db_crashtest.py uses multiple CFs only when run without flag `--simple`.
The previous config set `-test_batches_snapshots=1` in that case for
blackbox mode. But `-test_batches_snapshots=1` cannot verify recovery
correctness, so it should not always be set for multi-CF blackbox tests.
We can instead randomly toggle it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9303
Reviewed By: riversand963
Differential Revision: D33155229
Pulled By: ajkr
fbshipit-source-id: 4a6fdc4eddccc8ece664063baf6393ce1c5de6b7
Summary:
SimulatedHybridFileSystem now takes a more thorough simualtion of an HDD:
1. cover writes too, not just read
2. Latency and throughput is now simulated as seek + read time, using a rate limiter
This implementation can be modified to simulate full HDD behavior, which is not yet done.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9259
Test Plan: Run db_bench and observe the desired behavior.
Reviewed By: jay-zhuang
Differential Revision: D32903039
fbshipit-source-id: a83f5d72143e114d5e75edf39d647bf0b71978e1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
When table_options.prepopulate_block_cache is set to
BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and
table_options.partition_filters is also set true, then there is
segmentation failure when top level filter is fetched because its
entered with wrong type in cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263
Test Plan:
Updated unit tests;
Ran db_stress: make crash_test -j32
Reviewed By: pdillinger
Differential Revision: D32936566
Pulled By: akankshamahajan15
fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426
Summary:
Fix a bug that causes file temperature not preserved after DB is restarted, or options.max_manifest_file_size is hit.
Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242
Test Plan: Add a unit test that would fail without the fix.
Reviewed By: jay-zhuang
Differential Revision: D32818150
fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec
Summary:
Disable the QPS verification in test temporally, which causes the test failure due to different system delays.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9190
Test Plan: make check
Reviewed By: siying
Differential Revision: D32576289
Pulled By: zhichao-cao
fbshipit-source-id: 1df972e77dd82eed5af3462e5db5e141aadf8fae
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
Fix the analyzer test failure caused by inaccurate timing wait. The wait time at different system might be different or cause the delay, now we do not accurately count the lines. Only in a very rare extreme case, test will ignore the part exceed the timing of 1 second.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9181
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D32511319
Pulled By: zhichao-cao
fbshipit-source-id: e694c8cb465c750cfa5a43dab3eff6707b9a11c8
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.
This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.
1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056
Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.
Reviewed By: anand1976
Differential Revision: D31773640
Pulled By: akankshamahajan15
fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
Reviewed By: riversand963
Differential Revision: D32007222
Pulled By: anand1976
fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
Summary:
It is useful to add options.manual_wal_flush to db_bench
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9132
Test Plan: Run the benchamrk with the option.
Reviewed By: ltamasi
Differential Revision: D32188060
fbshipit-source-id: a70835d3cad0f30095218dfda1daff0a432892e5
Summary:
Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles.
When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed.
In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749
Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.
Reviewed By: jay-zhuang
Differential Revision: D30735261
fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.
Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument
This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.
DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.
Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.
### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
As you can see, especially once warmed up, xxh3 is fastest.
### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
Test
for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
Results (ops/sec)
for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3
Reviewed By: mrambacher
Differential Revision: D31905249
Pulled By: pdillinger
fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
Summary:
This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty.
In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control.
Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655
Test Plan: Add unit tests to the functionality.
Reviewed By: ajkr
Differential Revision: D31787034
fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
Summary:
https://github.com/facebook/rocksdb/issues/8031 broke internal tests. This should fix but also preserve
the intended capability of getting git commit id when hg not used
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9047
Test Plan: already broken ¯\\_(ツ)_/¯
Reviewed By: zhichao-cao
Differential Revision: D31732198
Pulled By: pdillinger
fbshipit-source-id: 7dba8531ddca55a6de5e04978a1a1601aae4cee9
Summary:
Set the perf_level in ```tools/regression_test.sh``` in order to exercise ```PerfContext``` counters in regression tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8031
Test Plan: Manually run the script
Reviewed By: ajkr
Differential Revision: D31508269
Pulled By: anand1976
fbshipit-source-id: 20ddfd1cbca37f1439eed2870086a86d90653b44
Summary:
The patch adds a new BlobDB benchmarking script called `run_blob_bench.sh`.
It is a thin wrapper around `benchmark.sh` (similarly to `run_flash_bench.sh`):
it actually calls `benchmark.sh` a number of times, cycling through six workloads,
two write-only ones (bulk load and overwrite), two read/write ones (point lookups
while writing, range scans while writing), and two read-only ones (point lookups
and range scans).
Note: this is a simpler/cleaned up/reworked version of the script used to produce the
benchmark results in http://rocksdb.org/blog/2021/05/26/integrated-blob-db.html .
The new version takes advantage of several recent `benchmark.sh` improvements
like the ability to pass in arbitrary `db_bench` options or the possibility of using a
job ID.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9015
Test Plan: Ran the script manually with different parameter combinations.
Reviewed By: riversand963
Differential Revision: D31555277
Pulled By: ltamasi
fbshipit-source-id: 0e151b2f7b2cf6f66ed7f95455571492ad7ea87f
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.
In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)
This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
Reviewed By: riversand963
Differential Revision: D31489850
Pulled By: ltamasi
fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.
The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995
Test Plan:
- Verified it fixes the following failure:
```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```
- `make check -j48`
Reviewed By: ltamasi
Differential Revision: D31495388
Pulled By: ajkr
fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
Summary:
Enable SingleDelete with user defined timestamp in db_bench,
db_stress and crash test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8971
Test Plan:
1. For db_stress, ran the command for full duration: i) python3 -u tools/db_crashtest.py
--enable_ts whitebox --nooverwritepercent=100
ii) make crash_test_with_ts
2. For db_bench, ran: ./db_bench -benchmarks=randomreplacekeys
-user_timestamp_size=8 -use_single_deletes=true
Reviewed By: riversand963
Differential Revision: D31246558
Pulled By: akankshamahajan15
fbshipit-source-id: 29cd8740c9921341e52f09242fca3c44d75a12b7
Summary:
The ldb list_live_files_metadata command does not print any information about blob files currently. We would like to add this functionality. Note that list_live_files_metadata has two different modes of operation: the one shown above, which shows the LSM tree structure, and another one, which can be enabled using the flag --sort_by_filename and simply lists the files in numerical order regardless of level. We would like to show blob files in both modes.
Changes:
1. Using GetAllColumnFamilyMetaData API instead of GetLiveFilesMetaData API for fetching live files data.
Testing:
1. Created a sample rocksdb instance using dbbench command (this creates both SST and blob files)
2. Checked if the blob files are listed or not by using ldb commands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8976
Reviewed By: ltamasi
Differential Revision: D31316061
Pulled By: pradeepambati
fbshipit-source-id: d15cdea192febf7a45f28deee2ba40615d3d84ab
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
Summary:
This is a precursor refactoring to enable an upcoming feature: persistence failure correctness testing.
- Changed `--expected_values_path` to `--expected_values_dir` and migrated "db_crashtest.py" to use the new flag. For persistence failure correctness testing there are multiple possible correct states since unsynced data is allowed to be dropped. Making it possible to restore all these possible correct states will eventually involve files containing snapshots of expected values and DB trace files.
- The expected values directory is managed by an `ExpectedStateManager` instance. Managing expected state files is separated out of `SharedState` to prevent `SharedState` from becoming too complex when the new files and features (snapshotting, tracing, and restoring) are introduced.
- Migrated expected values file access/management out of `SharedState` into a separate class called `ExpectedState`. This is not exposed directly to the test but rather the `ExpectedState` for the latest values file is accessed via a pass-through API on `ExpectedStateManager`. This forces the test to always access the single latest `ExpectedState`.
- Changed the initialization of the latest expected values file to use a tempfile followed by rename, and also add cleanup logic for possible stranded tempfiles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8913
Test Plan:
run in several ways; try to make sure it's not obviously broken.
- crashtest blackbox without TEST_TMPDIR
```
$ python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest blackbox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest whitebox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py whitebox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none --random_kill_odd=88887
```
- db_stress without expected_values_dir
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true
```
- db_stress with expected_values_dir and manual corruption
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true --expected_values_dir=./
// modify one byte in "./LATEST.state"
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=false --expected_values_dir=./
...
Verification failed for column family 0 key 0000000000000000 (0): Value not found: NotFound:
...
```
Reviewed By: riversand963
Differential Revision: D30921951
Pulled By: ajkr
fbshipit-source-id: babfe218062e55d018c9b046536c0289fb78f41c
Summary:
For now, disable it since the below command indicates it can cause a
failure. Running that command with `-experimental_mempurge_threshold=0`
has been running successfully for several minutes, whereas before it
failed in seconds.
```
$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8958
Reviewed By: ltamasi
Differential Revision: D31187059
Pulled By: ajkr
fbshipit-source-id: 04d5bfb4fcc4f5b66233e691427dfd940c67037f
Summary:
Several improvements to MultiRead:
1. Fix a bug in stress test which causes false positive when both MultiRead() return and individual read request have failure injected.
2. Add two more types of fault that should be handled: empty read results and checksum mismatch
3. Add a message indicating which type of fault is injected
4. Increase the failure rate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8937
Reviewed By: anand1976
Differential Revision: D31085930
fbshipit-source-id: 3a04994a3cadebf9a64d25e1fe12b14b7a272fba
Summary:
In FileChecksumTestHelper::VerifyEachFileChecksum(), we query the file list, and then for each file in the list verify the checksum. However, compaction can delete those files in the mean time and cause failures. To prevent it from happening, disable file deletion during the validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8942
Test Plan: Run exsiting test and see it doesn't fail.
Reviewed By: pdillinger
Differential Revision: D31086488
fbshipit-source-id: 554608f36d2dd3bf0a20dfc4039c68bd8533d7f8
Summary:
In case of IO uring bugs, we need to provide a way for users to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8931
Test Plan: Manually run db_bench with/without the option and verify the behavior
Reviewed By: pdillinger
Differential Revision: D31040252
Pulled By: anand1976
fbshipit-source-id: 56f2537d6ac8488c9e126296d8190ad9e0158f70
Summary:
As title. The reason is that after loading customized options, the env is not set back to the correct one. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8929
Test Plan: Manually validate in an environment where the command failed.
Reviewed By: riversand963
Differential Revision: D31026931
fbshipit-source-id: c25dc788bf80ed5bf4b24922c442781943bcd65b
Summary:
Because even 32-bit systems can have large files
This is a "change" that I don't want intermingled with an upcoming refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8926
Test Plan: CI
Reviewed By: zhichao-cao
Differential Revision: D31020974
Pulled By: pdillinger
fbshipit-source-id: ca9eb4510697df6f1f55e37b37730b88b1809a92
Summary:
* Started on some proper usage text to document the options
* Added a `JOB_ID` parameter, so that we can trace jobs and relate them to other assets
* Now generates a correct TSV file of the summary
* Summary has new additional fields:
* RocksDB Version
* Date
* Job ID
* db_bench log files now also include the Job ID
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8730
Reviewed By: mrambacher
Differential Revision: D30747344
Pulled By: jay-zhuang
fbshipit-source-id: 87eb78d20959b6d95804aebf129606fa9c71f407
Summary:
These tests would frequently fail to find SST files due to race
condition in running ldb (read-only) on an open DB which might do automatic
compaction. But only sometimes would that failure translate into test
failure because the implementation of ldb file_checksum_dump would
swallow many errors. Now,
* DB closed while running ldb to avoid unnecessary race condition
* Detect and report/propagate more failures in `ldb file_checksum_dump`
* Use --hex so that random binary data is not printed to console
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8898
Test Plan: ./ldb_cmd_test --gtest_filter=*Checksum* --gtest_repeat=100
Reviewed By: zhichao-cao
Differential Revision: D30848738
Pulled By: pdillinger
fbshipit-source-id: 20290b517eeceba99bb538bb5a17088f7e878405
Summary:
This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership. Previously, many classes implemented their own ownership solutions. Fixes https://github.com/facebook/rocksdb/issues/8606
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8618
Reviewed By: pdillinger
Differential Revision: D30136064
Pulled By: mrambacher
fbshipit-source-id: d0bf471df8818dbb1770a86335fe98f761cca193
Summary:
Make the Statistics object into a Customizable object. Statistics can now be stored and created to/from the Options file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8637
Reviewed By: zhichao-cao
Differential Revision: D30530550
Pulled By: mrambacher
fbshipit-source-id: 5fc7d01d8431f37b2c205bbbd8342c9f697023bd
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
Summary:
* Don't hardcode namespace rocksdb (use ROCKSDB_NAMESPACE)
* Don't #include <rocksdb/...> (use double quotes)
* Support putting NOCOMMIT (any case) in source code that should not be
committed/pushed in current state.
These will be run with `make check` and in GitHub actions
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8821
Test Plan: existing tests, manually try out new checks
Reviewed By: zhichao-cao
Differential Revision: D30791726
Pulled By: pdillinger
fbshipit-source-id: 399c883f312be24d9e55c58951d4013e18429d92
Summary:
Regression test is broken and not running:
1. failed test is not reporting, fix it by add `set -e`
2. internal regression test is not run inside github, removing that
3. fix a few minor issues to pass the test
4. delete unused binary size build, and regression test is reporting binary size now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8753
Test Plan: CI: https://www.internalfb.com/intern/sandcastle/job/13510799359573861
Reviewed By: ltamasi
Differential Revision: D30754380
Pulled By: jay-zhuang
fbshipit-source-id: 0cfa008327fff31bc61118a3fe642924090d28e1
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
Test Plan: existing tests
Reviewed By: mrambacher
Differential Revision: D30700039
Pulled By: pdillinger
fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
Summary:
Current implementation does not support user-defined timestamp when
block-based filter is used. Will implement the support in the future, or
wait to see if block-based filter can be deprecated and removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8703
Test Plan: make whitebox_crash_test_with_ts
Reviewed By: pdillinger
Differential Revision: D30528931
Pulled By: riversand963
fbshipit-source-id: 60dd74ee0a6194e69072069d8c4bd876f249f38d
Summary:
Handler functions now use a common output function to output to stdout/files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8697
Test Plan: `trace_analyzer_test` can pass.
Reviewed By: zhichao-cao
Differential Revision: D30527696
Pulled By: autopear
fbshipit-source-id: c626cf4d53a39665a9c4bcf0cb019c448434abe4
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.
So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)
I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.
C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.
BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
Test Plan: new + updated tests, including crash test
Reviewed By: jay-zhuang
Differential Revision: D30445797
Pulled By: pdillinger
fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.
New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".
`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
Reviewed By: ajkr
Differential Revision: D30290216
Pulled By: autopear
fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
The last few releases overlooked adding to this test. This
change fixes that.
This change also fixes the problem of older branches not understanding
ROCKSDB_NO_FBCODE and referencing compilers no longer supported.
During the test, build_detect_platform is patched to force no FBCODE
compiler usage. (We should not need to update old branches perpetually.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8651
Test Plan: local run reproduces regression described in https://github.com/facebook/rocksdb/issues/8650
Reviewed By: jay-zhuang, zhichao-cao
Differential Revision: D30261872
Pulled By: pdillinger
fbshipit-source-id: 02b447d224d7e0eb8613c63185437ded146713bc
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
Reviewed By: pdillinger
Differential Revision: D30149315
Pulled By: bjlemaire
fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
Summary:
`GenericRateLimiter` slow path handles requests that cannot be satisfied
immediately. Such requests enter a queue, and their thread stays in `Request()`
until they are granted or the rate limiter is stopped. These threads are
responsible for unblocking themselves. The work to do so is split into two main
duties.
(1) Waiting for the next refill time.
(2) Refilling the bytes and granting requests.
Prior to this PR, the slow path logic involved a leader election algorithm to
pick one thread to perform (1) followed by (2). It elected the thread whose
request was at the front of the highest priority non-empty queue since that
request was most likely to be granted. This algorithm was efficient in terms of
reducing intermediate wakeups, which is a thread waking up only to resume
waiting after finding its request is not granted. However, the conceptual
complexity of this algorithm was too high. It took me a long time to draw a
timeline to understand how it works for just one edge case yet there were so
many.
This PR drops the leader election to reduce conceptual complexity. Now, the two
duties can be performed by whichever thread acquires the lock first. The risk
of this change is increasing the number of intermediate wakeups, however, we
took steps to mitigate that.
- `wait_until_refill_pending_` flag ensures only one thread performs (1). This\
prevents the thundering herd problem at the next refill time. The remaining\
threads wait on their condition variable with an unbounded duration -- thus we\
must remember to notify them to ensure forward progress.
- (1) is typically done by a thread at the front of a queue. This is trivial\
when the queues are initially empty as the first choice that arrives must be\
the only entry in its queue. When queues are initially non-empty, we achieve\
this by having (2) notify a thread at the front of a queue (preferring higher\
priority) to perform the next duty.
- We do not require any additional wakeup for (2). Typically it will just be\
done by the thread that finished (1).
Combined, the second and third bullet points above suggest the refill/granting
will typically be done by a request at the front of its queue. This is
important because one wakeup is saved when a granted request happens to be in an
already running thread.
Note there are a few cases that still lead to intermediate wakeup, however. The
first two are existing issues that also apply to the old algorithm, however, the
third (including both subpoints) is new.
- No request may be granted (only possible when rate limit dynamically\
decreases).
- Requests from a different queue may be granted.
- (2) may be run by a non-front request thread causing it to not be granted even\
if some requests in that same queue are granted. It can happen for a couple\
(unlikely) reasons.
- A new request may sneak in and grab the lock at the refill time, before the\
thread finishing (1) can wake up and grab it.
- A new request may sneak in and grab the lock and execute (1) before (2)'s\
chosen candidate can wake up and grab the lock. Then that non-front request\
thread performing (1) can carry over to perform (2).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602
Test Plan:
- Use existing tests. The edge cases listed in the comment are all performance\
related; I could not really think of any related to correctness. The logic\
looks the same whether a thread wakes up/finishes its work early/on-time/late,\
or whether the thread is chosen vs. "steals" the work.
- Verified write throughput and CPU overhead are basically the same with and\
without this change, even in a rate limiter heavy workload:
Test command:
```
$ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
```
Results before this PR:
```
fillrandom : 108.463 micros/op 9219 ops/sec; 9.0 MB/s
7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
```
Results after this PR:
```
fillrandom : 108.108 micros/op 9250 ops/sec; 9.0 MB/s
7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
```
Reviewed By: hx235
Differential Revision: D30048013
Pulled By: ajkr
fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310
Test Plan: Add several unit tests.
Reviewed By: ajkr
Differential Revision: D28493792
fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
- Added Options/Configurable/Object Registration for TTL and Cassandra variants
- Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option
- Added tests for new functionality
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481
Reviewed By: zhichao-cao
Differential Revision: D30136050
Pulled By: mrambacher
fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
Summary:
Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.
After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).
Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582
Reviewed By: pdillinger, autopear
Differential Revision: D29881122
Pulled By: mrambacher
fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
Summary:
Introduction of a new `fillanddeleteuniquerandom` benchmark (`db_bench`) with 5 new option flags to simulate a benchmark where the following sequence is repeated multiple times:
"A set of keys S1 is inserted ('`disposable entries`'), then after some delay another set of keys S2 is inserted ('`persistent entries`') and the first set of keys S1 is deleted. S2 artificially represents the insertion of hypothetical results from some undefined computation done on the first set of keys S1. The next sequence can start as soon as the last disposable entry in the set S1 of this sequence is inserted, if the `delay` is non negligible."
New flags:
- `disposable_entries_delete_delay`: minimum delay in microseconds between insertion of the last `disposable` entry, and the start of the insertion of the first `persistent` entry.
- `disposable_entries_batch_size`: number of `disposable` entries inserted at the beginning of each sequence.
- `disposable_entries_value_size`: size of the random `value` string for the `disposable` entries.
- `persistent_entries_batch_size`: number of `persistent` entries inserted at the end of each sequence, right before the deletion of the `disposable` entries starts.
- `persistent_entries_value_size`: size of the random value string for the `persistent` entries.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8593
Reviewed By: pdillinger
Differential Revision: D29974436
Pulled By: bjlemaire
fbshipit-source-id: f578033e5b45e8268ba6fa6f38f4770c2e6e801d
Summary:
* Basic handling of SST file with just range tombstones rather than
failing assertion about smallest_seqno <= largest_seqno
* Adds --verbose option so that there exists a way to see the INFO
output from Repairer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8544
Test Plan: unit test added, manual testing for --verbose
Reviewed By: ajkr
Differential Revision: D29954805
Pulled By: pdillinger
fbshipit-source-id: 696af25805fc36cc178b04ba6045922a22625fd9
Summary:
Add `experimental_mempurge_policy` flag to `db_stress` and `db_crashtest.py`.
This flag is only read if the `experimental_allow_mempurge` flag is set to `true`. This flag can take the following values: `kAlways`, and `kAlternate` (default).
- `kAlways`: a flush is always redirected to a mempurge. If the mempurge aborts, the a regular flush proceeds.
- `kAlternate`: if one or more of the flush input memtables is an mempurge output memtable, then a flush is performed, else a mempurge is carried out. Similar to kAlways, if a mempurge aborts, the FlushJob proceeds to a regular flush to storage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8588
Reviewed By: pdillinger
Differential Revision: D29934251
Pulled By: bjlemaire
fbshipit-source-id: 90c1debed2029b9915d066914556547507c33dae
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE.
`ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput).
`ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios.
This PR also includes support for `experimental_mempurge_policy` in `db_bench`.
Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583
Reviewed By: pdillinger
Differential Revision: D29888050
Pulled By: bjlemaire
fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1
Summary:
PR https://github.com/facebook/rocksdb/issues/8519 fix db_bench_tool.cc for MSVC build errors by simply copy-paste, this PR fix the copy-paste while also works for MSVC.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8553
Reviewed By: ajkr
Differential Revision: D29838056
Pulled By: jay-zhuang
fbshipit-source-id: 0cd60c146b87a355c3dc1061dfe813169d75cea4
Summary:
Now we can analyze the MultiGet queries in the trace file and generate a set of the statistic and analysis files. Note that, when one MultiGet access N keys, we count each sub-get-query individually. But the over all query number is still the MultiGet not the sub-get-query.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8575
Test Plan: added new unit test and make check
Reviewed By: anand1976
Differential Revision: D29860633
Pulled By: zhichao-cao
fbshipit-source-id: a132128527f36828d266df8e36e3ec626c2170be
Summary:
Add flags `overwrite_probability` and `overwrite_window_size` flag to `db_bench`.
Add the possibility of performing a `filluniquerandom` benchmark with an overwrite probability.
For each write operation, there is a probability _p_ that the write is an overwrite (_p_=`overwrite_probability`).
When an overwrite is decided, the key is randomly chosen from the last _N_ keys previously inserted into the DB (with _N_=`overwrite_window_size`).
When a pure write is decided, the key inserted into the DB is unique and therefore will not be an overwrite.
The `overwrite_window_size` is used so that the user can decide if the overwrite are mostly targeting recently inserted keys (when `overwrite_window_size` is small compared to the total number of writes), or can also target keys inserted "a long time ago" (when `overwrite_window_size` is comparable to total number of writes).
Note that total number of writes = # of unique insertions + # of overwrites.
No unit test specifically added.
Local testing show the following **throughputs** for `filluniquerandom` with 1M total writes:
- bypass the code inserts (no `overwrite_probability` flag specified): ~14.0MB/s
- `overwrite_probability=0.99`, `overwrite_window_size=10`: ~17.0MB/s
- `overwrite_probability=0.10`, `overwrite_window_size=10`: ~14.0MB/s
- `overwrite_probability=0.99`, `overwrite_window_size=1M`: ~14.5MB/s
- `overwrite_probability=0.10`, `overwrite_window_size=1M`: ~14.0MB/s
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8569
Reviewed By: pdillinger
Differential Revision: D29818631
Pulled By: bjlemaire
fbshipit-source-id: d472b4ea4e457a4da7c4ee4f14b40cccd6a4587a
Summary:
Tiny PR to add the `experimental_allow_mempurge` to the `db_bench` tool (`Mempurge` is the current prototype for memtable garbage collection).
This is useful to benchmark the prototype of this new feature, stress test it and help find new meaningful heuristics for GC.
By default, the flag to allow `mempurge` is set to `false`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8546
Reviewed By: anand1976
Differential Revision: D29738338
Pulled By: bjlemaire
fbshipit-source-id: 01892883a2f1c714c110718674da05992d6e2dd6
Summary:
Right now, db_bench with seekrandom and multiple DB setup creates iterator for all DBs just to query one of them. It's different from most real workloads. Fix it by only creating iterators that will be queried.
Also fix a bug that DBs are not destroyed in multi-DB mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7818
Test Plan: Run db_bench with single/multiDB X using/not using tailing iterator with ASAN build, and validate the behavior is expected.
Reviewed By: ajkr
Differential Revision: D25720226
fbshipit-source-id: c2ff7ff7120e5ba64287a30b057c5d29b2cbe20b
Summary:
Add `experiemental_allow_mempurge` flag support for `db_stress` and `db_crashtest.py`, with a `false` default value.
I succesfully tested locally both `whitebox` and `blackbox` crash tests with `experiemental_allow_mempurge` flag set as true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8545
Reviewed By: akankshamahajan15
Differential Revision: D29734513
Pulled By: bjlemaire
fbshipit-source-id: 24316c0eccf6caf409e95c035f31d822c66714ae
Summary:
Fixed a few MSVC (VCToolsVersion=14.0) build errors and warnings
* `DEFINE_string` is a macro and VC compiler complains that it cannot put [ifdef-inside-define](https://stackoverflow.com/questions/5586429/ifdef-inside-define)
* `sleep()` is not a recognizable function. Use `FLAGS_env->SleepForMicroseconds` instead
* Define precise type in comparison to avoid mismatch warning
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8519
Reviewed By: jay-zhuang
Differential Revision: D29683086
fbshipit-source-id: 8c80941472089f8daba84ae29597e75e603850e4
Summary:
1. Fix printing of stats when there are no writes (wamp=0). Previously had a div0 error
2. Added multireadrandom command as a valid target
3. Added ability to pass additional command line options to db_bench. Now can say things like benchmark.sh readrandom --mmap_read and the option will be passed to db_bench.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8346
Reviewed By: zhichao-cao
Differential Revision: D29500436
Pulled By: mrambacher
fbshipit-source-id: 54e90708aae9133be3a903e35efdf8f8abbd86fa
Summary:
Inject read failures in DB reopen, just as what we do for metadata writes and writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476
Test Plan: Some manual tests and make sure failures are triggered.
Reviewed By: anand1976
Differential Revision: D29507283
fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c
Summary:
Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474
Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected.
Reviewed By: zhichao-cao
Differential Revision: D29503116
fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab
Summary:
Add a ```-secondary_cache_uri``` to db_stress to allow the user to specify a custom ```SecondaryCache``` object from the object registry. Also allow db_crashtest.py to be run with an alternate db_stress location. Together, these changes will allow us to run db_stress using FB internal components.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8455
Reviewed By: zhichao-cao
Differential Revision: D29371972
Pulled By: anand1976
fbshipit-source-id: dd1b1fd80ebbedc11aa63d9246ea6ae49edb77c4
Summary:
Hello and thanks for RocksDB,
Here is a PR to add file deletes, renames and ```Flush()```, ```Sync()```, ```Fsync()``` and ```Close()``` to file ops report.
The reason is to help tune RocksDB options when using an env/filesystem with high latencies for file level ("metadata") operations, typically seen during ```DB::Open``` (```db_bench -num 0``` also see https://github.com/facebook/rocksdb/pull/7203 where IOTracing does not trace ```DB::Open```).
Before:
```
> db_bench -benchmarks updaterandom -num 0 -report_file_operations true
...
Entries: 0
...
Num files opened: 12
Num Read(): 6
Num Append(): 8
Num bytes read: 6216
Num bytes written: 6289
```
After:
```
> db_bench -benchmarks updaterandom -num 0 -report_file_operations true
...
Entries: 0
...
Num files opened: 12
Num files deleted: 3
Num files renamed: 4
Num Flush(): 10
Num Sync(): 5
Num Fsync(): 1
Num Close(): 2
Num Read(): 6
Num Append(): 8
Num bytes read: 6216
Num bytes written: 6289
```
Before:
```
> db_bench -benchmarks updaterandom -report_file_operations true
...
Entries: 1000000
...
Num files opened: 18
Num Read(): 396339
Num Append(): 1000058
Num bytes read: 892030224
Num bytes written: 187569238
```
After:
```
> db_bench -benchmarks updaterandom -report_file_operations true
...
Entries: 1000000
...
Num files opened: 18
Num files deleted: 5
Num files renamed: 4
Num Flush(): 1000068
Num Sync(): 9
Num Fsync(): 1
Num Close(): 6
Num Read(): 396339
Num Append(): 1000058
Num bytes read: 892030224
Num bytes written: 187569238
```
Another example showing how using ```DB::OpenForReadOnly``` reduces file operations compared to ```((Optimistic)Transaction)DB::Open```:
```
> db_bench -benchmarks updaterandom -num 1
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -readonly true -report_file_operations true
...
Entries: 0
...
Num files opened: 8
Num files deleted: 0
Num files renamed: 0
Num Flush(): 0
Num Sync(): 0
Num Fsync(): 0
Num Close(): 0
Num Read(): 13
Num Append(): 0
Num bytes read: 374
Num bytes written: 0
```
```
> db_bench -benchmarks updaterandom -num 1
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_file_operations true
...
Entries: 0
...
Num files opened: 14
Num files deleted: 3
Num files renamed: 4
Num Flush(): 14
Num Sync(): 5
Num Fsync(): 1
Num Close(): 3
Num Read(): 11
Num Append(): 10
Num bytes read: 7291
Num bytes written: 7357
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8448
Reviewed By: anand1976
Differential Revision: D29333818
Pulled By: zhichao-cao
fbshipit-source-id: a06a8c87f799806462319115195b3e94faf5f542
Summary:
Add an argument to ldb to dump live file names, column families, and levels, `list_live_files_metadata`. The output shows all active SST file names, sorted first by column family and then by level. For each level the SST files are sorted alphabetically.
Typically, the output looks like this:
```
./ldb --db=/tmp/test_db list_live_files_metadata
Live SST Files:
===== Column Family: default =====
---------- level 0 ----------
/tmp/test_db/000069.sst
---------- level 1 ----------
/tmp/test_db/000064.sst
/tmp/test_db/000065.sst
/tmp/test_db/000066.sst
/tmp/test_db/000071.sst
---------- level 2 ----------
/tmp/test_db/000038.sst
/tmp/test_db/000039.sst
/tmp/test_db/000052.sst
/tmp/test_db/000067.sst
/tmp/test_db/000070.sst
------------------------------
```
Second, a flag was added `--sort_by_filename`, to change the layout of the output. When this flag is added to the command, the output shows all active SST files sorted by name, in front of which the LSM level and the column family are mentioned. With the same example, the following command would return:
```
./ldb --db=/tmp/test_db list_live_files_metadata --sort_by_filename
Live SST Files:
/tmp/test_db/000038.sst : level 2, column family 'default'
/tmp/test_db/000039.sst : level 2, column family 'default'
/tmp/test_db/000052.sst : level 2, column family 'default'
/tmp/test_db/000064.sst : level 1, column family 'default'
/tmp/test_db/000065.sst : level 1, column family 'default'
/tmp/test_db/000066.sst : level 1, column family 'default'
/tmp/test_db/000067.sst : level 2, column family 'default'
/tmp/test_db/000069.sst : level 0, column family 'default'
/tmp/test_db/000070.sst : level 2, column family 'default'
/tmp/test_db/000071.sst : level 1, column family 'default'
------------------------------
```
Thus, the user can either request to show the files by levels, or sorted by filenames.
This PR includes a simple Python unit test that makes sure the file name and level printed out by this new feature matches the one found with an existing feature, `dump_live_file`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8446
Reviewed By: akankshamahajan15
Differential Revision: D29320080
Pulled By: bjlemaire
fbshipit-source-id: 01fb7b5637c59010d74c80730a28d815994e7009
Summary:
At the moment, the following command : "`./ --db=mypath/ dump_file_files`" returns a series of erronous names with double slashes, ie: "`mypath//000xxx.sst`", including manifest file names with double slashes "`mypath//MANIFEST-00XXX`", whereas "`./ --db=mypath dump_file_files`" correctly returns "`mypath/000xxx.sst`" and "`mypath/MANIFEST-00XXX`".
This (very short) PR simply checks if there is a need to add or remove any '`/`' character when the `db_path` and `manifest_filename`/sst `filenames` are concatenated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8439
Reviewed By: akankshamahajan15
Differential Revision: D29301349
Pulled By: bjlemaire
fbshipit-source-id: 3e9e58f9749d278b654ae838fcee13ad698705a8
Summary:
Tracing the MultiGet information including timestamp, keys, and CF_IDs to the trace file for analyzing and replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8421
Test Plan: make check, add test to trace_analyzer_test
Reviewed By: anand1976
Differential Revision: D29221195
Pulled By: zhichao-cao
fbshipit-source-id: 30c677d6c39ab31ef4bbdf7e0d1fa1fd79f295ff
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.
Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
of blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242
Test Plan: Add new unit test
Reviewed By: anand1976
Differential Revision: D28521703
Pulled By: akankshamahajan15
fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.
Some small refactoring in db_stress.
Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408
Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:
bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384
ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384
Reviewed By: mrambacher
Differential Revision: D29140805
Pulled By: pdillinger
fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
Summary:
- Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load. This method/signature is a precursor to making these classes extend Customizable.
- Added CreateFromSystem to Env. This method standardizes creating an Env from the environment variables. Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI. Now the code is more command/standardized.
- Added CreateFromFlags to Env. These method allows Env to be create from string options (such as GFLAGS options) in a more standard way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174
Reviewed By: zhichao-cao
Differential Revision: D28999603
Pulled By: mrambacher
fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb
Summary:
Changed fprintf function to fputc in ApplyVersionEdit, and replaced null characters with whitespaces.
Added unit test in ldb_test.py - verifies that manifest_dump --verbose output is correct when keys and values containing null characters are inserted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8378
Reviewed By: pdillinger
Differential Revision: D29034584
Pulled By: bjlemaire
fbshipit-source-id: 50833687a8a5f726e247c38457eadc3e6dbab862
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
Test Plan: add the test to lru_cache_test
Reviewed By: pdillinger
Differential Revision: D29006215
Pulled By: zhichao-cao
fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
Summary:
- Fix cmake build failure with gflags.
- Add CI tests for both gflags 2.1 and 2.2.
- Fix ctest config with gtest.
- Add CI to run test with ctest.
One benefit of ctest is it support timeout, it's set to 5min in our CI, so we will know which test is hang.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8324
Test Plan: CI pass
Reviewed By: ajkr
Differential Revision: D28762517
Pulled By: jay-zhuang
fbshipit-source-id: 09063c5af5f9f33abfcdeb48593acbd9826cd199
Summary:
Whitebox crash test can run significantly over the time limit for test slowness or no kiling points. This indefinite job can create problem when this test is periodically scheduled as a job. Instead, kill the job if it is 15 minutes over the limit.
Refactor the code slightly to consolidate the code for executing commands for white and black box tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8341
Test Plan: Run both of black and white box tests with both of natual and explicit kill condition.
Reviewed By: jay-zhuang
Differential Revision: D28756170
fbshipit-source-id: f253149890e62ace78f871be927e093e9b12f49b
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.
The new information can be seen in info LOG, for example:
Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):
$ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
rocksdb.block-cache-entry-stats.bytes.data-block: 0
rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
rocksdb.block-cache-entry-stats.bytes.index-block: 178992
rocksdb.block-cache-entry-stats.bytes.misc: 0
rocksdb.block-cache-entry-stats.bytes.other-block: 0
rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
rocksdb.block-cache-entry-stats.capacity: 8388608
rocksdb.block-cache-entry-stats.count.data-block: 0
rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
rocksdb.block-cache-entry-stats.count.index-block: 215
rocksdb.block-cache-entry-stats.count.misc: 1
rocksdb.block-cache-entry-stats.count.other-block: 0
rocksdb.block-cache-entry-stats.count.write-buffer: 0
rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
rocksdb.block-cache-entry-stats.percent.misc: 0.000000
rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.
This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.
An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".
Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.
Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.
This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
Test Plan: manual testing with db_bench, and a couple of basic unit tests
Reviewed By: ltamasi
Differential Revision: D28488721
Pulled By: pdillinger
fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
Summary:
This PR adds a ```-secondary_cache_uri``` option to the cache_bench and db_bench tools to allow the user to specify a custom secondary cache URI. The object registry is used to create an instance of the ```SecondaryCache``` object of the type specified in the URI.
The main cache_bench code is packaged into a separate library, similar to db_bench.
An example invocation of db_bench with a secondary cache URI -
```db_bench --env_uri=ws://ws.flash_sandbox.vll1_2/ -db=anand/nvm_cache_2 -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=67108864 -cache_index_and_filter_blocks=true -secondary_cache_uri='cachelibwrapper://filename=/home/anand76/nvm_cache/cache_file;size=2147483648;regionSize=16777216;admPolicy=random;admProbability=1.0;volatileSize=8388608;bktPower=20;lockPower=12' -partition_index_and_filters=true -duration=1800```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8312
Reviewed By: zhichao-cao
Differential Revision: D28544325
Pulled By: anand1976
fbshipit-source-id: 8f209b9af900c459dc42daa7a610d5f00176eeed
Summary:
This patch does two things:
1) Introduces some aliases in order to eliminate/prevent long-winded type names
w/r/t the internal table property collectors (see e.g.
`std::vector<std::unique_ptr<IntTblPropCollectorFactory>>`).
2) Makes it possible to apply only a subrange of table property collectors during
table building by turning `TableBuilderOptions::int_tbl_prop_collector_factories`
from a pointer to a `vector` into a range (i.e. a pair of iterators).
Rationale: I plan to introduce a BlobDB related table property collector, which
should only be applied during table creation if blob storage is enabled at the moment
(which can be changed dynamically). This change will make it possible to include/
exclude the BlobDB related collector as needed without having to introduce
a second `vector` of collectors in `ColumnFamilyData` with pretty much the same
contents.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8298
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D28430910
Pulled By: ltamasi
fbshipit-source-id: a81d28f2c59495865300f43deb2257d2e6977c8e
Summary:
As a part of tiered storage, writing tempeature information to manifest is needed so that after DB recovery, RocksDB still has the tiering information, to implement some further necessary functionalities.
Also fix some issues in simulated hybrid FS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8284
Test Plan: Add a new unit test to validate that the information is indeed written and read back.
Reviewed By: zhichao-cao
Differential Revision: D28335801
fbshipit-source-id: 56aeb2e6ea090be0200181dd968c8a7278037def
Summary:
In https://github.com/facebook/rocksdb/issues/8268, the `db_stress` stdout began containing both the strings
"fail" and "error" (case-insensitive). The whitebox crash test
failed upon seeing either of those strings.
I checked that all other occurrences of "fail" and "error"
(case-insensitive) that `db_stress` produces are printed to `stderr`. So
this PR separates the handling of `db_stress`'s stdout and stderr, and
only fails when one those bad strings are found in stderr.
The downside of this PR is `db_stress`'s original interleaving of stdout/stderr is not preserved in `db_crashtest.py`'s output.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8272
Test Plan:
run it; see it succeeds for several runs until encountering a real error
```
$ python3 tools/db_crashtest.py whitebox --simple --random_kill_odd=8887 --max_key=1000000 --value_size_mult=33
...
db_stress: cache/clock_cache.cc:483: bool rocksdb::{anonymous}::ClockCacheShard::Unref(rocksdb::{anonymous}::CacheHandle*, bool, rocksdb::{anonymous}::CleanupContext*): Assertion `CountRefs(flags) > 0' failed.
TEST FAILED. Output has 'fail'!!!
```
Reviewed By: zhichao-cao
Differential Revision: D28239233
Pulled By: ajkr
fbshipit-source-id: 3b8602a0d570466a7e2c81bb9c49468f7716091e
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
In testing for https://github.com/facebook/rocksdb/issues/8225 I found cache_bench would crash with
-use_clock_cache, as well as db_bench -use_clock_cache, but not
single-threaded. Smaller cache size hits failure much faster. ASAN
reported the failuer as calling malloc_usable_size on the `key` pointer
of a ClockCache handle after it was reportedly freed. On detailed
inspection I found this bad sequence of operations for a cache entry:
state=InCache=1,refs=1
[thread 1] Start ClockCacheShard::Unref (from Release, no mutex)
[thread 1] Decrement ref count
state=InCache=1,refs=0
[thread 1] Suspend before CalcTotalCharge (no mutex)
[thread 2] Start UnsetInCache (from Insert, mutex held)
[thread 2] clear InCache bit
state=InCache=0,refs=0
[thread 2] Calls RecycleHandle (based on pre-updated state)
[thread 2] Returns to Insert which calls Cleanup which deletes `key`
[thread 1] Resume ClockCacheShard::Unref
[thread 1] Read `key` in CalcTotalCharge
To fix this, I've added a field to the handle to store the metadata
charge so that we can efficiently remember everything we need from
the handle in Unref. We must not read from the handle again if we
decrement the count to zero with InCache=1, which means we don't own
the entry and someone else could eject/overwrite it immediately.
Note before this change, on amd64 sizeof(Handle) == 56 even though there
are only 48 bytes of data. Grouping together the uint32_t fields would
cut it down to 48, but I've added another uint32_t, which takes it
back up to 56. Not a big deal.
Also fixed DisownData to cooperate with ASAN as in LRUCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8261
Test Plan:
Manual + adding use_clock_cache to db_crashtest.py
Base performance
./cache_bench -use_clock_cache
Complete in 17.060 s; QPS = 2458513
New performance
./cache_bench -use_clock_cache
Complete in 17.052 s; QPS = 2459695
Any difference is easily buried in small noise.
Crash test shows still more bug(s) in ClockCache, so I'm expecting to
disable ClockCache from production code in a follow-up PR (if we
can't find and fix the bug(s))
Reviewed By: mrambacher
Differential Revision: D28207358
Pulled By: pdillinger
fbshipit-source-id: aa7a9322afc6f18f30e462c75dbbe4a1206eb294
Summary:
As the first part of the effort of having placing different files on different storage types, this change introduces several things:
(1) An experimental interface in FileSystem that specify temperature to a new file created.
(2) A test FileSystemWrapper, SimulatedHybridFileSystem, that simulates HDD for a file of "warm" temperature.
(3) A simple experimental feature ColumnFamilyOptions.bottommost_temperature. RocksDB would pass this value to FileSystem when creating any bottommost file.
(4) A db_bench parameter that applies the (2) and (3) to db_bench.
The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222
Test Plan:
./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000
followed by
./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000
and see results as expected.
Reviewed By: ajkr
Differential Revision: D28003028
fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.
Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.
Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).
Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240
Test Plan: ASAN make check
Reviewed By: mrambacher
Differential Revision: D28075891
Pulled By: pdillinger
fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
Summary:
DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place.
If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully.
This option is enabled in crash test in half of the time.
Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235
Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory.
Reviewed By: anand1976
Differential Revision: D28010944
fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
Summary:
Add 6.18, 6.19 and 6.20 to check_format_compatible.sh
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8236
Test Plan: ./tools/check_format_compatible.sh (tested without 2.7.fb as it was failing as mentioned in the script)
Reviewed By: mrambacher
Differential Revision: D28019160
Pulled By: akankshamahajan15
fbshipit-source-id: b59a7c5c14cb4c115926e9ae7c74ea586b22c9ed
Summary:
This partially reverts commit 10196d7edc.
The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.
We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212
Test Plan: CI
Reviewed By: jay-zhuang
Differential Revision: D27896468
Pulled By: pdillinger
fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e
Summary:
Since the Ribbon filter schema seems good (compatible back to
6.15.0), this change commits to long term support of the SST schema,
even though we expect the API for enabling Ribbon to change (still
called NewExperimentalRibbonFilterPolicy).
This also adds support for "hybrid" configuration in which some levels
use Bloom (higher levels, lower numbered) for speed and the rest use
Ribbon (lower levels, higher numbered) for memory space efficiency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198
Test Plan: unit test added, crash test support
Reviewed By: jay-zhuang
Differential Revision: D27831232
Pulled By: pdillinger
fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab
Summary:
Extend the DB::GetLiveFilesChecksumInfo API to blob files.
This API is also used by the file_checksum_dump ldb command to dump checksum
of SST files which now also dumps blob files checksum.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8179
Test Plan: Add new unit test
Reviewed By: zhichao-cao
Differential Revision: D27714965
Pulled By: akankshamahajan15
fbshipit-source-id: d8b7343ea845a64c83800336d88cced7152a8c92
Summary:
Enable backup/restore functionality with Integrated BlobDB in
db_stress and crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8165
Test Plan:
Ran python3 -u tools/db_crashtest.py --simple whitebox along
with :
1. decreased "backup_in_one" value for backups to be more frequent and
2. manually changed code for "enable_blob_file" to be always true and
apply blobdb params 100% for testing purpose.
Reviewed By: ltamasi
Differential Revision: D27636025
Pulled By: akankshamahajan15
fbshipit-source-id: 0d0e0d1479ced163f992872dc998e79c581bfc99
Summary:
According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last
key in the previous partition" so that SeekForPrev() in prefix mode can return correct result.
The prefix of the last key in the previous partition does not necessarily have the same prefix
as the first key in the current partition. Regardless of the first key in current partition, the
prefix of the last key in the previous partition should be added. The existing code, however,
does not follow this. Furthermore, there is another issue: when finishing current filter partition,
`FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively
overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds
to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding
the bloom of the prefix of the last key of the previous partition.
Prefix extractor is FixedLength.2.
```
[ filter part 1 ] [ filter part 2 ]
abc d
```
When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc"
but smaller than "d".
If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build
incorrect iterator tree in non-total-order mode.
Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion
error when checking on an arbitrary key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137
Test Plan:
```
make check
```
Without this fix, the following command will fail pretty soon.
```
./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \
--avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \
--batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \
--bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \
--checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \
--compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \
--compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \
--compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \
--continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \
--db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \
--enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \
--format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \
--get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \
--iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \
--log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \
--max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \
--max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \
--nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \
--periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --read_fault_one_in=0 --read_only=0 \
--readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \
--snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \
--sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \
--target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \
--top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \
--use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \
--use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \
--verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \
--write_dbid_to_manifest=1 --writepercent=35
```
Reviewed By: pdillinger
Differential Revision: D27553054
Pulled By: riversand963
fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98
Summary:
Fixes https://github.com/facebook/rocksdb/issues/6548.
If we do not reset the pinnable slice before calling get, we will see the following assertion failure
while running the test with multiple column families.
```
db_bench: ./include/rocksdb/slice.h:168: void rocksdb::PinnableSlice::PinSlice(const rocksdb::Slice&, rocksdb::Cleanable*): Assertion `!pinned_' failed.
```
This happens in `BlockBasedTable::Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8154
Test Plan:
./db_bench --benchmarks=fillseq -num_column_families=3
./db_bench --benchmarks=readrandom -use_existing_db=1 -num_column_families=3
Reviewed By: ajkr
Differential Revision: D27587589
Pulled By: riversand963
fbshipit-source-id: 7379e7649ba40f046d6a4014c9ad629cb3f9a786
Summary:
Add request_id in IODebugContext which will be populated by
underlying FileSystem for IOTracing purposes. Update IOTracer to trace
request_id in the tracing records. Provided API
IODebugContext::SetRequestId which will set the request_id and enable
tracing for request_id. The API hides the implementation and underlying
file system needs to call this API directly.
Update DB::StartIOTrace API and remove redundant Env* from the
argument as its not used and DB already has Env that is passed down to
IOTracer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045
Test Plan: Update unit test.
Differential Revision: D26899871
Pulled By: akankshamahajan15
fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b
Summary:
The check in db_bench for table_cache_numshardbits was 0 < bits <= 20, whereas the check in LRUCache was 0 < bits < 20. Changed the two values to match to avoid a crash in db_bench on a null cache.
Fixes https://github.com/facebook/rocksdb/issues/7393
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8110
Reviewed By: zhichao-cao
Differential Revision: D27353522
Pulled By: mrambacher
fbshipit-source-id: a414bd23b5bde1f071146b34cfca5e35c02de869
Summary:
Currently, partitioned filter does not support user-defined timestamp. Disable it for now in ts stress test so that
the contrun jobs can proceed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8127
Test Plan: make crash_test_with_ts
Reviewed By: ajkr
Differential Revision: D27388488
Pulled By: riversand963
fbshipit-source-id: 5ccff18121cb537bd82f2ac072cd25efb625c666
Summary:
Previously it only applied to block-based tables generated by flush. This restriction
was undocumented and blocked a new use case. Now compression sampling
applies to all block-based tables we generate when it is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105
Test Plan: new unit test
Reviewed By: riversand963
Differential Revision: D27317275
Pulled By: ajkr
fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0
Summary:
Fix the following error while running `make crash_test`
```
Traceback (most recent call last):
File "tools/db_crashtest.py", line 705, in <module>
main()
File "tools/db_crashtest.py", line 696, in main
blackbox_crash_main(args, unknown_args)
File "tools/db_crashtest.py", line 479, in blackbox_crash_main
+ list({'db': dbname}.items())), unknown_args)
File "tools/db_crashtest.py", line 414, in gen_cmd
finalzied_params = finalize_and_sanitize(params)
File "tools/db_crashtest.py", line 331, in finalize_and_sanitize
dest_params.get("user_timestamp_size") > 0):
TypeError: '>' not supported between instances of 'NoneType' and 'int'
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8091
Test Plan: make crash_test
Reviewed By: ltamasi
Differential Revision: D27268276
Pulled By: riversand963
fbshipit-source-id: ed2873b9587ecc51e24abc35ef2bd3d91fb1ed1b
Summary:
Add some basic test for user-defined timestamp to db_stress. Currently,
read with timestamp always tries to read using the current timestamp.
Due to the per-key timestamp-sequence ordering constraint, we only add timestamp-
related tests to the `NonBatchedOpsStressTest` since this test serializes accesses
to the same key and uses a file to cross-check data correctness.
The timestamp feature is not supported in a number of components, e.g. Merge, SingleDelete,
DeleteRange, CompactionFilter, Readonly instance, secondary instance, SST file ingestion, transaction,
etc. Therefore, db_stress should exit if user enables both timestamp and these features at the same
time. The (currently) incompatible features can be found in
`CheckAndSetOptionsForUserTimestamp`.
This PR also fixes a bug triggered when timestamp is enabled together with
`index_type=kBinarySearchWithFirstKey`. This bug fix will also be in another separate PR
with more unit tests coverage. Fixing it here because I do not want to exclude the index type
from crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8061
Test Plan: make crash_test_with_ts
Reviewed By: jay-zhuang
Differential Revision: D27056282
Pulled By: riversand963
fbshipit-source-id: c3e00ad1023fdb9ebbdf9601ec18270c5e2925a9
Summary:
Since our stress/crash tests by default generate values of size 8, 16, or 24,
it does not make much sense to set `min_blob_size` to 256. The patch
updates the set of potential `min_blob_size` values in the crash test
script and in `db_stress` where it might be set dynamically using
`SetOptions`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8085
Test Plan: Ran `make check` and tried the crash test script.
Reviewed By: riversand963
Differential Revision: D27238620
Pulled By: ltamasi
fbshipit-source-id: 4a96f9944b1ed9220d3045c5ab0b34c49009aeee
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071
Test Plan: make check, added new unit test.
Reviewed By: anand1976
Differential Revision: D27177043
Pulled By: zhichao-cao
fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
Summary:
Currently, a few ldb commands do not check the execution result of
database operations. This PR checks the execution results and tries to
improve the error reporting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8072
Test Plan:
```
make check
```
and
```
ASSERT_STATUS_CHECKED=1 make -j20 ldb
python tools/ldb_test.py
```
Reviewed By: zhichao-cao
Differential Revision: D27152466
Pulled By: riversand963
fbshipit-source-id: b94220496a4b3591b61c1d350f665860a6579f30
Summary:
The new options are:
* compact0 - compact L0 into L1 using one thread
* compact1 - compact L1 into L2 using one thread
* flush - flush memtable
* waitforcompaction - wait for compaction to finish
These are useful for reproducible benchmarks to help get the LSM tree shape
into a deterministic state. I wrote about this at:
http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027
Reviewed By: riversand963
Differential Revision: D27053861
Pulled By: ajkr
fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
This PR adds support for custom file systems to ldb and sst_dump by adding command line options for specifying --fs_uri and --backup_fs uri (for ldb backup/restore commands). fs_uri is already supported in db_bench and db_stress, and there is already support in ldb and db stress for specifying customized envs.
The PR also fixes what looks like a bug in the ldb backup/restore commands. As it is right now, backups can only be made from and to the same environment/file system which does not seem to be the intended behavior. This PR makes it possible to do/restore backups between different envs/file systems.
Example:
`./ldb backup --fs_uri=zenfs://dev:nvme2n1 --backup_fs_uri=posix:// --backup_dir=/tmp/my_rocksdb_backup --db=rocksdbtest/dbbench
`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8010
Reviewed By: jay-zhuang
Differential Revision: D26904654
Pulled By: ajkr
fbshipit-source-id: 9b695ed8b944fcc6b27c4daaa9f52e87ee2c1fb4
Summary:
Removed confusing, awkward, and undocumented internal API
ReadOneLine and replaced with very simple LineFileReader.
In refactoring backupable_db.cc, this has the side benefit of
removing the arbitrary cap on the size of backup metadata files.
Also added Status::MustCheck to make it easy to mark a Status as
"must check." Using this, I can ensure that after
LineFileReader::ReadLine returns false the caller checks GetStatus().
Also removed some excessive conditional compilation in status.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026
Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED
Reviewed By: mrambacher
Differential Revision: D26831687
Pulled By: pdillinger
fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f
Summary:
Haven't seen any production issues with new Bloom filter and
it's now > 1 year old (added in 6.6.0).
Updated check_format_compatible.sh and HISTORY.md
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017
Test Plan: tests updated (or prior bugs fixed)
Reviewed By: ajkr
Differential Revision: D26762197
Pulled By: pdillinger
fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
Summary:
* Adds backup/restore forward/backward compatibility testing
* Adds forward/backward compatibility testing to sst ingestion
* More structure sharing and comments for the lists of branches
comprising each group
* Less reliant on invariants between groups with de-duplication logic
* Restructured for n+1 branch checkout+build steps rather than something
like 3n. Should be much faster despite more checks.
And to make manual runs easier
* On success, restores working trees to original working branch (aborts
early if uncommitted changes) and deletes temporary branch & remote
* Adds SHORT_TEST=1 mode that uses only the oldest version for each
* Adds USE_SSH=1 to use ssh instead of https for github
group
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8012
Test Plan:
a number of manual tests, mostly with SHORT_TEST=1. Using one
version older for any of the groups (except I didn't check
db_backward_only_refs) fails. Changing default format_version to 5
(planned) without updating this script fails as it should, and passes
with appropriate update. Full local run passed (had to remove "2.7.fb.branch"
due to compiler issues, also before this change).
Reviewed By: riversand963
Differential Revision: D26735840
Pulled By: pdillinger
fbshipit-source-id: 1320c22de5674760657e385aa42df9fade8b6fff
Summary:
Fixed 5 test case failures found on Windows 10/Windows Server 2016
1. In `flush_job_test`, the DestroyDir function fails in deconstructor because some file handles are still being held by VersionSet. This happens on Windows Server 2016, so need to manually reset versions_ pointer to release all file handles.
2. In `StatsHistoryTest.InMemoryStatsHistoryPurging` test, the capping memory cost of stats_history_size on Windows becomes 14000 bytes with latest changes, not just 13000 bytes.
3. In `SSTDumpToolTest.RawOutput` test, the output file handle is not closed at the end.
4. In `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on windows so `total_mem` is always equal to `total_size`. The internal memory fragmentation assertion does not apply in this case.
5. In `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach 87.5% compression ratio with original CreateTable method, so I append extra zeros to the string value to enhance compression ratio. Beside, since XPRESS allocates memory internally, thus does not support for custom allocator verification, we will skip the allocator verification for XPRESS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992
Reviewed By: jay-zhuang
Differential Revision: D26615283
Pulled By: ajkr
fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
Related changes include:
- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
Reviewed By: pdillinger
Differential Revision: D26467994
Pulled By: ajkr
fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
Summary:
The trace file record and payload encode is fixed, which requires complex backward compatibility resolving. This PR introduce a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatible issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977
Test Plan: make check. tested with old trace file in replay and analyzing.
Reviewed By: anand1976
Differential Revision: D26529948
Pulled By: zhichao-cao
fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
Summary:
The patch adds checkpoint support to BlobDB. Blob files are hard linked or
copied, depending on whether the checkpoint directory is on the same filesystem
or not, similarly to table files.
TODO: Add support for blob files to `ExportColumnFamily` and to the checksum
verification logic used by backup/restore.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7959
Test Plan: Ran `make check` and the crash test for a while.
Reviewed By: riversand963
Differential Revision: D26434768
Pulled By: ltamasi
fbshipit-source-id: 994be55a8dc08133028250760fca440d2c7c4dc5
Summary:
The patch adds the configuration options of the new BlobDB implementation
to `db_bench` and adjusts the help messages of the old (`StackableDB`-based)
BlobDB's options to make it clear which implementation they pertain to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7956
Test Plan: Ran `make check` and `db_bench` with the new options.
Reviewed By: jay-zhuang
Differential Revision: D26384808
Pulled By: ltamasi
fbshipit-source-id: b4405bb2c56cfd3506d4c32e3329c08dfdf69c94
Summary:
The patch fixes the build for `db_bench_tool_test` and makes the tests pass.
Namely, it fixes the following issues:
* https://github.com/facebook/rocksdb/issues/7703 removed the member variable `fs_` but the test case `OptionsFileMultiLevelUniversal`
was not updated.
* https://github.com/facebook/rocksdb/issues/7344 fixed the `OptionsFile` test case for the case when Snappy is *not* available but at the
same time broke it for the case when it *is* available. (The test used a default-constructed
`ColumnFamilyOptions` object, and the default value of the `compression` option is either
Snappy or no compression depending on whether Snappy is supported.)
* The test used `google::ParseCommandLineFlags` instead of
`GFLAGS_NAMESPACE::ParseCommandLineFlags`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7935
Test Plan: Ran the test both with and without Snappy support.
Reviewed By: zhichao-cao
Differential Revision: D26269765
Pulled By: ltamasi
fbshipit-source-id: b7303d8a981ab299d22ab540e0cbd12d149ed9bb
Summary:
The patch adds support for the options related to the new BlobDB implementation
to `db_stress`, including support for dynamically adjusting them using `SetOptions`
when `set_options_one_in` and a new flag `allow_setting_blob_options_dynamically`
are specified. (The latter is used to prevent the options from being enabled when
incompatible features are in use.)
The patch also updates the `db_stress` help messages of the existing stacked BlobDB
related options to clarify that they pertain to the old implementation. In addition, it
adds the new BlobDB to the crash test script. In order to prevent a combinatorial explosion
of jobs and still perform whitebox/blackbox testing (including under ASAN/TSAN/UBSAN),
and to also test BlobDB in conjunction with atomic flush and transactions, the script sets
the BlobDB options in 10% of normal/`cf_consistency`/`txn` crash test runs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7900
Test Plan: Ran `make check` and `db_stress`/`db_crashtest.py` with various options.
Reviewed By: jay-zhuang
Differential Revision: D26094913
Pulled By: ltamasi
fbshipit-source-id: c2ef3391a05e43a9687f24e297df05f4a5584814
Summary:
The unimplemented handler will return Status::InvalidArgument() and caused issues when using trace analyzer for write batch record. Override with returning Status::OK()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7910
Test Plan: tested with real trace, make check
Reviewed By: siying
Differential Revision: D26154327
Pulled By: zhichao-cao
fbshipit-source-id: bcdefd4891f839b2e89e4c079f9f430245f482fb
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).
The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.
When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748
Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.
Reviewed By: pdillinger
Differential Revision: D25754492
Pulled By: ajkr
fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
Summary:
Removed the uses of the Legacy FileWrapper classes from the source code. The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem.
Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851
Reviewed By: anand1976
Differential Revision: D26114816
Pulled By: mrambacher
fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035
Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
- If the GIT branch is "clean" (no changes), the date of the last git commit
- If the branch is not clean, the current date
- Added APIs to access the "build information", rather than accessing the strings directly.
The build_version.cc file is now regenerated whenever the library objects are rebuilt.
Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866
Reviewed By: pdillinger
Differential Revision: D26086565
Pulled By: mrambacher
fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
Summary:
Currently, db_bench cleanup only deletes the main DB, if there's one.
Multiple DBs that are opened when --num_multi_db is specified are not
deleted, which can lead to crashes due to running compaction threads on
process exit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7891
Test Plan: Run regression test
Reviewed By: jay-zhuang
Differential Revision: D26049914
Pulled By: anand1976
fbshipit-source-id: acef2821001ca5e208a96a6a273c724e56353316
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
1. In IOTracing, add filename with each IOTrace record. Filename is stored in file object (Tracing Wrappers).
2. Change the logic of figuring out which additional information (file_size,
length, offset etc) needs to be store with each operation
which is different for different operations.
When new information will be added in future (depends on operation),
this change would make the future additions simple.
Logic: In IOTraceRecord, io_op_data is added and its
bitwise positions represent which additional information need
to added in the record from enum IOTraceOp. Values in IOTraceOp represent bitwise positions.
So if length and offset needs to be stored (IOTraceOp::kIOLen
is 1 and IOTraceOp::kIOOffset is 2), position 1 and 2 (from rightmost bit) will be set
and io_op_data will contain 110.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7885
Test Plan: Updated io_tracer_test and verified the trace file manually.
Reviewed By: anand1976
Differential Revision: D25982353
Pulled By: akankshamahajan15
fbshipit-source-id: ebfc5539cc0e231d7794a6b42b73f5403e360b22
Summary:
The regression_test.sh script checkpoints the DB directory before running db_bench on it. Specify the --try_load_options when creating the checkpoint in order to load options from the OPTIONS file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7864
Test Plan: manually run db_bench on the checkpoint dir
Reviewed By: akankshamahajan15
Differential Revision: D25926960
Pulled By: anand1976
fbshipit-source-id: d3442ae24a7044b474dc80efc9c06bdc6ebe0388
Summary:
When the --try_load_options is used in conjunction with the
--column_family option, ldb incorrectly sets the ColumnFamilyOptions for
that column family to defaults. This PR fixes that by retaining from the
OPTIONS file and applying command line overrides.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7847
Test Plan: Add a unit test in ldb_cmd_test
Reviewed By: ajkr
Differential Revision: D25874720
Pulled By: anand1976
fbshipit-source-id: 04bcf23b55e5a30b5b6a59b0e5cb4faef3da7429
Summary:
This PR does the following:
-> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.
With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system
- Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself.
With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction.
Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
Reviewed By: zhichao-cao
Differential Revision: D25762190
Pulled By: anand1976
fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
Summary:
Prior to this PR it prints the raw bytes which can include non-printable
characters. This PR adds the option to print in hex instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7820
Test Plan:
try it out
```
$ ./ldb file_checksum_dump --hex --db=/tmp/rocksdbtest-9383//db_basic_test_12281129388755189514/
16, FileChecksumCrc32c, 0xC789D948
```
Reviewed By: jay-zhuang
Differential Revision: D25738072
Pulled By: ajkr
fbshipit-source-id: 8cf2856877971756c0495cfa63a9a1281c414dc7
Summary:
The multireadrandom benchmark, when run for a specific number of reads (--reads argument), should base the duration on the actual number of keys read rather than number of batches.
Tests:
Run db_bench multireadrandom benchmark
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7817
Reviewed By: zhichao-cao
Differential Revision: D25717230
Pulled By: anand1976
fbshipit-source-id: 13f4d8162268cf9a34918655e60302d0aba3864b
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798
Reviewed By: jay-zhuang
Differential Revision: D25680451
Pulled By: pdillinger
fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917