Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788508
Pulled By: akankshamahajan15
fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
Summary:
I feel it would be nice if we can fix this spelling error.
In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490
Test Plan: make check
Reviewed By: hx235
Differential Revision: D33949862
Pulled By: riversand963
fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
* Add more micro-benchmark tests
* Expose an API in DBImpl for waiting for compactions (still not visible to the user)
* Add argument name for ribbon_bench
* remove benchmark run from CI, as it runs too long.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9436
Test Plan: CI
Reviewed By: riversand963
Differential Revision: D33777836
Pulled By: jay-zhuang
fbshipit-source-id: c05de3bc082cc05b5d019f00b324e774bf4bbd96
Summary:
Disallow `immutable_db_opts.use_direct_io_for_flush_and_compaction == true` and
`mutable_db_opts.writable_file_max_buffer_size == 0`, since it causes `WritableFileWriter::Append()`
to loop forever and does not make much sense in direct IO.
This combination of options itself does not make much sense: asking RocksDB to do direct IO but not allowing
RocksDB to allocate a buffer. We should detect this false combination and warn user early, no matter whether
the application is running on a platform that supports direct IO or not. In the case of platform **not** supporting
direct IO, it's ok if the user learns about this and then finds that direct IO is not supported.
One tricky thing: the constructor of `WritableFileWriter` is being used in our unit tests, and it's impossible
to return status code from constructor. Since we do not throw, I put an assertion for now. Fortunately,
the constructor is not exposed to external applications.
Closing https://github.com/facebook/rocksdb/issues/7109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9348
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D33371924
Pulled By: riversand963
fbshipit-source-id: 2a3701ab541cee23bffda8a36cdf37b2d235edfa
Summary:
Add an option to set the WAL compression algorithm - wal_compression.
TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432
Reviewed By: pdillinger
Differential Revision: D33797275
Pulled By: sidroyc
fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
Summary:
Wasn't able to easily reproduce error, but easy to see a race
condition between TestFlushListener::OnFlushCompleted and
DBTestBase::Close(), which frees CF handles before closing DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9400
Test Plan: CI etc.
Reviewed By: riversand963
Differential Revision: D33645134
Pulled By: pdillinger
fbshipit-source-id: d0ec914cc43c9e14f53da633876b95b61995138d
Summary:
Fix https://github.com/facebook/rocksdb/issues/8046 : FlushMemTable return ok but memtable does not synchronize flush. The way to fix it is to expose RecoveryError.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8173
Reviewed By: ajkr
Differential Revision: D31674552
Pulled By: jay-zhuang
fbshipit-source-id: 9d16b69ba12a196bb429332ec8224754de97773d
Summary:
Closing https://github.com/facebook/rocksdb/issues/5006
Calling `DB::DestroyColumnFamilyHandle(column_family)` with `column_family` being the return value of
`DB::DefaultColumnFamily()` will return `Status::InvalidArgument()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9347
Test Plan: make check
Reviewed By: akankshamahajan15
Differential Revision: D33369675
Pulled By: riversand963
fbshipit-source-id: a8266a4daddf2b7a773c2dc7f3eb9a4adfb6b6dd
Summary:
This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB.
By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334
Test Plan:
- See if below unsynced data loss crash test can run for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering.
```
DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400
```
Reviewed By: zhichao-cao
Differential Revision: D33301990
Pulled By: ajkr
fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3
Summary:
As (https://github.com/facebook/rocksdb/issues/9210) discussed, the **full_history_ts_low** is a member of CompactRangeOptions currently, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request.
However, users may want to advance the fllHistoryTsLow without an immediate compact.
This merge make IncreaseFullHistoryTsLow to a public API so users can advance each CF's fullHistoryTsLow seperately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221
Reviewed By: akankshamahajan15
Differential Revision: D33201106
Pulled By: riversand963
fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee
Summary:
The failure looked like this:
```
utilities/backupable/backupable_db_test.cc:3161: Failure
Value of: db_chroot_env_->FileExists(prev_manifest_path).IsNotFound()
Actual: false
Expected: true
```
The failure could be coerced consistently with the following patch:
```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 80410f671..637636791 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2772,6 +2772,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) {
if (job_context.HaveSomethingToClean() ||
job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
mutex_.Unlock();
+ bg_cv_.SignalAll();
+ sleep(1);
TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound");
// Have to flush the info logs before bg_flush_scheduled_--
// because if bg_flush_scheduled_ becomes 0 and the lock is
```
The cause was a familiar problem, which is manual flush/compaction may
return before files they obsoleted are removed. The solution is just to
wait for "scheduled" work to complete, which includes all phases
including cleanup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9327
Test Plan:
after this PR, even the above patch to coerce the bug cannot
cause the test to fail.
Reviewed By: riversand963
Differential Revision: D33252208
Pulled By: ajkr
fbshipit-source-id: 720a7eaca58c7247d221911fffe3d5e1dbf581e9
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980
Reviewed By: zhichao-cao
Differential Revision: D32990403
Pulled By: mrambacher
fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
Fix a bug that causes file temperature not preserved after DB is restarted, or options.max_manifest_file_size is hit.
Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242
Test Plan: Add a unit test that would fail without the fix.
Reviewed By: jay-zhuang
Differential Revision: D32818150
fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec
Summary:
Saw error like this:
`Backup failed -- IO error: No such file or directory: While opening a
file for sequentially reading:
/dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or
directory`
Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.)
relies on no file deletions happening while its operating, which
means not only disabling (more) deletions, but ensuring any pending
deletions are completed. Two fixes related to this:
* There was a gap in several places between decrementing
pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where
the db mutex would be released and GetSortedWalFiles (and others) could
get false information that no deletions are pending.
* The fix to https://github.com/facebook/rocksdb/issues/8591 (disabling deletions in GetSortedWalFiles) seems
incomplete because it doesn't prevent pending deletions from occuring
during the operation (if deletions not already disabled, the case that
was to be fixed by the change).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9208
Test Plan:
existing tests (it's hard to write a test for interleavings
that are now excluded - this is what stress test is for)
Reviewed By: ajkr
Differential Revision: D32630675
Pulled By: pdillinger
fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178
Summary:
`ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are
deprecated, please try using user defined timestamp feature instead.
The feature is used to support differential snapshots, but not well
maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which
returns an internal key from the iterator. The user defined timestamp
feature is a more flexible feature to support similar usecase, please
switch to that if you have such usecase.
The deprecated feature will be removed in a future release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091
Test Plan:
check LOG
Fix https://github.com/facebook/rocksdb/issues/9090
Reviewed By: ajkr
Differential Revision: D32071750
Pulled By: jay-zhuang
fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104
Summary:
After RocksDB 6.19 and before this PR, RocksDB FlushJob may pick more memtables to flush beyond synced WALs.
This can be problematic if there are multiple column families, since it can prematurely advance the flushed column
family's log_number. Should subsequent attempts fail to sync the latest WALs and the database goes
through a recovery, it may detect corrupted WAL number below the flushed column family's log number
and complain about column family inconsistency.
To fix, we record the maximum memtable ID of the column family being flushed. Then we call SyncClosedLogs()
so that all closed WALs at the time when memtable ID is recorded will be synced.
I also disabled a unit test temporarily due to reasons described in https://github.com/facebook/rocksdb/issues/9151
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9142
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D32299956
Pulled By: riversand963
fbshipit-source-id: 0da75888177d91905cf8c9d00605b73afb5970a7
Summary:
- Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect.
- Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179
Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency.
Reviewed By: riversand963
Differential Revision: D32488048
Pulled By: ajkr
fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162
Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.
Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.
Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.
Reviewed By: ltamasi
Differential Revision: D31567960
fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
Summary:
Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957
Rockdb has supported user-defined timestamp feature. Application can specify a timestamp
when writing each k-v pair. When data flush from memory to disk file called SST files, file
creation activity will commit to MANIFEST. This commit is for tracking timestamp info in the
MANIFEST for each file. The changes involved are as follows:
1) Track max/min timestamp in FileMetaData, and fix invoved codes.
2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMinTimestamp in
NewFileCustomTag ( in the kNewFile4 part ), and support invoved codes such as
VersionEdit Encode and Decode etc.
3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix invoved test codes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092
Reviewed By: ajkr, akankshamahajan15
Differential Revision: D32252323
Pulled By: riversand963
fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026
Reviewed By: ajkr
Differential Revision: D31668241
Pulled By: jay-zhuang
fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af
Summary:
This PR fix wrong ticker `WRITE_WITH_WAL`.
`RecordTick(WRITE_WITH_WAL)` will be called later in `WriteToWAL` and `ConcurrentWriteToWAL`.
Fixes:
1. Delete these two extra `RecordTick(WRITE_WITH_WAL)`
2. Fix corresponding test case
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064
Reviewed By: ajkr
Differential Revision: D31944459
Pulled By: riversand963
fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.
The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
Summary:
This PR does not change code sematics, it just changes for:
1. local obj `nonmem_w` and `lfile` are unused
2. null check for `delete ptr` is unnecessary
3. use `unique_ptr::reset` instead of `release` + `delete`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9052
Reviewed By: zhichao-cao
Differential Revision: D31801661
Pulled By: anand1976
fbshipit-source-id: 16a77d45da8c8833bf5bf3bce546bb3711b335df
Summary:
This PR has no semantic changes, just to make code shorter.
`stats_` has value same with `immutable_db_options_.stats`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9053
Reviewed By: zhichao-cao
Differential Revision: D31801603
Pulled By: anand1976
fbshipit-source-id: cbd8fe478d3e90ae078ace49b4f2eb9bb028ccf6
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).
Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990
Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D31582865
Pulled By: pdillinger
fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
Summary:
If `DB::Close()` is called in multi-thread env, the resource
could be double released, which causes exception or assert.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970
Test Plan:
Test with multi-thread benchmark, with each thread try to
close the DB at the end.
Reviewed By: pdillinger
Differential Revision: D31242042
Pulled By: jay-zhuang
fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.
This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.
Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.
Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)
Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968
Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).
Reviewed By: siying
Differential Revision: D31242045
Pulled By: pdillinger
fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
Summary:
The code in `IngestExternalFiles()` that bumps the DB's sequence number
depending on what seqnos were assigned to the files has 3 bugs:
1) There is an assertion that the sequence number is increased in all the
affected column families, but this is unnecessary, it is fine if some files can
stick to a lower sequence number. It is very easy to hit the assertion: it is
sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that
doesn't (for example the CF is empty). The line added in the
`IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail.
2) SetLastSequence() is called with the sum of all the bumps across CFs, but we
should take the maximum instead, as all CFs start with the current seqno and bump
it independently.
3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in
optimized builds, so some files may be assigned seqnos from the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005
Test Plan:
Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that
triggers the assertion, verified that the test (and all the others) pass after
the fix.
Reviewed By: ajkr
Differential Revision: D31597892
Pulled By: ot
fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770
Summary:
EndWriteStall has a data race: `queue_.empty()` is checked outside of the
mutex, so once we enter the critical section another thread may already have
cleared the list, and accessing the `front()` is undefined behavior (and causes
interesting crashes under high concurrency).
This PR fixes the bug, and also rewrites the logic to make it easier to reason
about it. It also fixes another subtle bug: if some writers are stalled and
`SetBufferSize(0)` is called, which disables the WBM, the writer are not
unblocked because of an early `enabled()` check in `EndWriteStall()`.
It doesn't significantly change the locking behavior, as before writers won't
lock unless entering a stall condition, and `FreeMem` almost always locks if
stalling is allowed, but that is inevitable with the current design. Liveness is
guaranteed by the fact that if some writes are blocked, eventually all writes
will be blocked due to `stall_active_`, and eventually all memory is freed.
While at it, do a couple of optimizations:
- In `WBMStallInterface::Signal()` signal the CV only after releasing the
lock. Signaling under the lock is a common pitfall, as it causes the woken-up
thread to immediately go back to sleep because the mutex is still locked by
the awaker.
- Move all allocations and deallocations outside of the lock.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9009
Test Plan:
```
USE_CLANG=1 make -j64 all check
```
Reviewed By: akankshamahajan15
Differential Revision: D31550668
Pulled By: ot
fbshipit-source-id: 5125387c3dc7ecaaa2b8bbc736e58c4156698580
Summary:
Add the file temperature to `IngestExternalFileArg` such that when SST files are ingested, user is able to assign the temperature to each SST file. If the temperature vector is empty or its size does not match the file name vector size, all ingested SST files will be assigned with `Temperature::unKnown`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8949
Test Plan: add the new test and make check
Reviewed By: siying
Differential Revision: D31127852
Pulled By: zhichao-cao
fbshipit-source-id: 141a81f0f7b473d88f4ab0cb2a21a114cbc6f83c
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991
Test Plan: the new test hangs forever without this fix and passes with this fix.
Reviewed By: hx235
Differential Revision: D31456419
Pulled By: ajkr
fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff
Summary:
If WAL dir is different from the DB dir, we should still honor the SstFileManager deletion rate limit for SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8967
Test Plan: Add a new unit test in db_sst_test
Reviewed By: pdillinger
Differential Revision: D31220116
Pulled By: anand1976
fbshipit-source-id: bcde8a53a7d728e15e597fb5d07ee86c1b38bd28
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
Summary:
Added support for SingleDelete for user-defined timestamps. Users can now Get and Iterate over keys deleted with SingleDelete. It also includes changes in CompactionIterator which preserves the same user key with different timestamps, unless the timestamp is below a certain threshold full_history_ts_low.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8921
Test Plan: Added new unit tests
Reviewed By: riversand963
Differential Revision: D31098191
Pulled By: akankshamahajan15
fbshipit-source-id: 78a59ef4b4884ae324fcd10f56e62a27d5ee2f49
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files.
2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
3. Test OnFile*Finish operations notifications with Blob Files.
4. Log the blob file creation/deletion events through EventLogger in Log file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675
Test Plan: Add new unit tests in listener_test
Reviewed By: ltamasi
Differential Revision: D30412613
Pulled By: akankshamahajan15
fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07
Summary:
In the past, we unnecessarily requires all keys in the same write batch
to be from column families whose timestamps' formats are the same for
simplicity. Specifically, we cannot use the same write batch to write to
two column families, one of which enables timestamp while the other
disables it.
The limitation is due to the member `timestamp_size_` that used to exist
in each `WriteBatch` object. We pass a timestamp_size to the constructor
of `WriteBatch`. Therefore, users can simply use the old
`WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while
the internal implementation of `WriteBatch` will take care of memory
allocation for timestamps.
The above is not necessary.
One the one hand, users can set up a memory buffer to store user key and
then contiguously append the timestamp to the user key. Then the user
can pass this buffer to the `WriteBatch::Put(Slice&)` API.
On the other hand, users can set up a SliceParts object which is an
array of Slices and let the last Slice to point to the memory buffer
storing timestamp. Then the user can pass the SliceParts object to the
`WriteBatch::Put(SliceParts&)` API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D30654499
Pulled By: riversand963
fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741
Test Plan: new unit test
Reviewed By: zhichao-cao
Differential Revision: D30708578
Pulled By: ajkr
fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449
Summary:
Old typedef syntax is confusing
Most but not all changes with
perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
make format
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751
Test Plan: existing
Reviewed By: zhichao-cao
Differential Revision: D30745277
Pulled By: pdillinger
fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
Test Plan: existing tests
Reviewed By: mrambacher
Differential Revision: D30700039
Pulled By: pdillinger
fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.
This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.
Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
* Lots of info from default Env like time, process id, and hostname.
* std::random_device
* port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.
DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.
GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.
Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708
Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D30563780
Pulled By: pdillinger
fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
Summary:
DumpStats() iterates through the ColumnFamilySet. There is a potential
race condition because it does Ref the cfd, and the cfd could get
destroyed during the iteration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D30580199
Pulled By: anand1976
fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850
Summary:
Pass BlobFileCompletionCallback in case of atomic flush and
compaction job which is currently nullptr(default parameter).
BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to
SstFileManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681
Test Plan: CircleCI jobs
Reviewed By: ltamasi
Differential Revision: D30445998
Pulled By: akankshamahajan15
fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021
Summary:
Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was called, which created a series of issues because the JobInfo was not correctly collected from the memtables.
This diff aims at correcting these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge` , no call to `DBImpl::NotifyOnFlushComplete` after successful mempurge).
Event listeners were added to the unit tests to handle these situations.
Surprisingly none of the crashtests caught this issue, I will try to add event listeners to crash tests in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672
Reviewed By: akankshamahajan15
Differential Revision: D30383109
Pulled By: bjlemaire
fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73
Summary:
Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even
the same file ingested into different DBs sharing a block cache.
Note: These new cache keys are currently only enabled when FileSystem
does not provide GetUniqueId. For now, they are typically larger,
so slightly less efficient.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669
Test Plan: Extended unit test
Reviewed By: zhichao-cao
Differential Revision: D30398532
Pulled By: pdillinger
fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
Reviewed By: pdillinger
Differential Revision: D30149315
Pulled By: bjlemaire
fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
Summary:
`CompareKeyContext::operator()` on the trunk has a bug: when comparing
column family IDs, `lhs` is used for both sides of the comparison. This
results in the `KeyContext`s getting sorted solely based on key, which
in turn means that keys with the same column family do not necessarily
form a single range in the sorted list. This violates an assumption of the
batched `MultiGet` logic, leading to the same column family
showing up multiple times in the list of `MultiGetColumnFamilyData`.
The end result is the code attempting to check out the thread-local
`SuperVersion` for the same CF multiple times, causing an
assertion violation in debug builds and memory corruption/crash in
release builds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D30169182
Pulled By: ltamasi
fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023
Summary:
We've been seeing occasional crashes on CI while inserting into the
vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes
don't reproduce locally (could be either a race or an object lifecycle
issue) but the good news is that the vectors in question are not really
used for anything meaningful by the test. (The assertion about the sizes
of the two vectors being equal is guaranteed to hold, since the two sync
points where they are populated are right after each other.) The patch
simply removes the vectors from the test, alongside the associated
callbacks and sync points.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D30118485
Pulled By: ltamasi
fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647
Summary:
PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure
`OnFlushCompleted()` is called after committing flush results to
MANIFEST. However, `flush_jobs_info_` is not updated in atomic
flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`.
This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D29913720
Pulled By: riversand963
fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
https://github.com/facebook/rocksdb/issues/8440 ,
https://github.com/facebook/rocksdb/issues/8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.
Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.
The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605
Test Plan: Ran `make check` and stress tests locally.
Reviewed By: pdillinger
Differential Revision: D30051035
Pulled By: ltamasi
fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
Summary:
Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.
After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).
Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582
Reviewed By: pdillinger, autopear
Differential Revision: D29881122
Pulled By: mrambacher
fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565
Test Plan: make check
Reviewed By: lth
Differential Revision: D29963984
Pulled By: riversand963
fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a
Summary:
Calling the GetImpl function could leave reference to a local
callback function in a field of a parameter struct. As this is
performance-critical code, I'm not going to attempt to sanitize this
code too much, but make the existing hack a bit cleaner by reverting
what it overwrites in the input struct.
Added SaveAndRestore utility class to make that easier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590
Test Plan:
added unit test for SaveAndRestore; existing tests for
GetImpl
Reviewed By: riversand963
Differential Revision: D29947983
Pulled By: pdillinger
fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb
Summary:
If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done.
Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591
Test Plan: Run all existing tests
Reviewed By: pdillinger
Differential Revision: D29969412
fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de
Summary:
The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge.
By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558
Test Plan: Already included in `db_flush_test.cc`.
Reviewed By: anand1976
Differential Revision: D29764351
Pulled By: bjlemaire
fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437
Summary:
If the primary's CURRENT file is missing or inaccessible, the secondary should not hang
trying repeatedly to switch to the next MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D27840627
Pulled By: riversand963
fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1
Summary:
Rare TSAN and valgrind failures are caused by unnecessary
reading of a field on the TaskLimiterToken::limiter_ for an assertion
after the token has been released and the limiter destroyed. To simplify
we can simply destroy the token before triggering DB shutdown
(potentially destroying the limiter). This makes the ReleaseOnce logic
unnecessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567
Test Plan: watch for more failures in CI
Reviewed By: ajkr
Differential Revision: D29811795
Pulled By: pdillinger
fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3
Summary:
Try avoid expensive updating options operation if
`SetDBOptions()` does not change any option value.
Skip updating is not guaranteed, for example, changing `bytes_per_sync`
to `0` may still trigger updating, as the value could be sanitized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518
Test Plan: added unittest
Reviewed By: riversand963
Differential Revision: D29672639
Pulled By: jay-zhuang
fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4
Summary:
Currently, the code shows that we delete memtables immedately after it is trimmed from history. Although it should never happen as the super version still holds the memtable, which is only switched after it, it feels a good practice not to do it, but use clean it up in the standard way: put it to WriteContext and clean it after DB mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8530
Test Plan: Run all existing tests.
Reviewed By: ajkr
Differential Revision: D29703410
fbshipit-source-id: 21d8068ac6377de4b6fa7a89697195742659fde4
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.
This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)
This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538
Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.
Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.
Before (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0)
rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0)
rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323
Notice the 5s P100 write time.
After (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0)
rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918
P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0)
rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832
No substantial difference.
Reviewed By: siying
Differential Revision: D29738847
Pulled By: pdillinger
fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
Summary:
In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables.
The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables.
This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528
Reviewed By: pdillinger
Differential Revision: D29701097
Pulled By: bjlemaire
fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9
Summary:
The removed function in this PR, just only have declared and dose not have any reference used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8508
Reviewed By: mrambacher
Differential Revision: D29649033
Pulled By: jay-zhuang
fbshipit-source-id: df98143b73d6c184a2a60c9f7ea2548a065ee35d
Summary:
When db is open as secondary, there are basically 2 step process:
1) Collect column families from wal log
2) Apply changes to Memtable
In case primary db is TransactionDB instance, wal log will contain some additional data, like noop, etc. ColumnFamilyCollector doesn't implement methods to handle these, so it fails to open a wal log written by TransactionDB. (Everything works fine with standard DB::Open).
Memtable recovery process knows how to handle such wal logs, so only missing piece seems to be ColumnFamilyCollector.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8456
Reviewed By: ajkr
Differential Revision: D29455945
Pulled By: mrambacher
fbshipit-source-id: 5b29560fcbc008e17e95d0dc4b07558f3d63e26f
Summary:
In ```DBImpl::WriteImpl()```, we call ```PreprocessWrite()``` which, among other things, checks the BG error and returns it set. This return status is later on passed to ```WriteStatusCheck()```, which calls ```SetBGError()```. This results in a spurious call, and info logs, on every user write request. We should avoid passing the ```PreprocessWrite()``` return status to ```WriteStatusCheck()```, as the former would have called ```SetBGError()``` already if it encountered any new errors, such as error when creating a new WAL file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8511
Test Plan: Run existing tests
Reviewed By: zhichao-cao
Differential Revision: D29639917
Pulled By: anand1976
fbshipit-source-id: 19234163969e1645dbeb273712aaf5cd9ea2b182
Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
Summary:
Previously, the following command:
```USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j$(nproc) analyze```
was raising an error/warning the new_mem could potentially be a `nullptr`. This error appeared due to code changes from https://github.com/facebook/rocksdb/issues/8454, including an if-statement containing "`... && new_mem != nullptr && ...`", which made the analyzer believe that past this `if`-statement, a `new_mem==nullptr` was a possible scenario.
This code patch simply introduces `assert`s and removes this condition in the `if`-statement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8492
Reviewed By: jay-zhuang
Differential Revision: D29571275
Pulled By: bjlemaire
fbshipit-source-id: 75d72246b70ebbbae7dea11ccb5778686d8bcbea
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
Added BlobMetaData to ColumnFamilyMetaData and LiveBlobMetaData and DB API GetLiveBlobMetaData to retrieve it.
First pass at struct. More tests and maybe fields to come...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8273
Reviewed By: ltamasi
Differential Revision: D29102400
Pulled By: mrambacher
fbshipit-source-id: 8a2383a4446328be6b91dced9841fdd3dfc80b73
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead.
In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412
Test Plan: make check, add new testing cases.
Reviewed By: anand1976
Differential Revision: D29151545
Pulled By: zhichao-cao
fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
Summary:
`DeleteFilesInRange()` marks deleting files to `being_compacted`
before deleting, which may cause ongoing compactions report corruption
exception or ASSERT for debug build.
Adding the missing `ComputeCompactionScore()` when `being_compacted` is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8434
Test Plan: Unittest
Reviewed By: ajkr
Differential Revision: D29276127
Pulled By: jay-zhuang
fbshipit-source-id: f5b223e3c1fc6d821e100e3f3442bc70c1d50cf7
Summary:
Tracing the MultiGet information including timestamp, keys, and CF_IDs to the trace file for analyzing and replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8421
Test Plan: make check, add test to trace_analyzer_test
Reviewed By: anand1976
Differential Revision: D29221195
Pulled By: zhichao-cao
fbshipit-source-id: 30c677d6c39ab31ef4bbdf7e0d1fa1fd79f295ff
Summary:
RocksDB logs a warning if WAL truncation on DB open fails. Its possible that on some file systems, truncation is not required and they would return ```Status::NotSupported()``` for ```ReopenWritableFile```. Don't log a warning in such cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8414
Reviewed By: akankshamahajan15
Differential Revision: D29181738
Pulled By: anand1976
fbshipit-source-id: 6e01e9117e1e4c1d67daa4dcee7fa59d06e057a7
Summary:
This is the next part of the ImmutableOptions cleanup. After changing the use of ImmutableCFOptions to ImmutableOptions, there were places in the code that had did something like "ImmutableOptions* immutable_cf_options", where "cf" referred to the "old" type.
This change simply renames the variables to match the current type. No new functionality is introduced.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8409
Reviewed By: pdillinger
Differential Revision: D29166248
Pulled By: mrambacher
fbshipit-source-id: 96de97f8e743f5c5160f02246e3ed8269556dc6f
Summary:
DBImpl::DumpStats is supposed to do this:
Dump DB stats to LOG
For each CF, dump CFStatsNoFileHistogram to LOG
For each CF, dump CFFileHistogram to LOG
Instead, due to a longstanding bug from 2017 (https://github.com/facebook/rocksdb/issues/2126), it would dump
CFStats, which includes both CFStatsNoFileHistogram and CFFileHistogram,
in both loops, resulting in near-duplicate output.
This fixes the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8380
Test Plan: Manual inspection of LOG after db_bench
Reviewed By: jay-zhuang
Differential Revision: D29017535
Pulled By: pdillinger
fbshipit-source-id: 3010604c4a629a80347f129cd746ce9b0d0cbda6
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
Test Plan: add the test to lru_cache_test
Reviewed By: pdillinger
Differential Revision: D29006215
Pulled By: zhichao-cao
fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
Summary:
Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure.
Tested via two tests added to db_tests2.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351
Reviewed By: ajkr
Differential Revision: D28808894
Pulled By: ddevec
fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
Summary:
I noticed ```openat``` system call with ```O_WRONLY``` flag and ```sync_file_range``` and ```truncate``` on WAL file when using ```rocksdb::DB::OpenForReadOnly``` by way of ```db_bench --readonly=true --benchmarks=readseq --use_existing_db=1 --num=1 ...```
Noticed in ```strace``` after seeing the last modification time of the WAL file change after each run (with ```--readonly=true```).
I think introduced by 7d7f14480e from https://github.com/facebook/rocksdb/pull/8122
I added a test to catch the WAL file being truncated and the modification time on it changing.
I am not sure if a mock filesystem with mock clock could be used to avoid having to sleep 1.1s.
The test could also check the set of files is the same and that the sizes are also unchanged.
Before:
```
[ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged
db/db_basic_test.cc:182: Failure
Expected equality of these values:
file_mtime_after_readonly_reopen
Which is: 1621611136
file_mtime_before_readonly_reopen
Which is: 1621611135
file is: 000010.log
[ FAILED ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms)
```
After:
```
[ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged
[ OK ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8313
Reviewed By: pdillinger
Differential Revision: D28656925
Pulled By: jay-zhuang
fbshipit-source-id: ea9e215cb53e7c830e76bc5fc75c45e21f12a1d6
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
Previously the shutdown process did not properly wait for all
`compaction_thread_limiter` tokens to be released before proceeding to
delete the DB's C++ objects. When this happened, we saw tests like
"DBCompactionTest.CompactionLimiter" flake with the following error:
```
virtual
rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl():
Assertion `outstanding_tasks_ == 0' failed.
```
There is a case where a token can still be alive even after the shutdown
process has waited for BG work to complete. In particular, this happens
because the shutdown process only waits for flush/compaction scheduled/unscheduled counters to all
reach zero. These counters are decremented in `BackgroundCallCompaction()`
functions. However, tokens are released in `BGWork*Compaction()` functions, which
actually wrap the `BackgroundCallCompaction()` function.
A simple sleep could repro the race condition:
```
$ diff --git a/db/db_impl/db_impl_compaction_flush.cc
b/db/db_impl/db_impl_compaction_flush.cc
index 806bc548a..ba59efa89 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2442,6 +2442,7 @@ void DBImpl::BGWorkCompaction(void* arg) {
static_cast<PrepickedCompaction*>(ca.prepicked_compaction);
static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction(
prepicked_compaction, Env::Priority::LOW);
+ sleep(1);
delete prepicked_compaction;
}
$ ./db_compaction_test --gtest_filter=DBCompactionTest.CompactionLimiter
db_compaction_test: util/concurrent_task_limiter_impl.cc:24: virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed.
Received signal 6 (Aborted)
#0 /usr/local/fbcode/platform007/lib/libc.so.6(gsignal+0xcf) [0x7f02673c30ff] ?? ??:0
https://github.com/facebook/rocksdb/issues/1 /usr/local/fbcode/platform007/lib/libc.so.6(abort+0x134) [0x7f02673ac934] ?? ??:0
...
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8253
Test Plan: sleeps to expose race conditions
Reviewed By: akankshamahajan15
Differential Revision: D28168064
Pulled By: ajkr
fbshipit-source-id: 9e5167c74398d323e7975980c5cc00f450631160
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.
Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.
Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).
Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240
Test Plan: ASAN make check
Reviewed By: mrambacher
Differential Revision: D28075891
Pulled By: pdillinger
fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.
readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).
There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176
Reviewed By: pdillinger
Differential Revision: D27954339
Pulled By: mrambacher
fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
Summary:
Add compaction API for secondary instance, which compact the files to a secondary DB path without installing to the LSM tree.
The API will be used to remote compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8171
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D27694545
Pulled By: jay-zhuang
fbshipit-source-id: 8ff3ec1bffdb2e1becee994918850c8902caf731
Summary:
In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue.
Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here.
Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207
Test Plan: added unit test, make crash_test
Reviewed By: riversand963
Differential Revision: D27898839
Pulled By: zhichao-cao
fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
Summary:
Add comment to DisableManualCompaction() which was missing.
Also explictly return from DBImpl::CompactRange() to avoid memtable flush when manual compaction is disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8186
Test Plan: Run existing unit tests.
Reviewed By: jay-zhuang
Differential Revision: D27744517
fbshipit-source-id: 449548a48905903b888dc9612bd17480f6596a71
Summary:
When WriteBufferManager is shared across DBs and column families
to maintain memory usage under a limit, OOMs have been observed when flush cannot
finish but writes continuously insert to memtables.
In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs tries to write,
this change will stall incoming writers until flush is completed and memory_usage
drops.
Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_
(memory_usage() >= buffer_size_) WriterBufferManager::ShouldStall() returns true.
DBImpl first block incoming/future writers by calling write_thread_.BeginWriteStall()
(which adds dummy stall object to the writer's queue).
Then DB is blocked on a state State::Blocked (current write doesn't go
through). WBStallInterface object maintained by every DB instance is added to the queue of
WriteBufferManager.
If multiple DBs tries to write during this stall, they will also be
blocked when check WriteBufferManager::ShouldStall() returns true.
End Stall condition: When flush is finished and memory usage goes down, stall will end only if memory
waiting to be flushed is less than buffer_size/2. This lower limit will give time for flush
to complete and avoid continous stalling if memory usage remains close to buffer_size.
WriterBufferManager::EndWriteStall() is called,
which removes all instances from its queue and signal them to continue.
Their state is changed to State::Running and they are unblocked. DBImpl
then signal all incoming writers of that DB to continue by calling
write_thread_.EndWriteStall() (which removes dummy stall object from the
queue).
DB instance creates WBMStallInterface which is an interface to block and
signal DBs during stall.
When DB needs to be blocked or signalled by WriteBufferManager,
state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898
Test Plan: Added a new test db/db_write_buffer_manager_test.cc
Reviewed By: anand1976
Differential Revision: D26093227
Pulled By: akankshamahajan15
fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
new MANIFEST.) Therefore, we keep the new MANIFEST.
- Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
- If process reopens the db immediately after the failure, then the CURRENT file can point
to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
succeed and ignore the other.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D27804648
Pulled By: riversand963
fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
Summary:
As the name of `DBImpl::WriteLevel0TableForRecovery` suggests, the resulting table file
should be placed on L0. However, the argument `level` passed to `BuildTable()` is -1.
We need to correct this since the level information will be useful to determine file placement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8187
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D27748570
Pulled By: riversand963
fbshipit-source-id: e1cd23128a8de31f14b1edc2ea92754c154e4f10
Summary:
Current flush reason attribution is misleading or incorrect (depending on what the original intention was):
- Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager`
- Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull`
This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150
Reviewed By: zhichao-cao
Differential Revision: D27569645
Pulled By: ot
fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc
Summary:
Fixing another crash test failure in the case of
write_dbid_to_manifest=true and reading a backup as read-only DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8164
Test Plan:
enhanced unit test for backup as read-only DB, ran
blackbox_crash_test more with elevated backup_one_in
Reviewed By: zhichao-cao
Differential Revision: D27622237
Pulled By: pdillinger
fbshipit-source-id: 680d0f99ddb465a601737f2e3f2c80efd47384fb
Summary:
A current limitation of backups is that you don't know the
exact database state of when the backup was taken. With this new
feature, you can at least inspect the backup's DB state without
restoring it by opening it as a read-only DB.
Rather than add something like OpenAsReadOnlyDB to the BackupEngine API,
which would inhibit opening stackable DB implementations read-only
(if/when their APIs support it), we instead provide a DB name and Env
that can be used to open as a read-only DB.
Possible follow-up work:
* Add a version of GetBackupInfo for a single backup.
* Let CreateNewBackup return the BackupID of the newly-created backup.
Implementation details:
Refactored ChrootFileSystem to split off new base class RemapFileSystem,
which allows more general remapping of files. We use this base class to
implement BackupEngineImpl::RemapSharedFileSystem.
To minimize API impact, I decided to just add these fields `name_for_open`
and `env_for_open` to those set by GetBackupInfo when
include_file_details=true. Creating the RemapSharedFileSystem adds a bit
to the memory consumption, perhaps unnecessarily in some cases, but this
has been mitigated by (a) only initialize the RemapSharedFileSystem
lazily when GetBackupInfo with include_file_details=true is called, and
(b) using the existing `shared_ptr<FileInfo>` objects to hold most of the
mapping data.
To enhance API safety, RemapSharedFileSystem is wrapped by new
ReadOnlyFileSystem which rejects any attempts to write. This uncovered a
couple of places in which DB::OpenForReadOnly would write to the
filesystem, so I fixed these. Added a release note because this affects
logging.
Additional minor refactoring in backupable_db.cc to support the new
functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142
Test Plan:
new test (run with ASAN and UBSAN), added to stress test and
ran it for a while with amplified backup_one_in
Reviewed By: ajkr
Differential Revision: D27535408
Pulled By: pdillinger
fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148
Summary:
Add request_id in IODebugContext which will be populated by
underlying FileSystem for IOTracing purposes. Update IOTracer to trace
request_id in the tracing records. Provided API
IODebugContext::SetRequestId which will set the request_id and enable
tracing for request_id. The API hides the implementation and underlying
file system needs to call this API directly.
Update DB::StartIOTrace API and remove redundant Env* from the
argument as its not used and DB already has Env that is passed down to
IOTracer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045
Test Plan: Update unit test.
Differential Revision: D26899871
Pulled By: akankshamahajan15
fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125
Test Plan: Added new test case which hangs without the fix.
Reviewed By: jay-zhuang
Differential Revision: D27390043
Pulled By: ajkr
fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
Summary:
Currently, we only truncate the latest alive WAL files when the DB is opened. If the latest WAL file is empty or was flushed during Open, its not truncated since the file will be deleted later on in the Open path. However, before deletion, a new WAL file is created, and if the process crash loops between the new WAL file creation and deletion of the old WAL file, the preallocated space will keep accumulating and eventually use up all disk space. To prevent this, always truncate the latest WAL file, even if its empty or the data was flushed.
Tests:
Add unit tests to db_wal_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8122
Reviewed By: riversand963
Differential Revision: D27366132
Pulled By: anand1976
fbshipit-source-id: f923cc03ef033ccb32b140d36c6a63a8152f0e8e
Summary:
There is bug in the current code base introduced in https://github.com/facebook/rocksdb/issues/8049 , we still set the SST file write IO Error only case as hard error. Fix it by removing the logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8107
Test Plan: make check, error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D27321422
Pulled By: zhichao-cao
fbshipit-source-id: c014afc1553ca66b655e3bbf9d0bf6eb417ccf94
Summary:
Previously it only applied to block-based tables generated by flush. This restriction
was undocumented and blocked a new use case. Now compression sampling
applies to all block-based tables we generate when it is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105
Test Plan: new unit test
Reviewed By: riversand963
Differential Revision: D27317275
Pulled By: ajkr
fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0
Summary:
As title. All core db implementations should stay in db_impl.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8082
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D27211442
Pulled By: riversand963
fbshipit-source-id: e0953fde75064740e899aaff7989ff033b7f5232
Summary:
In previous codebase, if WAL is used, all the retryable IO Error will be treated as hard error. So write is stalled. In this PR, the retryable IO error from WAL sync is separated from SST file flush io error. If WAL Sync is ok and retryable IO Error only happens during SST flush, the error is mapped to soft error. So user can continue insert to Memtable and append to WAL.
Resolve the bug that if WAL sync fails, the memtable status does not roll back due to calling PickMemtable early than calling and checking SyncClosedLog.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049
Test Plan: added new unit test, make check
Reviewed By: anand1976
Differential Revision: D26965529
Pulled By: zhichao-cao
fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410
Summary:
WriteController had a number of issues:
* It could introduce a delay of 1ms even if the write rate never exceeded the
configured delayed_write_rate.
* The DB-wide delayed_write_rate could be exceeded in a number of ways
with multiple column families:
* Wiping all pending delay "debts" when another column family joins
the delay with GetDelayToken().
* Resetting last_refill_time_ to (now + sleep amount) means each
column family can write with delayed_write_rate for large writes.
* Updating bytes_left_ for a partial refill without updating
last_refill_time_ would essentially give out random bonuses,
especially to medium-sized writes.
Now the code is much simpler, with these issues fixed. See comments in
the new code and new (replacement) tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064
Test Plan: new tests, better than old tests
Reviewed By: mrambacher
Differential Revision: D27064936
Pulled By: pdillinger
fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f
Summary:
Extend support to track blob files in SST File manager.
This PR notifies SstFileManager whenever a new blob file is created,
via OnAddFile and an obsolete blob file deleted via OnDeleteFile
and delete file via ScheduleFileDeletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037
Test Plan: Add new unit tests
Reviewed By: ltamasi
Differential Revision: D26891237
Pulled By: akankshamahajan15
fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
This PR
- adds a class `ManifestTailer` that inherits from `VersionEditHandlerPointInTime`. `ManifestTailer::Iterate()` can be called multiple times to tail the primary instance's MANIFEST and apply the changes to the secondary,
- updates the implementation of `ReactiveVersionSet::ReadAndApply` to use this class,
- removes unused code in version_set.cc,
- updates existing tests, e.g. removing deleted sync points from unit tests,
- adds a new test to address the bug in https://github.com/facebook/rocksdb/issues/7815.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7998
Test Plan:
make check
Existing and newly-added tests in version_set_test.cc and db_secondary_test.cc
Reviewed By: jay-zhuang
Differential Revision: D26926641
Pulled By: riversand963
fbshipit-source-id: 8d4dd15db0ba863c213f743e33b5a207e948c980
Summary:
## 1. Bug description:
When RocksDB Checkpoint, it may be stuck in `WaitUntilFlushWouldNotStallWrites` method.
## 2. Simple analysis of the reasons:
### 2.1 Configuration parameters:
```yaml
Compaction Style : Universal
max_write_buffer_number : 4
min_write_buffer_number_to_merge : 3
```
Checkpoint is usually very fast. When the Checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 Immutable MemTables, which are less than `min_write_buffer_number_to_merge`, they will not be flushed. But will enter this code.
```c++
// method: GetWriteStallConditionAndCause
if (mutable_cf_options.max_write_buffer_number> 3 &&
num_unflushed_memtables >=
mutable_cf_options.max_write_buffer_number-1) {
return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit};
}
```
code link: fbed72f03c/db/column_family.cc (L847)
Checkpoint thought there was a FlushJob, but it didn't. So will always wait.
### 2.2 solution:
Increase the restriction: the `number of Immutable MemTable` >= `min_write_buffer_number_to_merge will wait`.
If there are other better solutions, you can correct me.
### 2.3 Code that can reproduce the problem:
https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java
## 3. Interesting point
This bug will be triggered only when `the number of sorted runs >= level0_file_num_compaction_trigger`.
Because there is a break in WaitUntilFlushWouldNotStallWrites.
```c++
if (cfd->imm()->NumNotFlushed() <
cfd->ioptions()->min_write_buffer_number_to_merge &&
vstorage->l0_delay_trigger_count() <
mutable_cf_options.level0_file_num_compaction_trigger) {
break;
}
```
code link: fbed72f03c/db/db_impl/db_impl_compaction_flush.cc (L1974)
Universal may have `l0_delay_trigger_count() >= level0_file_num_compaction_trigger`, so this bug is triggered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7921
Reviewed By: jay-zhuang
Differential Revision: D26900559
Pulled By: ajkr
fbshipit-source-id: 133c1252dad7393753f04a47590b68c7d8e670df
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013
Test Plan: Ran `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D26742156
Pulled By: ltamasi
fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.
The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D26509280
Pulled By: riversand963
fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
Summary:
Extend VerifyFileChecksums API to verify blob files in case of
use_file_checksum.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7979
Test Plan: New unit test db_blob_corruption_test
Reviewed By: ltamasi
Differential Revision: D26534040
Pulled By: akankshamahajan15
fbshipit-source-id: 7dc5951a3df9d265ea1265e0122b43c966856ade
Summary:
The trace file record and payload encode is fixed, which requires complex backward compatibility resolving. This PR introduce a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatible issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977
Test Plan: make check. tested with old trace file in replay and analyzing.
Reviewed By: anand1976
Differential Revision: D26529948
Pulled By: zhichao-cao
fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
Summary:
TransactionDB uses read callback to filter out un-committed data before
a snapshot. But `MultiGet()` API doesn't use that at all, which causes
returning unwanted data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963
Test Plan: Added unittest to reproduce
Reviewed By: anand1976
Differential Revision: D26455851
Pulled By: jay-zhuang
fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
Summary:
Bug fix for status returned being overridden by Status::NotFound in
DBImpl::OpenForReadOnlyCheckExistence. This was casuing some service
owners to misinterpret the actual error and take appropriate steps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7972
Reviewed By: riversand963
Differential Revision: D26499598
Pulled By: akankshamahajan15
fbshipit-source-id: 05e9fedbe2a2e0e53135760f8ff578a2816d2b8e
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
Test Plan: add new unit test, pass make check/ make asan_check
Reviewed By: pdillinger
Differential Revision: D24313271
Pulled By: zhichao-cao
fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035
Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
- If the GIT branch is "clean" (no changes), the date of the last git commit
- If the branch is not clean, the current date
- Added APIs to access the "build information", rather than accessing the strings directly.
The build_version.cc file is now regenerated whenever the library objects are rebuilt.
Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866
Reviewed By: pdillinger
Differential Revision: D26086565
Pulled By: mrambacher
fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
Summary:
During recovery, RocksDB performs a kind of dummy flush; namely, entries
from the WAL are added to memtables, which then get written to SSTs and
blob files (if enabled) just like during a regular flush. Note that
multiple memtables might be flushed during recovery for the same column
family, for example, if the DB is reopened with a lower write buffer size,
and therefore, we need to make sure to collect all SST and blob file
additions. The patch fixes a bug in the earlier logic which resulted in
later blob file additions overwriting earlier ones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7903
Test Plan: Added a unit test and ran `db_stress`.
Reviewed By: jay-zhuang
Differential Revision: D26110847
Pulled By: ltamasi
fbshipit-source-id: eddb50a608a88f54f3cec3a423de8235aba951fd
Summary:
When retryable IO error occurs during compaction, it is mapped to soft error and set the BG error. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, When retryable IO error occurs during compaction, BG error is not set. User will be informed the error via EventHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899
Test Plan: tested with error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D26094097
Pulled By: zhichao-cao
fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
This PR does the following:
-> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.
With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system
- Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself.
With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction.
Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
Reviewed By: zhichao-cao
Differential Revision: D25762190
Pulled By: anand1976
fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
Summary:
The returned Status is ignored here as some stress tests are failing, presumably when attempting to add an empty file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7826
Reviewed By: jay-zhuang
Differential Revision: D25742931
fbshipit-source-id: a1fcd620d9472993a009929306dfc421f93eb43b
Summary:
In GenerateOneFileChecksum(), RocksDB reads the file and computes its checksum. A rate limiter can be passed to the constructor of RandomAccessFileReader so that read I/O can be rate limited.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7811
Test Plan: make check
Reviewed By: cheng-chang
Differential Revision: D25699896
Pulled By: zhichao-cao
fbshipit-source-id: e2688bc1126c543979a3bcf91dda784bd7b74164
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798
Reviewed By: jay-zhuang
Differential Revision: D25680451
Pulled By: pdillinger
fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917
Summary:
In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk.
For example, consider there are 3 column families: 0, 1, 2:
1. initially, all the column families' log number is 1;
2. write some data to cf0, and flush cf0, but the flush is pending;
3. now a new WAL 2 is created;
4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number);
5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk;
6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2;
7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3;
8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, and the `MinLogNumberToKeep` is 1, the second is for flushing cf0, and the `MinLogNumberToKeep` is 2. So without this PR, if the DB crashes at this point and try to recover, `WalSet` will still expect WAL 2 to exist.
When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption.
This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781
Test Plan:
watch existing tests and stress tests to pass.
`make -j48 blackbox_crash_test` on devserver
Reviewed By: ltamasi
Differential Revision: D25631695
Pulled By: cheng-chang
fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec
Summary:
When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treat as compaction failure.
This caused performance issue when outstanding compactions reached the limit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739
Reviewed By: cheng-chang
Differential Revision: D25508319
Pulled By: ltamasi
fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719
Summary:
The patch adds initial support for reading blobs to the batched `MultiGet` API.
The current implementation simply retrieves the blob values as the blob indexes
are encountered; that is, reads from blob files are currently not batched. (This
will be optimized in a separate phase.) In addition, the patch removes some dead
code related to BlobDB from the batched `MultiGet` implementation, namely the
`is_blob` / `is_blob_index` flags that are passed around in `DBImpl` and `MemTable` /
`MemTableListVersion`. These were never hooked up to anything and wouldn't
work anyways, since a single flag is not sufficient to communicate the "blobness"
of multiple key-values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7766
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D25479290
Pulled By: ltamasi
fbshipit-source-id: 7aba2d290e31876ee592bcf1adfd1018713a8000
Summary:
Currently, when a WAL becomes obsolete after flushing, if VersionSet::WalSet does not contain the WAL, we do not track the WAL obsoletion event in MANIFEST.
But consider this case:
* WAL 10 is synced, a VersionEdit is LogAndApplied to MANIFEST to log this WAL addition event, but the VersionEdit is not applied to WalSet yet since its corresponding ManifestWriter is still pending in the write queue;
* Since the above ManifestWriter is blocking, the LogAndApply will block on a conditional variable and release the db mutex, so another LogAndApply can proceed to enqueue other VersionEdits concurrently;
* Now flush happens, and WAL 10 becomes obsolete, although WalSet does not contain WAL 10 yet, we should call LogAndApply to enqueue a VersionEdit to indicate the obsoletion of WAL 10;
* otherwise, when the queued edit indicating WAL 10 addition is logged to MANIFEST, and DB crashes and reopens, the WAL 10 might have been removed from disk, but it still exists in MANIFEST.
This PR changes the behavior to: always `LogAndApply` any WAL addition or obsoletion event, without considering the order issues caused by concurrency, but when applying the edits to `WalSet`, do not add the WALs if they are already obsolete. In this approach, the logical events of WAL addition and obsoletion are always tracked in MANIFEST, so we can inspect the MANIFEST and know all the previous WAL events, but we choose to ignore certain events due to the concurrency issues such as the case above, or the case in https://github.com/facebook/rocksdb/pull/7725.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7759
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D25423089
Pulled By: cheng-chang
fbshipit-source-id: 9cb9a7fbc1875bf954f2a42f9b6cfd6d49a7b21c
Summary:
Consider the case:
1. All column families are flushed, so all WALs become obsolete, but no WAL is removed from disk yet because the removal is asynchronous, a VersionEdit is written to MANIFEST indicating that WALs before a certain WAL number are obsolete, let's say this number is 3;
2. `SyncWAL` is called, so all the on-disk WALs are synced, and if track_and_verify_wal_in_manifest=true, the WALs will be tracked in MANIFEST, let's say the WAL numbers are 1 and 2;
3. DB crashes;
4. During DB recovery, when replaying MANIFEST, we first see that WAL with number < 3 are obsolete, then we see that WAL 1 and 2 are synced, so according to current implementation of `WalSet`, the `WalSet` will be recovered to include WAL 1 and 2;
5. WAL 1 and 2 are asynchronously deleted from disk, then the WAL verification algorithm fails with `Corruption: missing WAL`.
The above case is reproduced in a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal`.
The fix is to maintain the upper bound of the obsolete WAL numbers, any WAL with number less than the maintained number is considered to be obsolete, so shouldn't be tracked even if they are later synced. The number is maintained in `WalSet`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7725
Test Plan:
1. a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal` is added.
2. run `make crash_test` on devserver.
Reviewed By: riversand963
Differential Revision: D25238914
Pulled By: cheng-chang
fbshipit-source-id: f5dccd57c3d89f19565ec5731f2d42f06d272b72
Summary:
This change eliminates the need for a lot of the PermitUncheckedError calls on return from ErrorHandler methods. The calls are no longer needed as the status is returned as a reference rather than a copy. Additionally, this means that the originating status (recovery_error_, bg_error_) is not cleared implicitly as a result of calling one of these methods.
For this class, I do not know if the proper behavior should be to call PermitUncheckedError in the destructor or if the checked state should be cleared when the status is cleared. I did tests both ways. Without the code in the destructor, the status will need to be cleared in at least some of the places where it is set to OK. When running tests, I found no instances where this class was destructed with a non-OK, non-checked Status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7539
Reviewed By: anand1976
Differential Revision: D25340565
Pulled By: pdillinger
fbshipit-source-id: 1730c035c81a475875ea745226112030ec25136c
Summary:
Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`.
`ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`.
`ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only
increase its value. Consider the following case:
>
> The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered
> and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards,
> and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result
> is hard to reason. For simplicity, we just do now allow decreasing full_history_ts_low for now.
>
During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that
version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due
to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the
column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D25296217
Pulled By: riversand963
fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)
In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)
TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D25256293
Pulled By: ltamasi
fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
Summary:
In current code base, in FlushMemtable, when `(Flush_reason == FlushReason::kErrorRecoveryRetryFlush && (!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load()))`, we assert that cfd->imm()->NumNotFlushed() > 0. However, there are some corner cases that can fail this assert: 1) if there are multiple CFs, some CF has immutable memtable, some CFs don't. In ResumeImpl, all CFs will call FlushMemtable, which will hit the assert. 2) Regular flush is scheduled and running, the resume thread is waiting. New KVs are inserted and SchedulePendingFlush is called. Regular flush will continue call MaybeScheduleFlushAndCompaction until all the immutable memtables are flushed. When regular flush ends and auto resume thread starts to schedule new flushes, cfd->imm()->NumNotFlushed() can be 0.
Remove the assert and added the comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7744
Test Plan: make check and pass the stress test
Reviewed By: riversand963
Differential Revision: D25340573
Pulled By: zhichao-cao
fbshipit-source-id: eac357bdace660247c197f01a9ff6857e3c97672
Summary:
When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570
Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds.
Reviewed By: riversand963
Differential Revision: D24394222
Pulled By: cheng-chang
fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116
Summary:
In the current code base, all the manifest writes with IO error will be set with reason: BackgroundErrorReason::kManifestWrite, which will be mapped to the kHardError if the IO Error is retryable. However, if the system does not use the WAL, all the retryable IO error should be mapped to kSoftError. Create this PR to handle is special case by adding kManifestWriteNoWAL to BackgroundErrorReason.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7693
Test Plan: make check, add new testing cases to error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D25066204
Pulled By: zhichao-cao
fbshipit-source-id: d59553896c2eac3fb37c05238544d2b265379462
Summary:
Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D25015421
Pulled By: jay-zhuang
fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9
Summary:
https://github.com/facebook/rocksdb/issues/7340 reports and reproduces an assertion failure caused by a combination of the following:
- atomic flush is disabled.
- a column family can appear multiple times in the flush queue at the same time. This behavior was introduced in release 5.17.
Consequently, it is possible that two flushes race with each other. One bg flush thread flushes all memtables. The other thread calls `FlushMemTableToOutputFile()` afterwards, and hits the assertion error below.
```
assert(cfd->imm()->NumNotFlushed() != 0);
assert(cfd->imm()->IsFlushPending());
```
Fix this by reverting the behavior. In non-atomic-flush case, a column family can appear in the flush queue at most once at the same time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7362
Test Plan:
make check
Also run stress test successfully for 10 times.
```
make crash_test
```
Reviewed By: ajkr
Differential Revision: D25172996
Pulled By: riversand963
fbshipit-source-id: f1559b6366cc609e961e3fc83fae548f1fad08ce
Summary:
An empty WAL won't be backed up by the BackupEngine. So if we track the empty WALs in MANIFEST, then when restoring from a backup, it may report corruption that the empty WAL is missing, which is correct because the WAL is actually in the main DB but not in the backup DB, but missing an empty WAL does not logically break DB consistency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7697
Test Plan: watch existing tests to pass
Reviewed By: pdillinger
Differential Revision: D25077194
Pulled By: cheng-chang
fbshipit-source-id: 01917b57234b92b6063925f2ee9452c5732bdc03
Summary:
The logic for computing min_log_number_to_keep in atomic flush was incorrect.
For example, when all column families are flushed, the min_log_number_to_keep should be the latest new log. But the incorrect logic calls `PrecomputeMinLogNumberToKeepNon2PC` for each column family, and returns the minimum of them. However, `PrecomputeMinLogNumberToKeepNon2PC(cf)` assumes column families other than `cf` are flushed, but in case all column families are flushed, this assumption is incorrect.
Without this fix, the WAL referenced by the computed min_log_number_to_keep may actually contain no unflushed data, so the WAL might have actually been deleted from disk on recovery, then an incorrect error `Corruption: missing WAL` will be reported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7660
Test Plan:
run `make crash_test_with_atomic_flush` on devserver
added a unit test in `db_flush_test`
Reviewed By: riversand963
Differential Revision: D24906265
Pulled By: cheng-chang
fbshipit-source-id: 08deda62e71f67f59e3b7925cdd86dd09bd4f430
Summary:
https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according
to a lower bound (user-defined) timestamp `full_history_ts_low_`.
This PR adds a data member `full_history_ts_low_` of type `std::string` to `FlushJob`, and
`full_history_ts_low_` does not change during flush. `FlushJob` will pass a pointer to this data member
to the `CompactionIterator` used during flush.
Also refactored flush_job_test.cc to re-use some existing code, which is actually the majority of this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7655
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D24933340
Pulled By: riversand963
fbshipit-source-id: 2e584bfd0cf6e5c295ab1af264e68e9d6a12fca3
Summary:
After replaying the WALs, the memtables are flushed synchronously to L0 instead of being flushed in background. Currently, we only track WAL obsoletion events in the code path of background flush jobs. This PR tracks these events in RecoverLogFiles.
After this change, we can enable `track_and_verify_wal_in_manifest` in `db_stress`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7649
Test Plan: `python tools/db_crashtest.py whitebox`
Reviewed By: riversand963
Differential Revision: D24824501
Pulled By: cheng-chang
fbshipit-source-id: 207129f7b845c50b333680ce6818a68a2fad54b9
Summary:
Consider the following sequence of events:
1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.
It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.
After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621
Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.
Reviewed By: riversand963
Differential Revision: D24668144
Pulled By: cheng-chang
fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
Summary:
When a WAL is synced, an edit is written to MANIFEST.
After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601
Test Plan:
`track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
Watch all tests to pass.
Reviewed By: ltamasi
Differential Revision: D24553957
Pulled By: cheng-chang
fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
Summary:
Existing API `VerifyChecksum()` allows application to verify sst files' block checksums.
Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new
API to verify sst files' file checksums.
```
// Compute table file checksums if applicable and compare with MANIFEST.
// Returns OK if no file has mismatching whole-file checksum.
Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D24436783
Pulled By: riversand963
fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497
When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`
Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.
Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test
Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515
Reviewed By: ajkr
Differential Revision: D24240264
Pulled By: ramvadiv
fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
Summary:
Similarly to how https://github.com/facebook/rocksdb/issues/7345
integrated blob file writing into the flush process,
the patch adds support for writing blob files to the compaction logic.
Namely, if `enable_blob_files` is set, large values encountered during
compaction are extracted to blob files and replaced with blob indexes.
The resulting blob files are then logged to the MANIFEST as part of the
compaction job's `VersionEdit` and added to the `Version` alongside any
table files written by the compaction. Any errors during blob file building fail
the compaction job.
There will be a separate follow-up patch to perform blob garbage collection
during compactions.
In addition, the patch continues to chip away at the mess around computing
various compaction related statistics by eliminating some code duplication
and by making the `num_output_files` and `bytes_written` stats more consistent
for flushes, compactions, and recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7573
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24404696
Pulled By: ltamasi
fbshipit-source-id: 21216af3a172ad3ce8f85d11cd30923784ae426c
Summary:
This PR makes it able to `LogAndApply` `VersionEdit`s related to WALs, and also be able to `Recover` from MANIFEST with WAL related `VersionEdit`s.
The `VersionEdit`s related to WAL are treated similarly as those related to column family operations, they are not applied to versions, but can be in a commit group. Mixing WAL related `VersionEdit`s with other types of edits will make logic in `ProcessManifestWrite` more complicated, so `VersionEdit`s related to WAL can either be WAL additions or deletions, like column family add and drop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7256
Test Plan: a set of unit tests are added in `version_set_test.cc`
Reviewed By: riversand963
Differential Revision: D23123238
Pulled By: cheng-chang
fbshipit-source-id: 246be2ed4744fd03fa2738aba408aaa611d0379c
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24485420
Pulled By: zhichao-cao
fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
Summary:
Remove function calling in assert statement as assert is a no
op in opt build and that function might not be called. This causes hang
in closing RocksDB when refit level is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D24466420
Pulled By: akankshamahajan15
fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist. Added tests for the LoadOptions related methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554
Reviewed By: akankshamahajan15
Differential Revision: D24298985
Pulled By: zhichao-cao
fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
Summary:
Added unit tests that have paranoid_check = true that perform range deletions. At the moment, the deleted ranges do not appear to be checked as part of the paranoid checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7521
Reviewed By: zhichao-cao
Differential Revision: D24262175
Pulled By: ajkr
fbshipit-source-id: 1035e968f7ab8ccaa7af086b835a4e72c7e56743
Summary:
If `BottommostLevelCompaction.kForce*` is set, compaction should avoid
trivial move and always compact the sst to the target size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368
Reviewed By: ajkr
Differential Revision: D23629525
Pulled By: jay-zhuang
fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0
Summary:
`BeginWriteStall()` removes no_slowdown write from the write
list and updates `link_newer`, which makes `CreateMissingNewerLinks()`
thought all write list has valid `link_newer` and failed to create link
for all writers.
It caused flaky test and SegFault for release build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7508
Test Plan: Add unittest to reproduce the issue.
Reviewed By: anand1976
Differential Revision: D24126601
Pulled By: jay-zhuang
fbshipit-source-id: f8ac5dba653f7ee1b0950296427d4f5f8ee34a06
Summary:
This PR schedules a background thread (shared across all DB instances)
to flush info log every ten seconds. This improves debuggability in case
of RocksDB hanging since it ensures the log messages leading up to the hang
will eventually become visible in the log.
The bulk of this PR is moving monitoring/stats_dump_scheduler* to db/periodic_work_scheduler*
and making the corresponding name changes since now the scheduler handles info
log flushing, not just stats dumping.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7488
Reviewed By: riversand963
Differential Revision: D24065165
Pulled By: ajkr
fbshipit-source-id: 339c47a0ff43b79fdbd055fbd9fefbb6f9d8d3b5
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452
Test Plan: See CI tests pass.
Reviewed By: zhichao-cao
Differential Revision: D23979764
fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
Summary:
Possible fix for a TSAN issue reported in EnableFileDeletions.
disable_delete_obsolete_files_ should only be accessed holding the db
mutex, but for logging it was being accessed outside holding the mutex,
now fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7435
Test Plan: existing tests, watch for recurrence
Reviewed By: ltamasi
Differential Revision: D23917578
Pulled By: pdillinger
fbshipit-source-id: 8573025bca3f6fe169b24b87bbfc4ce9667b0482
Summary:
Fix few test cases and add them in ASSERT_STATUS_CHECKED build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7427
Test Plan:
1. ASSERT_STATUS_CHECKED=1 make -j48 check,
2. travis build for ASSERT_STATUS_CHECKED,
3. Without ASSERT_STATUS_CHECKED: make check -j64, CircleCI build and travis build
Reviewed By: pdillinger
Differential Revision: D23909983
Pulled By: akankshamahajan15
fbshipit-source-id: 42d7e4aea972acb9fcddb7ca73fcb82f93272434
Summary:
Implement a parsing tool io_tracer_parser that takes IO trace file (binary file) with command line argument --io_trace_file and output file with --output_file and dumps the IO trace records in outputfile in human readable form.
Also added unit test cases that generates IO trace records and calls io_tracer_parse to parse those records.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7333
Test Plan:
make check -j64,
Add unit test cases.
Reviewed By: anand1976
Differential Revision: D23772360
Pulled By: akankshamahajan15
fbshipit-source-id: 9c20519c189362e6663352d08863326f3e496271
Summary:
Current implementation holds db mutex while calling
`GetAggregatedIntProperty()`. For property kEstimateTableReadersMem,
this can be expensive, especially if the number of table readers is
high.
We can release and re-acquire db mutex if
property_info.need_out_of_mutex is true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7412
Test Plan:
make check
COMPILE_WITH_ASAN=1 make check
COMPILE_WITH_TSAN=1 make check
Also test internally on a shadow host. Used bpf to verify the
excessively long db mutex holding no longer exists when applications
call GetApproximateMemoryUsageByType().
Reviewed By: jay-zhuang
Differential Revision: D23794824
Pulled By: riversand963
fbshipit-source-id: 6bc02a59fd25613d343a62cf817467c7122c9721
Summary:
In the current implementation, any retryable IO error happens during Flush is mapped to a hard error. In this case, DB is stopped and write is stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, the retryable IO error during FLush is mapped to a soft error. Such that, the memtable can continue receive the writes. At the same time, if auto resume is triggered, SwtichMemtable will not be called during Flush when resuming the DB to avoid to many small memtables. Testing cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310
Test Plan: adding new unit test, pass make check.
Reviewed By: anand1976
Differential Revision: D23710892
Pulled By: zhichao-cao
fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.
This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.
We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).
We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.
Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`
Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.
Actual file name patterns and priorities, as regexes:
kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
[0-9]+_[0-9]+_[0-9]+[.]sst
kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
[0-9]+_[0-9a-fA-F-]+[.]sst
kUseDbSessionId AND NOT kFlagIncludeFileSize ->
[0-9]+_s[0-9A-Z]{20}[.]sst
kUseDbSessionId AND kFlagIncludeFileSize (default) ->
[0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst
We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.
This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400
Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.
Reviewed By: ajkr
Differential Revision: D23759587
Pulled By: pdillinger
fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
Summary:
The patch adds support for extracting large values into blob files when
performing a flush during recovery (when `avoid_flush_during_recovery` is
`false`). Blob files are built and added to the `Version` similarly to flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7388
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23709912
Pulled By: ltamasi
fbshipit-source-id: ce48b4227849cf25429ae98574e72b0e1cb9c67d
Summary:
The patch adds support for writing blob files during flush by integrating
`BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and
`CompactionIterator`. If `enable_blob_files` is set, large values are extracted
to blob files and replaced with references. The resulting blob files are then
logged to the MANIFEST as part of the flush job's `VersionEdit` and
added to the `Version`, similarly to table files. Errors related to writing
blob files fail the flush, and any blob files written by such jobs are immediately
deleted (again, similarly to how SST files are handled). In addition, the patch
extends the logging and statistics around flushes to account for the presence
of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is
used for calculating write amplification, now considers the blob files as well).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345
Test Plan: Tested using `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D23506369
Pulled By: ltamasi
fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
In a distributed file system, directory ownership is enforced by fencing
off the previous owner once they've been preempted by a new owner. This
PR adds a IOStatus subcode for ```StatusCode::IOError``` to indicate this.
Once this error is returned for a file write, the DB is put in read-only
mode and not allowed to resume in read-write mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374
Test Plan: Add new unit tests in ```error_handler_fs_test```
Reviewed By: riversand963
Differential Revision: D23687777
Pulled By: anand1976
fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f
Summary:
Replace FSWritableFile pointer with FSWritableFilePtr
object in WritableFileWriter.
This new object wraps FSWritableFile pointer.
Objective: If tracing is enabled, FSWritableFile Ptr returns
FSWritableFileTracingWrapper pointer that includes all necessary
information in IORecord and calls underlying FileSystem and invokes
IOTracer to dump that record in a binary file. If tracing is disabled
then, underlying FileSystem pointer is returned directly.
FSWritableFilePtr wrapper class is added to bypass the
FSWritableFileWrapper when
tracing is disabled.
Test Plan: make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193
Reviewed By: anand1976
Differential Revision: D23355915
Pulled By: akankshamahajan15
fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
Summary:
More tests now pass. When in doubt, I added a TODO comment to check what should happen with an ignored error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305
Reviewed By: akankshamahajan15
Differential Revision: D23301262
Pulled By: ajkr
fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6
Summary:
When a memtable is trimmed in MemTableListVersion, the memtable
is only added to delete list if it is
the last reference. However it is not the last reference as it is held
by the super version. But the super version would not be switched if the
delete list is empty. So the memtable is never destroyed and memory
usage increases beyond write_buffer_size +
max_write_buffer_size_to_maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296
Test Plan:
1. ./db_bench -benchmarks=randomtransaction
-optimistic_transaction_db=1 -statistics -stats_interval_seconds=1
-duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000
--transaction_set_snapshot
Reviewed By: ltamasi
Differential Revision: D23267395
Pulled By: akankshamahajan15
fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check.
When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283
Reviewed By: akankshamahajan15
Differential Revision: D23251497
Pulled By: ajkr
fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
Summary:
This diff contains following changes:
1. Replace `FSSequentialFile` pointer with `FSSequentialFilePtr` object that wraps `FSSequentialFile` pointer in `SequenceFileReader`.
Objective: If tracing is enabled, `FSSequentialFilePtr` returns `FSSequentialFileTracingWrapper` pointer that includes all necessary information in `IORecord` and calls underlying FileSystem and invokes `IOTracer` to dump that record in a binary file. If tracing is disabled then, underlying `FileSystem` pointer is returned directly. `FSSequentialFilePtr` wrapper class is added to bypass the `FSSequentialFileTracingWrapper` when tracing is disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7190
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
Reviewed By: anand1976
Differential Revision: D23059616
Pulled By: akankshamahajan15
fbshipit-source-id: 1564b94dd1297cd0fbfe2ed5c9cc3e20f7395301
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274
Test Plan: Run all existing files.
Reviewed By: pdillinger
Differential Revision: D23177444
fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271
Reviewed By: riversand963
Differential Revision: D23169923
Pulled By: ajkr
fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261
Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.
Reviewed By: ajkr
Differential Revision: D23142269
fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
Summary:
Looks like somebody simply missed initializing a member variable. The column family ID, cf_id, is not set during OnCompactionBegin. But it is set properly in the next function for OnCompactionCompleted. Need this cf_id for tracking progress of a Stardog optimize since there may be multiple compactions required for a given column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6938
Reviewed By: siying
Differential Revision: D23153235
Pulled By: ajkr
fbshipit-source-id: 932938de3a4ebbc7ac89702f655583862587d251
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223
Reviewed By: riversand963
Differential Revision: D23056737
Pulled By: jay-zhuang
fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
Summary:
If user-defined timestamp is enabled, current implementation can expose
newer data to queries even if an older sequence number is specified via
read_options.snapshot. This PR makes Get() respect sequence-number-based
snapshot.
Solution is simple. Besides using <ukey, ts, seq> to search the index for the key,
we also verify that the candidate result's seq is smaller than or equal to seq. This
requires passing a seq via `GetContext`, which results in the majority of code
change caused by this PR.
Also added a few unit tests to demonstrate standard visibility during point lookup
and range scan when timestamp and snapshot are both present.
Test plan (devserver):
```
make check
$./db_bench --benchmarks=fillseq,readrandom -cache_size=$[64*1024*1024]
```
Result
this PR: readrandom : 4.827 micros/op 207180 ops/sec; 22.9 MB/s (1000000 of 1000000 found)
master: readrandom : 4.936 micros/op 202610 ops/sec; 22.4 MB/s (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7227
Reviewed By: ltamasi
Differential Revision: D23015242
Pulled By: riversand963
fbshipit-source-id: ea7b85a728654553ba357d2e6a207b5e40f7376a
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.
This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.
Fixes https://github.com/facebook/rocksdb/issues/6432.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250
Test Plan:
crash test command that repro'd the bug reliably:
```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```
Reviewed By: ltamasi
Differential Revision: D23090800
Pulled By: ajkr
fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
Summary:
As part of the IOTracing project, this PR
1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
2. FileSystemPtr object is created using FileSystem pointer and IOTracer
pointer.
3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
returns FileSystemTracingWrapper pointer for tracing purpose and when
it is disabled underlying FileSystem pointer is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
Reviewed By: anand1976
Differential Revision: D22987117
Pulled By: akankshamahajan15
fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
Summary:
Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for
Iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161
Test Plan: New unit tests in db_basic_test
Reviewed By: riversand963
Differential Revision: D22687352
Pulled By: anand1976
fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec
Summary:
Pointed out by https://github.com/facebook/rocksdb/issues/7197 , there is a double lock in WriteImplWALOnly.
Also find another deadlock in UnorderedWriteMemtable. Move the check after switch_all_.notify_all().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7199
Test Plan: pass make check
Reviewed By: anand1976
Differential Revision: D22961714
Pulled By: zhichao-cao
fbshipit-source-id: 0707922dc50d28ea141a15a8cdcbd1c8993ea0d8
Summary:
A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2),
it's not needed to append a colon explicitly to 'msg'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041
Reviewed By: ajkr
Differential Revision: D22292801
fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec
Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.
This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210
Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.
Reviewed By: anand1976
Differential Revision: D22861323
Pulled By: ajkr
fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.
One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176
Reviewed By: cheng-chang
Differential Revision: D22799278
Pulled By: ajkr
fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
Summary:
Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145
Reviewed By: riversand963
Differential Revision: D22593031
Pulled By: ajkr
fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
Reviewed By: anand1976
Differential Revision: D21916789
Pulled By: zhichao-cao
fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
Summary:
There currently exist multiple `GetChildren()` calls in `DBImpl::Recover()`, which can be expensive in cases of distributed file systems.
This pull request try to call `DBImpl::Recover()` of each necessary directory only _once_ and reuse the results in the places of repeated calls in current code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7044
Test Plan:
Run `make check` and use the default test suite. The modified code should be semantically identical to the current code. As a proof of this solution, we may optionally deploy the system onto a (real or simulated) distributed system and expect reduced latency caused by manifest fetching.
(WIP)
Reviewed By: riversand963
Differential Revision: D22419925
Pulled By: roghnin
fbshipit-source-id: d3774fbfbc246c5527101bc16747eb5c90919886
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env
These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies. By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.
Tested both release and debug builds via Make and CMake for both static and shared libraries.
More work remains to clean up how the tools are built and remove some unnecessary dependencies. There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097
Reviewed By: riversand963
Differential Revision: D22463160
Pulled By: pdillinger
fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
Summary:
When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory.
After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process.
No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function.
In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files.
Corresponding tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015
Test Plan: Passed make check
Reviewed By: pdillinger
Differential Revision: D22165732
Pulled By: gg814
fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049
Test Plan: Run all existing files.
Reviewed By: pdillinger
Differential Revision: D22301700
fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565
Summary:
WriteCallbackTest.WriteWithCallbackTest has a deep for-loop and in some cases runs very long. Parameterimized it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7037
Test Plan: Run the test and see it passes.
Reviewed By: ltamasi
Differential Revision: D22269259
fbshipit-source-id: a1b6687b5bf4609754833d14cf383d68bc7ab27a
Summary:
Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.
The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.
Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982
Test Plan: Add new unit tests in db_basic_test
Reviewed By: riversand963
Differential Revision: D22219515
Pulled By: anand1976
fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
Summary:
After https://github.com/facebook/rocksdb/issues/6949 , VersionSet::io_status_ can be concurrently accessed by multiple
threads without lock, causing tsan test to fail. For example, a bg flush thread
resets io_status_ before calling LogAndApply(), while another thread already in
the process of LogAndApply() reads io_status_. This is a bug.
We do not have to reset io_status_ each time we call LogAndApply(). io_status_
is part of the state of VersionSet, and it indicates the outcome of preceding
MANIFEST/CURRENT files IO operations. Its value should be updated only when:
1. MANIFEST/CURRENT files IO fail for the first time.
2. MANIFEST/CURRENT files IO succeed as part of recovering from a prior
failure without process restart, e.g. calling Resume().
Test Plan (devserver):
COMPILE_WITH_TSAN=1 make check
COMPILE_WITH_TSAN=1 make db_test2
./db_test2 --gtest_filter=DBTest2.CompactionStall
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034
Reviewed By: zhichao-cao
Differential Revision: D22247137
Pulled By: riversand963
fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216
Summary:
`BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`.
Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.
Three new test cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997
Test Plan: Passed make check.
Reviewed By: ajkr
Differential Revision: D22098895
Pulled By: gg814
fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.
Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
Reviewed By: anand1976
Differential Revision: D22026020
Pulled By: riversand963
fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
Summary:
The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong.
Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973
Test Plan: Will run stress tests for a while to make sure there is no general regression.
Reviewed By: ajkr
Differential Revision: D22029348
fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1
Summary:
`db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.
The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.
In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.
A table property test is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983
Test Plan: make check and some manual tests.
Reviewed By: zhichao-cao
Differential Revision: D22048826
Pulled By: gg814
fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
Summary:
Best-efforts recovery does not check the content of CURRENT file to determine which MANIFEST to recover from. However, it still checks the presence of CURRENT file to determine whether to create a new DB during `open()`. Therefore, we can tweak the logic in `open()` a little bit so that best-efforts recovery does not rely on CURRENT file at all.
Test plan (dev server):
make check
./db_basic_test --gtest_filter=DBBasicTest.RecoverWithNoCurrentFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6970
Reviewed By: anand1976
Differential Revision: D22013990
Pulled By: riversand963
fbshipit-source-id: db552a1868c60ed70e1f7cd252a3a076eb8ea58f
Summary:
Added DB::GetDbSessionId by using the same format and machinery as DB::GetDbIdentity.
The DB Session ID is generated (and therefore, updated) each time a DB object is opened. It is written to the LOG file right after the line of “DB SUMMARY”.
A test for the uniqueness, for different openings and during the same opening, is also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6959
Test Plan: Passed make check
Reviewed By: zhichao-cao
Differential Revision: D21951721
Pulled By: gg814
fbshipit-source-id: 958a48a612db49a39998ea703cded45987d3fa8b
Summary:
If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying WAL once hitting an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last sync'ed offset.
Ideally, RocksDB point-in-time recovery should permit recovery if the error occurs after last synced offset while fail recovery if error occurs before the last synced offset. However, RocksDB does not track the synced offset of WALs. Consequently, RocksDB does not know whether an error occurs before or after the last synced offset. An error can be one of the following.
- WAL record checksum mismatch. This can result from both corruption of synced data and dropping of unsynced data during shutdown. We cannot be sure which one. In order not to defeat the original motivation to permit the latter case, we keep the original behavior of point-in-time WAL recovery.
- IOError. This means the WAL can be bad, an indicator of whole file becoming unavailable, not to mention synced part of the WAL. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963
Reviewed By: ajkr
Differential Revision: D22011083
Pulled By: riversand963
fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536
Summary:
Application can ingest SST files with file checksum information, such that during ingestion, DB is able to check data integrity and identify of the SST file. The PR introduces generate_and_verify_file_checksum to IngestExternalFileOption to control if the ingested checksum information should be verified with the generated checksum.
1. If generate_and_verify_file_checksum options is *FALSE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enables the SST file checksum and the checksum function name matches the checksum function name in DB, we trust the ingested checksum, store it in Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call.
2. If generate_and_verify_file_checksum options is *TRUE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enable the SST file checksum, we will use the checksum generator from DB to calculate the checksum for each ingested SST files after they are copied or moved. Then, compare the checksum results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always report true_ and we store the DB generated checksum information in Manifest. _B)_ if the checksum function name mach, and checksum match, ingestion continues and stores the checksum information in the Manifest. Otherwise, terminate file ingestion and report file corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891
Test Plan: added unit test, pass make asan_check
Reviewed By: pdillinger
Differential Revision: D21935988
Pulled By: zhichao-cao
fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed
Summary:
As title. The prior change to the line is a typo. Fixing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6928
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D21873587
Pulled By: riversand963
fbshipit-source-id: f4837fc8792d7106bc230b7b499dfbb7a2847430
Summary:
DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.
This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly.
Two tests for this updated behavior of DB::OpenForReadOnly are also added.
Other minor changes:
1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900
Test Plan: passed make check; also manually tested a few ldb subcommands
Reviewed By: pdillinger
Differential Revision: D21822188
Pulled By: gg814
fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
Summary:
Preliminary user-timestamp support for delete.
If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.
Implementation
A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp.
The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear.
Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253
Reviewed By: ltamasi
Differential Revision: D20995328
Pulled By: riversand963
fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826
Test Plan:
1. make check -j64
2. Add a new unit test case
Reviewed By: anand1976
Differential Revision: D21471483
Pulled By: akankshamahajan15
fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
Summary:
1. Update column_family_memtables_ to point to latest column_family_set in
version_set after recovery.
2. Normalize file paths passed by application so that directories end with '/'
or '\\'.
3. In addition to missing files, corrupted files are also ignored in
best-efforts recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6824
Test Plan: COMPILE_WITH_ASAN=1 make check
Reviewed By: anand1976
Differential Revision: D21463905
Pulled By: riversand963
fbshipit-source-id: c48db8843cc93c8c1c7139c474b64e6f775307d2
Summary:
The patch extends `FindObsoleteFiles` and `PurgeObsoleteFiles` with
support for blob files. The behavior is analogous to SST files: obsolete
blob files are put on the "candidates for deletion" list, while live (and pending)
files are preserved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6807
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D21406249
Pulled By: ltamasi
fbshipit-source-id: 1948f71c31927564b61e8af394f50ca3964880d9