Commit Graph

975 Commits

Author SHA1 Message Date
Changyu Bi d5bc30befa Enforce status checking after Valid() returns false for IteratorWrapper (#11975)
Summary:
... when compiled with ASSERT_STATUS_CHECKED = 1.

The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782.

Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975

Test Plan:
* `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check`
* I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task.

Reviewed By: ajkr

Differential Revision: D50383790

Pulled By: cbi42

fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce
2023-10-18 09:38:38 -07:00
anand76 269478ee46 Support compressed and local flash secondary cache stacking (#11812)
Summary:
This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.

The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded.

We can easily implement additional admission policies if desired.

Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812

Reviewed By: pdillinger

Differential Revision: D49461842

Pulled By: anand1976

fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-21 20:30:53 -07:00
Yu Zhang 32fc1e6cdc Add unit test for the multiget fix when ReadOptions.read_tier == kPersistedTier and disableWAL == true (#11854)
Summary:
Add unit tests for the fix in https://github.com/facebook/rocksdb/pull/11700

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11854

Reviewed By: anand1976

Differential Revision: D49392462

Pulled By: jowlyzhang

fbshipit-source-id: bd6978e4888074fa5417f3ccda7a78a2c7eee9c6
2023-09-21 14:59:58 -07:00
Changyu Bi 9d71682d1b Add statistics `COMPACTION_CPU_TOTAL_TIME` for total compaction time (#11741)
Summary:
Existing compaction statistics are `COMPACTION_TIME` and `COMPACTION_CPU_TIME` which are histogram and are logged at the end of a compaction. The new statistics `COMPACTION_CPU_TOTAL_TIME` is for cumulative total compaction time which is updated regularly during a compaction. This allows user to more closely track compaction cpu usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11741

Test Plan: * new unit test `DBTestWithParam.CompactionTotalTimeTest`

Reviewed By: ajkr

Differential Revision: D48608094

Pulled By: cbi42

fbshipit-source-id: b597109f3e4bf2237fb5a216b6fd036e5363b4c0
2023-09-12 15:48:36 -07:00
Changyu Bi 76ed9a3990 Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.

Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686

Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`

Reviewed By: hx235

Differential Revision: D48176132

Pulled By: cbi42

fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
2023-08-09 15:46:44 -07:00
Peter Dillinger e214964f40 Fix a potential memory leak on row_cache insertion failure (#11682)
Summary:
Although the built-in Cache implementations never return failure on Insert without keeping a reference (Handle), a custom implementation could. The code for inserting into row_cache does not keep a reference but does not clean up appropriately on non-OK. This is a fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11682

Test Plan: unit test added that previously fails under ASAN

Reviewed By: ajkr

Differential Revision: D48153831

Pulled By: pdillinger

fbshipit-source-id: 86eb7387915c5b38b6ff5dd8deb4e1e223b7d020
2023-08-08 11:34:41 -07:00
Andrew Kryczka 05c3b8ecac Prepare for specialized interface for row cache (#11620)
Summary:
An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema.

This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620

Reviewed By: pdillinger

Differential Revision: D47536962

Pulled By: ajkr

fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691
2023-07-18 19:12:58 -07:00
Changyu Bi bc04ec85db Make option `level_compaction_dynamic_level_bytes` true by default (#11525)
Summary:
after https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic by RocksDB and requires no manual compaction from user. Making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. Wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size.

The PR mostly contains fixes for unit tests as they assumed `level_compaction_dynamic_level_bytes=false`. Most notable change is commit f742be330c and b1928e42b3 which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps to reduce the change needed for unit tests. I think this default option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desired in unit tests as it makes it easier to create a desired LSM shape.

Comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and change made in https://github.com/facebook/rocksdb/issues/10057.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11525

Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change.

Reviewed By: ajkr

Differential Revision: D46654256

Pulled By: cbi42

fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
2023-06-15 21:12:39 -07:00
mayue.fight fa878a0107 Support to create a CF by importing multiple non-overlapping CFs (#11378)
Summary:
The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317).
Flink uses rocksdb as the state backend,  all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances.
In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one.

This PR is mainly used to quickly merge multiple DBs into one.

We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys.

The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378

Reviewed By: ajkr

Differential Revision: D46413709

Pulled By: cbi42

fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb
2023-06-15 12:25:04 -07:00
Jay Huh 81aeb15988 Add WaitForCompact with WaitForCompactOptions to public API (#11436)
Summary:
Context:

This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions`

1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing.

1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct.
2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h`
3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436

Test Plan:
Following tests added
- `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish`
- `DBCompactionTest::WaitForCompactAbortOnPauseAborted`
- `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted`
- `DBCompactionTest::WaitForCompactShutdownWhileWaiting`
- `TransactionTest::WaitForCompactAbortOnPause`

NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place.

Reviewed By: pdillinger

Differential Revision: D45799659

Pulled By: jaykorean

fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b
2023-05-25 17:25:51 -07:00
Peter Dillinger 4067acabca Compatibility step for separating BlockCache and GeneralCache APIs (#11450)
Summary:
Add two type aliases for Cache: BlockCache and GeneralCache, and add LRUCacheOptions::MakeSharedGeneralCache(). This will ease upgrade to an intended future change to separate the cache API between block cache and other (general) uses, including row cache. Separating the APIs will make it easier to expose more details of block caching for customization. For example, it would be nice to pass the file unique ID and offset as the logical cache key instead of using a Slice, which could facilitate some file-specific customizations in block cache. This would also make it clear that HyperClockCache is not usable as a general cache, because it can only deal with fixed-size block cache keys.

block_cache, row_cache, and blob_cache are the uses of Cache in the public API. blob_cache should be able to use BlockCache while row_cache is a GeneralCache user, as its keys are of arbitrary size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11450

Test Plan: see updated unit test.

Reviewed By: anand1976

Differential Revision: D45882067

Pulled By: pdillinger

fbshipit-source-id: ff5d9f0b644f87ae337a29a7728ce3ed07b2a4b2
2023-05-18 20:40:19 -07:00
mayue.fight 8d8eb0e77e Support Clip DB to KeyRange (#11379)
Summary:
This PR is part of the request https://github.com/facebook/rocksdb/issues/11317.
(Another part is https://github.com/facebook/rocksdb/pull/11378)

ClipDB() will clip the entries in the CF according to the range [begin_key, end_key). All the entries outside this range will be completely deleted (including tombstones).
 This feature is mainly used to ensure that there is no overlapping Key when calling CreateColumnFamilyWithImports() to import multiple CFs.

When Calling ClipDB [begin, end), there are the following steps

1.  Quickly and directly delete files without overlap
 DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr)
2. Delete the Key outside the range
Delete[smallest_key, begin) + Delete[end, largest_key]
3. Delete the tombstone through Manul Compact
CompactRange(option, nullptr, nullptr)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379

Reviewed By: ajkr

Differential Revision: D45840358

Pulled By: cbi42

fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4
2023-05-18 13:25:01 -07:00
Jay Huh 586d78b31e Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:

In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.

In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.

This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.

Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.

Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443

Test Plan:
Existing tests that involve a shutdown in progress:

- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle

Reviewed By: pdillinger

Differential Revision: D45923426

Pulled By: jaykorean

fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-17 18:13:50 -07:00
Peter Dillinger 206fdea3d9 Change internal headers with duplicate names (#11408)
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.

No public API changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D45456696

Pulled By: pdillinger

fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
2023-05-17 11:27:09 -07:00
clundro 50b33ebb1b remove redundant move (#11418)
Summary:
when I use g++-13 to exec the `make all` command,  the output throws the warnings.
```
db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
```

and I also add some `(void)unused_variable` statements because of the cmake argument `-Wunused-but-set-variable -Wunused-but-set-variable`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418

Reviewed By: akankshamahajan15

Differential Revision: D45528223

Pulled By: ajkr

fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
2023-05-03 09:37:21 -07:00
Changyu Bi b3c43a5b99 Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340)
Summary:
When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340

Test Plan:
- New unit tests
- `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.

Reviewed By: ajkr

Differential Revision: D44563884

Pulled By: cbi42

fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
2023-04-06 11:20:43 -07:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Hui Xiao 98d5db5c2e Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
-  File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
    - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number`  ordering check. This will result in file ordering s1 <  s2 <  s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file  is assigned with the minimum `epoch_number` among input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or  by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run 36a5686ec0 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 13:29:37 -08:00
Andrew Kryczka 5cf6ab6f31 Ran clang-format on db/ directory (#10910)
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910

Reviewed By: riversand963

Differential Revision: D40880683

Pulled By: ajkr

fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
2022-11-02 14:34:24 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Yueh-Hsuan Chiang e267909ecf Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348)
Summary:
FIFO compaction can theoretically open a DB with any compaction style.
However, the current code only allows FIFO compaction to open a DB with
a single level.

This PR relaxes the limitation of FIFO compaction and allows it to open a
DB with multiple levels.  Below is the read / write / compaction behavior:

* The read behavior is untouched, and it works like a regular rocksdb instance.
* The write behavior is untouched as well.  When a FIFO compacted DB
is opened with multiple levels, all new files will still be in level 0, and no files
will be moved to a different level.
* Compaction logic is extended.  It will first identify the bottom-most non-empty level.
Then, it will delete the oldest file in that level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348

Test Plan:
Added a new test to verify the migration from level to FIFO where the db has multiple levels.
Extended existing test cases in db_test and db_basic_test to also verify
all entries of a key after reopening the DB with FIFO compaction.

Reviewed By: jay-zhuang

Differential Revision: D40233744

fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
2022-10-18 14:38:13 -07:00
Changyu Bi a6ce1955b1 Fix flaky test ShuttingDownNotBlockStalledWrites (#10800)
Summary:
DBTest::ShuttingDownNotBlockStalledWrites is flaky, added new sync point dependency to fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10800

Test Plan: gtest-parallel --repeat=1000 ./db_test --gtest_filter="*ShuttingDownNotBlockStalledWrites"

Reviewed By: jay-zhuang

Differential Revision: D40239116

Pulled By: jay-zhuang

fbshipit-source-id: 8c2d7e7df58f202d287bd9f5c9b60b7eff270d0c
2022-10-10 13:58:55 -07:00
Qingping Wang a45e6878f3 fix issue 10751 (#10765)
Summary:
Fix https://github.com/facebook/rocksdb/issues/10751 where a stalled write could be blocked forever when DB shutdown.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10765

Reviewed By: ajkr

Differential Revision: D40110069

Pulled By: ajkr

fbshipit-source-id: 598c05777db9be85913a0a85e421b3295ecdff5e
2022-10-10 09:46:09 -07:00
Jay Zhuang edae671ce0 Re-enable SuggestCompactRangeTest and add Universal Compaction test (#10473)
Summary:
The feature `SuggestCompactRange()` is still experimental. Just
re-add the test back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10473

Test Plan: CI

Reviewed By: akankshamahajan15

Differential Revision: D38427153

Pulled By: jay-zhuang

fbshipit-source-id: 0b4491c947cbce6c18ff147b167e3c678633129a
2022-08-05 13:16:58 -07:00
Jay Zhuang fcccc412d7 Remove Travis CI (#10407)
Summary:
Travis CI is depreciated and haven't been maintained for some time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407

Reviewed By: ajkr

Differential Revision: D38078382

Pulled By: jay-zhuang

fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb
2022-07-22 20:16:45 -07:00
Yu Zhao 00540916 bfc737da21 fix typos in some code and comment (#10139)
Summary:
Minor issue, I just found a few typos on db_test and column_family while reading the code. And I have this PR opened to contribute.  :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10139

Reviewed By: ajkr

Differential Revision: D38007098

Pulled By: jay-zhuang

fbshipit-source-id: 511947b32424c34348184691216640f32c410fb1
2022-07-22 19:25:52 -07:00
Andrew Kryczka 25cc564ff7 Make RateLimiter not Customizable (#10378)
Summary:
(PR created for informational/testing purposes only.)

- Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()`
- Benefit over #10374 is eliminating race conditions with Configurable framework.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378

Reviewed By: pdillinger

Differential Revision: D37914865

fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30
2022-07-18 14:48:42 -07:00
Levi Tamasi c73d2a9d18 Add API for writing wide-column entities (#10242)
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37369748

Pulled By: ltamasi

fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
2022-06-25 15:30:47 -07:00
Andrew Kryczka 5d6005c780 Add WriteOptions::protection_bytes_per_key (#10037)
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.

Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.

There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037

Test Plan:
- Manual
  - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
  - Enabled in MyShadow for 1+ week
- Automated
  - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
  - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection

Reviewed By: cbi42

Differential Revision: D36614569

Pulled By: ajkr

fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
2022-06-16 23:10:07 -07:00
mpoeter 77f4799515 Fix potential leak when reusing PinnableSlice instances. (#10166)
Summary:
`PinnableSlice` may hold a handle to a cache value which must be released to correctly decrement the ref-counter. However, when `PinnableSlice` variables are reused, e.g. like this:
```
PinnableSlice pin_slice;
db.Get("foo", &pin_slice);
db.Get("foo", &pin_slice);
```
then the second `Get` simply overwrites the old value in `pin_slice` and the handle returned by the first `Get` is _not_ released.

This PR adds `Reset` calls to the `Get`/`MultiGet` calls that accept `PinnableSlice` arguments to ensure proper cleanup of old values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10166

Reviewed By: hx235

Differential Revision: D37151632

Pulled By: ajkr

fbshipit-source-id: 9dd3c3288300f560531b843f67db11aeb569a9ff
2022-06-14 21:29:52 -07:00
Yanqin Jin 1777e5f7e9 Snapshots with user-specified timestamps (#9879)
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.

It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.

This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.

In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.

This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```

If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.

Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();

// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
    std::shared_ptr<TransactionNotifier> notifier,
    TxnTimestamp ts,
    std::shared_ptr<const Snapshot>* ret);

// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```

The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```

We also provide two additional APIs for stats collection and reporting purposes.

```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
    TxnTimestamp ts_lb,
    TxnTimestamp ts_ub,
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```

To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```

Before shutdown, RocksDB will release all timestamped snapshots.

Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.

Timestamped snapshots can also be used with user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879

Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D35783919

Pulled By: riversand963

fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
2022-06-10 16:07:03 -07:00
Peter Dillinger d3a3b02134 Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128)
Summary:
When opening an SST file created using index_type=kHashSearch,
the *current* prefix_extractor would be saved, and used with hash index
if the *new current* prefix_extractor at query time is compatible with
the SST file. This is a problem if the prefix_extractor at SST open time
is not compatible but SetOptions later changes (back) to one that is
compatible.

This change fixes that by using the known compatible (or missing) prefix
extractor we save for use with prefix filtering. Detail: I have moved the
InternalKeySliceTransform wrapper to avoid some indirection and remove
unnecessary fields.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128

Test Plan:
expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails
before fix and probably covers some other previously uncovered cases.

Reviewed By: siying

Differential Revision: D36955738

Pulled By: pdillinger

fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10
2022-06-10 08:51:45 -07:00
Akanksha Mahajan 700d597bd8 Expose unix time in rocksdb::Snapshot (#9923)
Summary:
RocksDB snapshot already has a member unix_time_ set after
snapshot is taken. It is now exposed through GetSnapshotTime() API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9923

Test Plan: Update unit tests

Reviewed By: riversand963

Differential Revision: D36048275

Pulled By: akankshamahajan15

fbshipit-source-id: 825210ec287deb0bc3aaa9b8e1f079f07ad686fa
2022-05-23 22:31:08 -07:00
Yanqin Jin b58a1a035b Revert "Bugfix/fix manual flush blocking bug (#9893)" (#9992)
Summary:
This reverts commit 6d2577e567.

A proposal for resolving our current internal test failures. A fix is
being planned.

More context can be found: https://github.com/facebook/rocksdb/pull/9893#issuecomment-1126230634
TSAN error: https://github.com/facebook/rocksdb/pull/9893#issuecomment-1126233132

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9992

Reviewed By: siying

Differential Revision: D36379154

Pulled By: riversand963

fbshipit-source-id: b240261e766eff099513799cf5631832093f4cd2
2022-05-13 12:31:30 -07:00
sdong 736a7b5433 Remove own ToString() (#9955)
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955

Test Plan: Watch CI tests

Reviewed By: riversand963

Differential Revision: D36176799

fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
2022-05-06 13:03:58 -07:00
RoeyMaor 6d2577e567 Bugfix/fix manual flush blocking bug (#9893)
Summary:
Fix https://github.com/facebook/rocksdb/issues/9892

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9893

Reviewed By: jay-zhuang

Differential Revision: D35880959

Pulled By: ajkr

fbshipit-source-id: dad1139ad0983cfbd5c5cd6fa6b71022f889735a
2022-04-25 18:52:33 -07:00
Peter Dillinger 1bac873fcf Mark GetLiveFilesStorageInfo ready for production use (#9868)
Summary:
... by filling out remaining testing hole: handling of
db_pathsi+cf_paths. (Note that while GetLiveFilesStorageInfo works
with db_paths / cf_paths, Checkpoint and BackupEngine do not and
are marked appropriately.)

Also improved comments for "live files" APIs, and grouped them
together in db.h.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9868

Test Plan: Adding to existing unit tests

Reviewed By: jay-zhuang

Differential Revision: D35752254

Pulled By: pdillinger

fbshipit-source-id: c70eb67748fad61826e2f554b674638700abefb2
2022-04-20 16:09:34 -07:00
Peter Dillinger 40e3f30a28 Fix FileStorageInfo fields from GetLiveFilesMetaData (#9769)
Summary:
In making `SstFileMetaData` inherit from `FileStorageInfo`, I
overlooked setting some `FileStorageInfo` fields when then default
`SstFileMetaData()` ctor is used. This affected `GetLiveFilesMetaData()`.

Also removed some buggy `static_cast<size_t>`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9769

Test Plan: Updated tests

Reviewed By: jay-zhuang

Differential Revision: D35220383

Pulled By: pdillinger

fbshipit-source-id: 05b4ee468258dbd3699517e1124838bf405fe7f8
2022-03-29 14:36:35 -07:00
Yanqin Jin 629e3e1d77 Fix spelling in public API (#9490)
Summary:
I feel it would be nice if we can fix this spelling error.

In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490

Test Plan: make check

Reviewed By: hx235

Differential Revision: D33949862

Pulled By: riversand963

fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466
2022-02-03 15:15:23 -08:00
Yanqin Jin 3122cb4358 Revise APIs related to user-defined timestamp (#8946)
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".

According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).

For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.

Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.

The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.

Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946

Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0

Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```

Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.831 micros/op 546235 ops/sec;   60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom   :       1.820 micros/op 549404 ops/sec;   60.8 MB/s
```

Reviewed By: ltamasi

Differential Revision: D33721359

Pulled By: riversand963

fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
2022-02-01 22:19:01 -08:00
Yanqin Jin d10c5c08d3 Remove iter_start_seqnum and preserve_deletes (#9430)
Summary:
According to https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl.cc#L2896:L2911 and https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl_open.cc#L203:L208,
we are going to remove `iter_start_seqnum` and `preserve_deletes` starting from RocksDB 7.0

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9430

Test Plan: make check and CI

Reviewed By: ajkr

Differential Revision: D33753639

Pulled By: riversand963

fbshipit-source-id: c80aab8e8d8fc33e52472fed524ed703d0ffc8b6
2022-01-28 13:28:38 -08:00
Hui Xiao 1e0e883ca5 Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33804938

Pulled By: hx235

fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
2022-01-27 13:01:09 -08:00
Si Ke 93b1de4f45 Enable db_test running in Centos 32 bit OS and Alpine 32 bit OS (#9294)
Summary:
Closes https://github.com/facebook/rocksdb/issues/9271

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9294

Reviewed By: riversand963, hx235

Differential Revision: D33586002

Pulled By: pdillinger

fbshipit-source-id: 3d1a2fa71023e108613ff03dbd37a5f954fc4920
2022-01-14 11:58:18 -08:00
slk 2e5f764294 Make IncreaseFullHistoryTsLow to a public API (#9221)
Summary:
As (https://github.com/facebook/rocksdb/issues/9210) discussed, the **full_history_ts_low** is a member of CompactRangeOptions currently, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request.
However, users may want to advance the fllHistoryTsLow without an immediate compact.
This merge make IncreaseFullHistoryTsLow to a public API so users can advance each CF's fullHistoryTsLow seperately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221

Reviewed By: akankshamahajan15

Differential Revision: D33201106

Pulled By: riversand963

fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee
2021-12-23 11:03:51 -08:00
mrambacher 9a116ab4b4 Add NewMetaDataIterator method (#8692)
Summary:
Fixes a problem where the iterator for metadata was being treated as a non-user key when in fact it was a user key.  This led to a problem where the property keys could not be searched for correctly.

The main exposure of this problem was that the HashIndexReader could not get the "prefixes" property correctly, resulting in the failure of retrieval/creation of the BlockPrefixIndex.

Added BlockBasedTableTest.SeekMetaBlocks test to validate this condition.

Fixing this condition exposed two other tests (SeekWithPrefixLongerThanKey, MultiGetPrefixFilter) that passed incorrectly previously and now failed.  Updated those two tests to pass.  Not sure if the tests are functionally correct/still appropriate, but made them pass...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8692

Reviewed By: riversand963

Differential Revision: D33119539

Pulled By: mrambacher

fbshipit-source-id: 658969fe9265f73dc184dab97cc3f4eaed2d881a
2021-12-21 11:32:49 -08:00
Si Ke 79f4a04ee3 Get DBTest passing Assert Status Checked (#7737)
Summary:
Closes https://github.com/facebook/rocksdb/pull/7737

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9231

Reviewed By: hx235

Differential Revision: D32978332

Pulled By: pdillinger

fbshipit-source-id: b28900b685d60c668529a90dbaa8e1b357b28f76
2021-12-09 11:00:17 -08:00
mrambacher 7cd5835a28 Make RateLimiter Customizable (#9141)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9141

Reviewed By: zhichao-cao

Differential Revision: D32432190

Pulled By: mrambacher

fbshipit-source-id: 7930ed88a02412128cd407b5063522484e45c6ce
2021-12-01 06:57:02 -08:00
anand76 dddb791c18 Enable a few unit tests to use custom Env objects (#9087)
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087

Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.

Reviewed By: riversand963

Differential Revision: D32007222

Pulled By: anand1976

fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
2021-11-08 11:05:59 -08:00
Peter Dillinger ad5325a736 Experimental support for SST unique IDs (#8990)
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).

Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990

Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D31582865

Pulled By: pdillinger

fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
2021-10-18 23:32:01 -07:00
Peter Dillinger 3ffb3baa0b Add (Live)FileStorageInfo API (#8968)
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.

This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.

Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.

Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)

Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968

Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).

Reviewed By: siying

Differential Revision: D31242045

Pulled By: pdillinger

fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
2021-10-16 10:04:32 -07:00