Commit Graph

1034 Commits

Author SHA1 Message Date
Jay Zhuang 1c290c785d RemoteCompaction support Fallback to local compaction (#8709)
Summary:
Add support for fallback to local compaction, the user can
return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to
run the compaction locally instead of waiting for the remote compaction
result.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8709

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30560163

Pulled By: jay-zhuang

fbshipit-source-id: 65d8905a4a1bc185a68daa120997f21d3198dbe1
2021-09-18 00:25:04 -07:00
Yanqin Jin b512f4bc76 Batch blob read IO for MultiGet (#8699)
Summary:
In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()`
to read the blobs instead of issuing multiple `Read()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8699

Test Plan:
```
make check
```

Reviewed By: ltamasi

Differential Revision: D31030861

Pulled By: riversand963

fbshipit-source-id: a0df6060cbfd54cff9515a4eee08807b1dbcb0c8
2021-09-17 19:23:13 -07:00
Peter Dillinger 4149d044cd Change `SstFileMetaData::size` from `size_t` to `uint64_t` (#8926)
Summary:
Because even 32-bit systems can have large files

This is a "change" that I don't want intermingled with an upcoming refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8926

Test Plan: CI

Reviewed By: zhichao-cao

Differential Revision: D31020974

Pulled By: pdillinger

fbshipit-source-id: ca9eb4510697df6f1f55e37b37730b88b1809a92
2021-09-17 13:23:34 -07:00
mrambacher 272cc77751 Added a default Name method to Statistics (#8918)
Summary:
This keeps the implementations/API backward compatible.  Implementations of Statistics will need to override this method (and be registered with the ObjectRegistry) in order to be created via CreateFromString.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8918

Reviewed By: pdillinger

Differential Revision: D30958916

Pulled By: mrambacher

fbshipit-source-id: 75b99a84e9e11fda2a9e8eff9ee1ef69a17517b2
2021-09-17 07:25:43 -07:00
Akanksha Mahajan d6aa8c49f8 Expose blob file information through the EventListener interface (#8675)
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files.
 2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
 3. Test OnFile*Finish operations notifications with Blob Files.
 4. Log the blob file creation/deletion events through EventLogger in Log file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675

Test Plan: Add new unit tests in listener_test

Reviewed By: ltamasi

Differential Revision: D30412613

Pulled By: akankshamahajan15

fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07
2021-09-16 17:23:36 -07:00
Jay Zhuang b97c53b629 Add compaction priority information in RemoteCompaction (#8707)
Summary:
Add compaction priority information in RemoteCompaction, which
can be used to schedule high priority job first.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8707

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30548401

Pulled By: jay-zhuang

fbshipit-source-id: b30446511fb31b4583c49edd8565d496cf013a34
2021-09-16 15:09:35 -07:00
Peter Dillinger 2819c7840e Fix PrepopulateBlockCache::kFlushOnly (#8750)
Summary:
kFlushOnly currently means "always" except in the case of
remote compaction. This makes it flushes only.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8750

Test Plan: test updated

Reviewed By: akankshamahajan15

Differential Revision: D30968034

Pulled By: pdillinger

fbshipit-source-id: 5dbd24dde18852a0e937a540995fba9bfbe89037
2021-09-15 15:33:20 -07:00
anand76 7743f033b1 More robust checking of IO uring completion data (#8894)
Summary:
Potential bugs in the IO uring implementation can cause bad data to be returned in the completion queue. Add some checks in the PosixRandomAccessFile::MultiRead completion handling code to catch such errors and fail the entire MultiRead. Also log some diagnostic messages and stack trace.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8894

Reviewed By: siying, pdillinger

Differential Revision: D30826982

Pulled By: anand1976

fbshipit-source-id: af91815ac760e095d6cc0466cf8bd5c10167fd15
2021-09-15 12:44:43 -07:00
eharry 0b6be7eb68 Fix WAL log data corruption #8723 (#8746)
Summary:
Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together (https://github.com/facebook/rocksdb/issues/8723)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8746

Reviewed By: ajkr

Differential Revision: D30758468

Pulled By: riversand963

fbshipit-source-id: 07c20899d5f2447dc77861b4845efc68a59aa4e8
2021-09-13 20:15:59 -07:00
Yanqin Jin 2a2b3e03a5 Allow WriteBatch to have keys with different timestamp sizes (#8725)
Summary:
In the past, we unnecessarily requires all keys in the same write batch
to be from column families whose timestamps' formats are the same for
simplicity. Specifically, we cannot use the same write batch to write to
two column families, one of which enables timestamp while the other
disables it.

The limitation is due to the member `timestamp_size_` that used to exist
in each `WriteBatch` object. We pass a timestamp_size to the constructor
of `WriteBatch`. Therefore, users can simply use the old
`WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while
the internal implementation of `WriteBatch` will take care of memory
allocation for timestamps.

The above is not necessary.
One the one hand, users can set up a memory buffer to store user key and
then contiguously append the timestamp to the user key. Then the user
can pass this buffer to the `WriteBatch::Put(Slice&)` API.
On the other hand, users can set up a SliceParts object which is an
array of Slices and let the last Slice to point to the memory buffer
storing timestamp. Then the user can pass the SliceParts object to the
`WriteBatch::Put(SliceParts&)` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30654499

Pulled By: riversand963

fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2
2021-09-12 15:34:26 -07:00
Levi Tamasi 5f40b05c98 Update HISTORY.md for PR 8899 (#8905)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8905

Reviewed By: zhichao-cao

Differential Revision: D30873416

Pulled By: ltamasi

fbshipit-source-id: 6e55ec14a7fd2e562aa24cd0274e2436369923f5
2021-09-12 08:19:05 -07:00
Hui Xiao 12542488ef Add public API RateLimiter::GetTotalPendingRequests() (#8890)
Summary:
Context/Summary:
As users requested, a public API RateLimiter::GetTotalPendingRequests() is added to expose the total number of pending requests for bytes in the rate limiter, which is the size of the request queue of that priority (or of all priorities, if IO_TOTAL is interested) at the time when this API is called.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8890

Test Plan:
- Passing added new unit tests
- Passing existing unit tests

Reviewed By: ajkr

Differential Revision: D30815500

Pulled By: hx235

fbshipit-source-id: 2dfa990f651c1c47378b6215c751ad76a5824300
2021-09-10 08:37:04 -07:00
Hui Xiao 6785135bc2 Update HISTORY.md for new rate limiter io priorities (#8896)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8896

Reviewed By: ajkr

Differential Revision: D30846120

Pulled By: hx235

fbshipit-source-id: 9224ebce5437d63b0fb8af9171c6041a9ea5d90f
2021-09-09 13:26:20 -07:00
hx235 45175ca2e1 Charge read to rate limiter in BackupEngine (#8722)
Summary:
Context:
While all the non-trivial write operations in BackupEngine go through the RateLimiter, reads currently do not. In general, this is not a huge issue because (especially since some I/O efficiency fixes) reads in BackupEngine are mostly limited by corresponding writes, for both backup and restore. But in principle we should charge the RateLimiter for reads as well.
- Charged read operations in `BackupEngineImpl::CopyOrCreateFile`, `BackupEngineImpl::ReadFileAndComputeChecksum`, `BackupEngineImpl::BackupMeta::LoadFromFile` and `BackupEngineImpl::GetFileDbIdentities`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8722

Test Plan:
- Passed existing tests
- Passed added unit tests

Reviewed By: pdillinger

Differential Revision: D30610464

Pulled By: hx235

fbshipit-source-id: 9b08c9387159a5385c8d390d6666377a0d0117e5
2021-09-08 16:24:40 -07:00
Zhiyi Zhang 0cb0fc6fd3 Add DB properties for BlobDB (#8734)
Summary:
RocksDB exposes certain internal statistics via the DB property interface.
However, there are currently no properties related to BlobDB.

For starters, we would like to add the following BlobDB properties:
`rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
`rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
`rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
`rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
`rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734

Test Plan:
```
➜  rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
[==========] Running 16 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from DBBlobBasicTest
[ RUN      ] DBBlobBasicTest.GetBlob
[       OK ] DBBlobBasicTest.GetBlob (12 ms)
[ RUN      ] DBBlobBasicTest.MultiGetBlobs
[       OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_CorruptIndex
[       OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
[       OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
[ RUN      ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
[       OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
[ RUN      ] DBBlobBasicTest.GenerateIOTracing
[       OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
[ RUN      ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
[       OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
[ RUN      ] DBBlobBasicTest.GetMergeBlobWithPut
[       OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
[ RUN      ] DBBlobBasicTest.MultiGetMergeBlobWithPut
[       OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
[ RUN      ] DBBlobBasicTest.BlobDBProperties
[       OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
[----------] 10 tests from DBBlobBasicTest (124 ms total)

[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
[ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
[       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)

[----------] Global test environment tear-down
[==========] 16 tests from 2 test cases ran. (2190 ms total)
[  PASSED  ] 16 tests.
```

Reviewed By: ltamasi

Differential Revision: D30690849

Pulled By: Zhiyi-Zhang

fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 12:22:04 -07:00
Peter Dillinger e40b04e9fa Fix POSIX LockFile after failure to create file (#8747)
Summary:
Failure to create the lock file (e.g. out of space) could
prevent future LockFile attempts in the same process on the same file
from succeeding.

Also added DEBUG code to fail assertion if PosixFileLock is destroyed
without using UnlockFile (which is a risk because FileLock is in the
public API with virtual destructor).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8747

Test Plan: test added

Reviewed By: ajkr

Differential Revision: D30732543

Pulled By: pdillinger

fbshipit-source-id: 4c30a959566d91f778d6fad3fbbd5f3941b097c1
2021-09-07 22:41:36 -07:00
Andrew Kryczka 9308ff366c Bytes read/written stats for `CreateNewBackup*()` (#8819)
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819

Reviewed By: pdillinger

Differential Revision: D30779238

Pulled By: ajkr

fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8
2021-09-07 18:25:16 -07:00
Andrew Kryczka 941543721d Bytes read stat for `VerifyChecksum()` and `VerifyFileChecksums()` APIs (#8741)
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741

Test Plan: new unit test

Reviewed By: zhichao-cao

Differential Revision: D30708578

Pulled By: ajkr

fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449
2021-09-07 13:28:29 -07:00
Peter Dillinger c9cd5d25a8 Remove some unneeded code (#8736)
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736

Test Plan: existing tests

Reviewed By: mrambacher

Differential Revision: D30700039

Pulled By: pdillinger

fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
2021-09-01 14:28:58 -07:00
Peter Dillinger 32752551b9 Fix a buffer size race condition in BackupEngine (#8732)
Summary:
If RateLimiter burst bytes changes during concurrent Restore
operations

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8732

Test Plan: updated unit test fails with TSAN before change, passes after

Reviewed By: ajkr

Differential Revision: D30683879

Pulled By: pdillinger

fbshipit-source-id: d0ddb3587ade91ee2a4d926b475acf7781b03086
2021-09-01 14:28:58 -07:00
anand76 ec9f52ece6 Fix a race in LRUCacheShard::Promote (#8717)
Summary:
In ```LRUCacheShard::Promote```, a reference is released outside the LRU mutex. Fix the race condition.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8717

Reviewed By: zhichao-cao

Differential Revision: D30649206

Pulled By: anand1976

fbshipit-source-id: 09c0af05b2294a7fe2c02876a61b0bad6e3ada61
2021-08-30 19:10:55 -07:00
Peter Dillinger 13ded69484 Built-in support for generating unique IDs, bug fix (#8708)
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.

This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.

Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
  * Lots of info from default Env like time, process id, and hostname.
  * std::random_device
  * port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.

DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.

GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.

Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708

Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D30563780

Pulled By: pdillinger

fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
2021-08-30 15:20:41 -07:00
Merlin Mao 6c2bd28a61 Update comments, fix typos. (#8721)
Summary:
- Removed the default empty constructors of `TraceWriter` and `TraceReader`.
- Removed unused `ReadFooter()` from `ReplayerImpl`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8721

Test Plan: None

Reviewed By: zhichao-cao

Differential Revision: D30609743

Pulled By: autopear

fbshipit-source-id: 7e2626b015bd57ebb408a2836b4b4217cea10002
2021-08-27 13:16:32 -07:00
anand76 ebaa3c8a59 Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714)
Summary:
DumpStats() iterates through the ColumnFamilySet. There is a potential
race condition because it does Ref the cfd, and the cfd could get
destroyed during the iteration.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30580199

Pulled By: anand1976

fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850
2021-08-26 15:40:26 -07:00
Yanqin Jin f235f4b0a3 Fix a bug of secondary instance sequence going backward (#8653)
Summary:
Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses
`ManifestTailer` whose `Iterate()` method can cause the db's
`last_sequence_` to go backward. Consequently, read requests can see
out-dated data. For example, latest changes to the primary will not be
seen on the secondary even after a `TryCatchUpWithPrimary()` if no new
write batches are read from the WALs and no new MANIFEST entries are
read from the MANIFEST.

Fix the bug so that `VersionEditHandler::CheckIterationResult` will
never decrease `last_sequence_`, `last_allocated_sequence_` and
`last_published_sequence_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8653

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30272084

Pulled By: riversand963

fbshipit-source-id: c6a49c534b2509b93ef62d8936ed0acd5b860eaa
2021-08-24 18:18:36 -07:00
Yanqin Jin 229350ef48 Allow iterate refresh for secondary instance (#8700)
Summary:
Test plan
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700

Reviewed By: zhichao-cao

Differential Revision: D30523907

Pulled By: riversand963

fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91
2021-08-24 15:40:56 -07:00
Jay Zhuang 249b1078c9 Add extra information to RemoteCompaction APIs (#8680)
Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30444859

Pulled By: jay-zhuang

fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448
2021-08-23 16:27:38 -07:00
Peter Dillinger 0637c8d36c Fix typo in 6.24.0 HISTORY.md (#8694)
Summary:
fix typo

Also, clarified change of C API signatures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8694

Test Plan: visual

Reviewed By: ltamasi

Differential Revision: D30492882

Pulled By: pdillinger

fbshipit-source-id: ac6dc3dcefa01c91fd87fc7f50279ea5e13fa41d
2021-08-23 13:30:34 -07:00
Levi Tamasi 8c9e689790 Update version.h and HISTORY.md for the 6.24 release (#8688)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8688

Reviewed By: ajkr, riversand963

Differential Revision: D30467746

Pulled By: ltamasi

fbshipit-source-id: 0fce0d42fe2fe3cb56d7a89607154b3b957f09b6
2021-08-20 22:28:16 -07:00
Peter Dillinger 2a383f21f4 Add Bloom/Ribbon hybrid API support (#8679)
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.

So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)

I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.

C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.

BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679

Test Plan: new + updated tests, including crash test

Reviewed By: jay-zhuang

Differential Revision: D30445797

Pulled By: pdillinger

fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
2021-08-20 18:00:16 -07:00
anand76 f35042ca40 Add a PerfContext counter for secondary cache hits (#8685)
Summary:
Add a PerfContext counter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8685

Reviewed By: zhichao-cao

Differential Revision: D30453957

Pulled By: anand1976

fbshipit-source-id: 42888a3ced240e1c44446d52d3b04adfb01f5665
2021-08-20 15:17:30 -07:00
Akanksha Mahajan 5efec84c60 Fix blob callback in compaction and atomic flush (#8681)
Summary:
Pass BlobFileCompletionCallback  in case of atomic flush and
compaction job which is currently nullptr(default parameter).
BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to
SstFileManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681

Test Plan: CircleCI jobs

Reviewed By: ltamasi

Differential Revision: D30445998

Pulled By: akankshamahajan15

fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021
2021-08-20 11:41:14 -07:00
mrambacher 9eb002fcf0 Fix some minor issues in the Customizable infrastructure (#8566)
Summary:
- Fix issue with OptionType::Vector when the nested item is a Customizable with no names
- Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector;
- Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options;
- Fix/Add tests for null/empty customizable objects
- Move the RegisterTestObjects from customizable_test into testutil.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566

Reviewed By: zhichao-cao

Differential Revision: D30303724

Pulled By: mrambacher

fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5
2021-08-19 10:10:47 -07:00
Merlin Mao d10801e983 Allow Replayer to report the results of TraceRecords. (#8657)
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.

New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".

`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657

Reviewed By: ajkr

Differential Revision: D30290216

Pulled By: autopear

fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
2021-08-18 17:06:14 -07:00
Yanqin Jin 2b367fa8cc Fix bug caused by releasing snapshot(s) during compaction (#8608)
Summary:
In debug mode, we are seeing assertion failure as follows

```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```

It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.

In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.

Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.

Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```

It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30145645

Pulled By: riversand963

fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
2021-08-17 22:14:20 -07:00
Levi Tamasi 6878cedcc3 Add statistics support to integrated BlobDB (#8667)
Summary:
The patch adds statistics support to the integrated BlobDB implementation,
namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and
`BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms
`BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like
`BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`,
`BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.)
Note that the vast majority of the old BlobDB's tickers/histograms are not
really applicable to the new implementation, since they e.g. pertain to calling
dedicated BlobDB APIs (which the integrated BlobDB does not have) or are
tied to the legacy BlobDB's design of writing blob files synchronously when
a write API is called. Such statistics are marked "legacy BlobDB only" in
`statistics.h`.

Fixes https://github.com/facebook/rocksdb/issues/8645 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667

Test Plan: Ran `make check` and tested the new statistics using `db_bench`.

Reviewed By: riversand963

Differential Revision: D30356884

Pulled By: ltamasi

fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4
2021-08-17 17:22:31 -07:00
anand76 add68bd28a Add a stat to count secondary cache hits (#8666)
Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666

Test Plan: Update a unit test in lru_cache_test

Reviewed By: zhichao-cao

Differential Revision: D30353816

Pulled By: anand1976

fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77
2021-08-16 21:01:14 -07:00
Jay Zhuang c55460c734 Add property `LiveSstFilesSizeAtTemperature` for tiered storage (#8644)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8644

Reviewed By: siying, zhichao-cao

Differential Revision: D30236535

Pulled By: jay-zhuang

fbshipit-source-id: 1758d1c46d83a5087560fb63d53a016bf999da81
2021-08-15 14:17:45 -07:00
Merlin Mao 74a652a45f Code cleanup for trace replayer (#8652)
Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652

Reviewed By: ajkr

Differential Revision: D30276872

Pulled By: autopear

fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37
2021-08-12 09:22:43 -07:00
Merlin Mao f58d276764 Make TraceRecord and Replayer public (#8611)
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.

User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.

Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611

Reviewed By: ajkr

Differential Revision: D30266329

Pulled By: autopear

fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
2021-08-11 19:32:46 -07:00
Jay Zhuang 87e2358736 Add suggestion for btrfs user to disable preallocation (#8646)
Summary:
Add comment for `options.allow_fallocate` that btrfs
preallocated space are not freed and a suggestion to disable
preallocation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8646

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D30240050

Pulled By: jay-zhuang

fbshipit-source-id: 75b7190bc8276ce8d8ac2d0cb9064b386cbf4768
2021-08-11 14:53:37 -07:00
sdong e7c24168d8 Move old files to warm tier in FIFO compactions (#8310)
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310

Test Plan: Add several unit tests.

Reviewed By: ajkr

Differential Revision: D28493792

fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
2021-08-09 12:51:14 -07:00
Levi Tamasi 87882736ef Fix the sorting of KeyContexts for batched MultiGet (#8633)
Summary:
`CompareKeyContext::operator()` on the trunk has a bug: when comparing
column family IDs, `lhs` is used for both sides of the comparison. This
results in the `KeyContext`s getting sorted solely based on key, which
in turn means that keys with the same column family do not necessarily
form a single range in the sorted list. This violates an assumption of the
batched `MultiGet` logic, leading to the same column family
showing up multiple times in the list of `MultiGetColumnFamilyData`.
The end result is the code attempting to check out the thread-local
`SuperVersion` for the same CF multiple times, causing an
assertion violation in debug builds and memory corruption/crash in
release builds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D30169182

Pulled By: ltamasi

fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023
2021-08-06 16:27:42 -07:00
Peter Dillinger a7fd1d0881 Make backup restore atomic, with sync option (#8568)
Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.

Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).

Fixes https://github.com/facebook/rocksdb/issues/8500

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568

Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)

Reviewed By: akankshamahajan15

Differential Revision: D29812605

Pulled By: pdillinger

fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e
2021-08-06 09:50:21 -07:00
Akanksha Mahajan fd2079938d Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620)
Summary:
Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D30091319

Pulled By: akankshamahajan15

fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c
2021-08-05 19:44:51 -07:00
Yanqin Jin b01a428d9b Update HISTORY for PR8585 (#8623)
Summary:
Update HISTORY.md for PR https://github.com/facebook/rocksdb/issues/8585 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8623

Reviewed By: ltamasi

Differential Revision: D30121910

Pulled By: riversand963

fbshipit-source-id: 525af43fad908a498f22ed4f934ec5cbf60e6d25
2021-08-04 18:45:52 -07:00
Andrew Kryczka a685a701ca Do not attempt to rename non-existent info log (#8622)
Summary:
Previously we attempted to rename "LOG" to "LOG.old.*" without checking
its existence first. "LOG" had no reason to exist in a new DB.

Errors in renaming a non-existent "LOG" were swallowed via
`PermitUncheckedError()` so things worked. However the storage service's
error monitoring was detecting all these benign rename failures. So it
is better to fix it. Also with this PR we can now distinguish rename failure
for other reasons and return them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622

Test Plan: new unit test

Reviewed By: akankshamahajan15

Differential Revision: D30115189

Pulled By: ajkr

fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170
2021-08-04 17:25:00 -07:00
Levi Tamasi 08af0ae3f0 Mention PR 8605 in HISTORY.md (#8619)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8619

Reviewed By: riversand963

Differential Revision: D30081937

Pulled By: ltamasi

fbshipit-source-id: 57505957ae2c22d4b194aa28cb3fd261b3b39919
2021-08-03 16:15:12 -07:00
Akanksha Mahajan 8b2f60b668 Cache warming blocks during flush (#8561)
Summary:
Insert warm blocks  (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D29773411

Pulled By: akankshamahajan15

fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66
2021-08-03 12:44:15 -07:00
Mikhail Golubev 8f52972cf9 Allow to use a string as a delimiter in StringAppendOperator (#8536)
Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536

Reviewed By: mrambacher

Differential Revision: D29962120

Pulled By: zhichao-cao

fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
2021-08-02 16:50:41 -07:00