Commit Graph

1002 Commits

Author SHA1 Message Date
anand76 799cf37cb1 Update HISTORY and version to 6.21 (#8363)
Summary:
Update HISTORY and version to 6.21 on master.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8363

Reviewed By: jay-zhuang

Differential Revision: D28888818

Pulled By: anand1976

fbshipit-source-id: 9e5fac3b99ecc9f3b7d9f21474a39fa50decb117
2021-06-03 19:32:14 -07:00
Peter Dillinger 956ce9bde2 Some API clarification for manual compaction and listeners (#8330)
Summary:
Avoid people hitting bugs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8330

Test Plan: comments only

Reviewed By: siying

Differential Revision: D28683157

Pulled By: pdillinger

fbshipit-source-id: 2b34d3efb5e2fa34bea93d54c940cbd425212d25
2021-05-26 08:14:38 -07:00
Peter Dillinger 3469d60fcc Add table properties for number of entries added to filters (#8323)
Summary:
With Ribbon filter work and possible variance in actual bits
per key (or prefix; general term "entry") to achieve certain FP rates,
I've received a request to be able to track actual bits per key in
generated filters. This change adds a num_filter_entries table
property, which can be combined with filter_size to get bits per key
(entry).

This can vary from num_entries in at least these ways:
* Different versions of same key are only counted once in filters.
* With prefix filters, several user keys map to the same filter entry.
* A single filter can include both prefixes and user keys.

Note that FilterBlockBuilder::NumAdded() didn't do anything useful
except distinguish empty from non-empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323

Test Plan: basic unit test included, others updated

Reviewed By: jay-zhuang

Differential Revision: D28596210

Pulled By: pdillinger

fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376
2021-05-21 17:11:32 -07:00
Jay Zhuang 6c86543590 Fix manual compaction `max_compaction_bytes` under-calculated issue (#8269)
Summary:
Fix a bug that for manual compaction, `max_compaction_bytes` is only
limit the SST files from input level, but not overlapped files on output
level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8269

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D28231044

Pulled By: jay-zhuang

fbshipit-source-id: 9d7d03004f30cc4b1b9819830141436907554b7c
2021-05-21 14:03:44 -07:00
sdong bd3d080ef8 Try to build with liburing by default. (#8322)
Summary:
By default, try to build with liburing. For make, if ROCKSDB_USE_IO_URING is not set, treat as 1, which means RocksDB will try to build with liburing. For cmake, add WITH_LIBURING to control it, with default on.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8322

Test Plan: Build using cmake and make.

Reviewed By: anand1976

Differential Revision: D28586498

fbshipit-source-id: cfd39159ab697f4b93a9293a59c07f839b1e7ed5
2021-05-21 10:21:53 -07:00
sdong 2f1984dd45 Compare memtable insert and flush count (#8288)
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288

Test Plan: Pass all existing tests

Reviewed By: ajkr

Differential Revision: D28369194

fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
2021-05-20 16:07:28 -07:00
Jay Zhuang 3786181a90 Add remote compaction public API (#8300)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8300

Reviewed By: ajkr

Differential Revision: D28464726

Pulled By: jay-zhuang

fbshipit-source-id: 49e9f4fb791808a6cbf39a7b1a331373f645fc5e
2021-05-19 21:41:31 -07:00
Peter Dillinger 311a544c2a Use deleters to label cache entries and collect stats (#8297)
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.

The new information can be seen in info LOG, for example:

    Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
    Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)

And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):

    $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
    rocksdb.block-cache-entry-stats.bytes.data-block: 0
    rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.bytes.index-block: 178992
    rocksdb.block-cache-entry-stats.bytes.misc: 0
    rocksdb.block-cache-entry-stats.bytes.other-block: 0
    rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
    rocksdb.block-cache-entry-stats.capacity: 8388608
    rocksdb.block-cache-entry-stats.count.data-block: 0
    rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.count.index-block: 215
    rocksdb.block-cache-entry-stats.count.misc: 1
    rocksdb.block-cache-entry-stats.count.other-block: 0
    rocksdb.block-cache-entry-stats.count.write-buffer: 0
    rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
    rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
    rocksdb.block-cache-entry-stats.percent.misc: 0.000000
    rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
    rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
    rocksdb.block-cache-entry-stats.secs_since_last_collection: 0

Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.

This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.

An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".

Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.

Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.

This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297

Test Plan: manual testing with db_bench, and a couple of basic unit tests

Reviewed By: ltamasi

Differential Revision: D28488721

Pulled By: pdillinger

fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 16:51:13 -07:00
anand76 9d61a0856d Sync ingested files only if reopen is supported by the FS (#8296)
Summary:
Some file systems (especially distributed FS) do not support reopening a file for writing. The ExternalSstFileIngestionJob calls ReopenWritableFile in order to sync the ingested file, which typically makes sense only on a local file system with a page cache (i.e Posix). So this change tries to sync the ingested file only if ReopenWritableFile doesn't return Status::NotSupported().

Tests:
Add a new unit test in external_sst_file_basic_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8296

Reviewed By: jay-zhuang

Differential Revision: D28420865

Pulled By: anand1976

fbshipit-source-id: 380e7f5ff95324997f7a59864a9ac96ebbd0100c
2021-05-18 19:33:55 -07:00
sdong 60e5af83c1 Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe() (#8311)
Summary:
Right now return codes by io_uring_submit_and_wait() and io_uring_wait_cqe() are not handled. It is not the good practice. Although these two functions are not supposed to return non-0 values in normal exeuction, people suspect that they might return non-0 value when an interruption happens, and the code might cause hanging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8311

Test Plan: Make sure at least normal test cases still pass.

Reviewed By: anand1976

Differential Revision: D28500828

fbshipit-source-id: 8a76cea9cafbd041102e0b6a8eef9d0bfed7c211
2021-05-18 16:09:14 -07:00
Peter Dillinger 78a309bf86 New Cache API for gathering statistics (#8225)
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:

* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)

I have enhanced cache_bench to validate this approach:

* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.

To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225

Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.

The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.

Baseline typical output:

```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662

Operation latency (ns):
Count: 80000000 Average: 11223.9494  StdDev: 29.61
Min: 0  Median: 7759.3973  Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[       0,       1 ]       68   0.000%   0.000%
(    2900,    4400 ]       89   0.000%   0.000%
(    4400,    6600 ] 33630240  42.038%  42.038% ########
(    6600,    9900 ] 18129842  22.662%  64.700% #####
(    9900,   14000 ]  7877533   9.847%  74.547% ##
(   14000,   22000 ] 15193238  18.992%  93.539% ####
(   22000,   33000 ]  3037061   3.796%  97.335% #
(   33000,   50000 ]  1626316   2.033%  99.368%
(   50000,   75000 ]   421532   0.527%  99.895%
(   75000,  110000 ]    56910   0.071%  99.966%
(  110000,  170000 ]    16134   0.020%  99.986%
(  170000,  250000 ]     5166   0.006%  99.993%
(  250000,  380000 ]     3017   0.004%  99.996%
(  380000,  570000 ]     1337   0.002%  99.998%
(  570000,  860000 ]      805   0.001%  99.999%
(  860000, 1200000 ]      319   0.000% 100.000%
( 1200000, 1900000 ]      231   0.000% 100.000%
( 1900000, 2900000 ]      100   0.000% 100.000%
( 2900000, 4300000 ]       39   0.000% 100.000%
( 4300000, 6500000 ]       16   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
```

New, gather_stats=false. Median thread ops/sec of 5 runs:

```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458

Operation latency (ns):
Count: 80000000 Average: 11298.1027  StdDev: 42.18
Min: 0  Median: 7722.0822  Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[       0,       1 ]      109   0.000%   0.000%
(    2900,    4400 ]      793   0.001%   0.001%
(    4400,    6600 ] 34054563  42.568%  42.569% #########
(    6600,    9900 ] 17482646  21.853%  64.423% ####
(    9900,   14000 ]  7908180   9.885%  74.308% ##
(   14000,   22000 ] 15032072  18.790%  93.098% ####
(   22000,   33000 ]  3237834   4.047%  97.145% #
(   33000,   50000 ]  1736882   2.171%  99.316%
(   50000,   75000 ]   446851   0.559%  99.875%
(   75000,  110000 ]    68251   0.085%  99.960%
(  110000,  170000 ]    18592   0.023%  99.983%
(  170000,  250000 ]     7200   0.009%  99.992%
(  250000,  380000 ]     3334   0.004%  99.997%
(  380000,  570000 ]     1393   0.002%  99.998%
(  570000,  860000 ]      700   0.001%  99.999%
(  860000, 1200000 ]      293   0.000% 100.000%
( 1200000, 1900000 ]      196   0.000% 100.000%
( 1900000, 2900000 ]       69   0.000% 100.000%
( 2900000, 4300000 ]       32   0.000% 100.000%
( 4300000, 6500000 ]       10   0.000% 100.000%
```

New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:

```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551

Operation latency (ns):
Count: 80000000 Average: 11311.2629  StdDev: 45.28
Min: 0  Median: 7686.5458  Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[       0,       1 ]       71   0.000%   0.000%
(    2900,    4400 ]      291   0.000%   0.000%
(    4400,    6600 ] 34492060  43.115%  43.116% #########
(    6600,    9900 ] 16727328  20.909%  64.025% ####
(    9900,   14000 ]  7845828   9.807%  73.832% ##
(   14000,   22000 ] 15510654  19.388%  93.220% ####
(   22000,   33000 ]  3216533   4.021%  97.241% #
(   33000,   50000 ]  1680859   2.101%  99.342%
(   50000,   75000 ]   439059   0.549%  99.891%
(   75000,  110000 ]    60540   0.076%  99.967%
(  110000,  170000 ]    14649   0.018%  99.985%
(  170000,  250000 ]     5242   0.007%  99.991%
(  250000,  380000 ]     3260   0.004%  99.995%
(  380000,  570000 ]     1599   0.002%  99.997%
(  570000,  860000 ]     1043   0.001%  99.999%
(  860000, 1200000 ]      471   0.001%  99.999%
( 1200000, 1900000 ]      275   0.000% 100.000%
( 1900000, 2900000 ]      143   0.000% 100.000%
( 2900000, 4300000 ]       60   0.000% 100.000%
( 4300000, 6500000 ]       27   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
( 9800000, 14000000 ]        1   0.000% 100.000%

Gather stats latency (us):
Count: 46 Average: 980387.5870  StdDev: 60911.18
Min: 879155  Median: 1033777.7778  Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
(  860000, 1200000 ]       45  97.826%  97.826% ####################
( 1200000, 1900000 ]        1   2.174% 100.000%

Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```

Reviewed By: mrambacher

Differential Revision: D28295742

Pulled By: pdillinger

fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
2021-05-11 16:17:10 -07:00
mrambacher 9f2d255aed Add ObjectRegistry to ConfigOptions (#8166)
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used

The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166

Reviewed By: zhichao-cao

Differential Revision: D27657952

Pulled By: mrambacher

fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
2021-05-11 06:47:22 -07:00
mrambacher ff463742b5 Add Merge Operator support to WriteBatchWithIndex (#8135)
Summary:
The WBWI has two differing modes of operation dependent on the value
of the constructor parameter `overwrite_key`.
Currently, regardless of the parameter, neither mode performs as
expected when using Merge. This PR remedies this by correctly invoking
the appropriate Merge Operator before returning results from the WBWI.

Examples of issues that exist which are solved by this PR:

## Example 1 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.

## Example 2 with o`verwrite_key=true`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 3 with `overwrite_key=false`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 4 with `overwrite_key=true`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v1')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 5 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 6 with `overwrite_key=true`
Currently, from an empty database, `('k1' -> 'v1')`, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135

Reviewed By: pdillinger

Differential Revision: D27657938

Pulled By: mrambacher

fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
2021-05-10 12:50:25 -07:00
sdong f89a53655d Change date format in HISTORY.md (#8278)
Summary:
Per previous discussion, change date format in HISTORY.md to follow ISO 8601.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8278

Reviewed By: jay-zhuang

Differential Revision: D28294022

fbshipit-source-id: 563f29c56143519b4a871df82a17dd0a168a578c
2021-05-07 16:16:30 -07:00
Andrew Kryczka a639c02f8e Allow applying `CompactionFilter` outside of compaction (#8243)
Summary:
From HISTORY.md release note:

- Allow `CompactionFilter`s to apply in more table file creation scenarios such as flush and recovery. For compatibility, `CompactionFilter`s by default apply during compaction. Users can customize this behavior by overriding `CompactionFilterFactory::ShouldFilterTableFileCreation()`.
- Removed unused structure `CompactionFilterContext`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8243

Test Plan: added unit tests

Reviewed By: pdillinger

Differential Revision: D28088089

Pulled By: ajkr

fbshipit-source-id: 0799be7908e3b39fea09fc3f1ab00e13ad817fae
2021-05-07 16:01:40 -07:00
Peter Dillinger c26b75baa5 Deprecate obsolete "backupable db" from public APIs (#8274)
Summary:
An early design of BackupEngine used stackable DB, so I guess a
DB had to opt-in to being backupable. Unfortunately the naming of that
obsolete design still infects our public API and implementation.

This change fixes the public API, with a deprecated
backward-compatibility header. `BackupableDBOptions` is renamed to
`BackupEngineOptions` (copy-replace in the public header) and
backup_engine.h replaces backupable_db.h (present for backward
compatibility). The only other change in backupable_db.h ->
backup_engine.h is cleaning up headers.

Later changes will fix the internal implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274

Test Plan:
The internal implementation of BackupEngine uses the name
BackupEngineOptions, while the unit tests use the old name
BackupableDBOptions. This gives me confidence that both still work.

Reviewed By: mrambacher

Differential Revision: D28259471

Pulled By: pdillinger

fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613
2021-05-07 13:53:15 -07:00
sdong a4919d6b62 Cap automatic arena block size to 1 MB (#7907)
Summary:
Larger arena block size does provide the benefit of reducing allocation overhead, however it may cause other troubles. For example, allocator is more likely not to allocate them to physical memory and trigger page fault. Weighing the risk, we cap the arena block size to 1MB. Users can always use a larger value if they want.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7907

Test Plan: Run all existing tests

Reviewed By: pdillinger

Differential Revision: D26135269

fbshipit-source-id: b7f55afd03e6ee1d8715f90fa11b6c33944e9ea8
2021-05-07 13:15:34 -07:00
Andrew Kryczka 0f42e50fec Fix `GetLiveFiles()` returning OPTIONS-000000 (#8268)
Summary:
See release note in HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268

Test Plan: unit test repro

Reviewed By: siying

Differential Revision: D28227901

Pulled By: ajkr

fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339
2021-05-05 12:54:46 -07:00
Peter Dillinger 3b981eaa1d Fix use-after-free threading bug in ClockCache (#8261)
Summary:
In testing for https://github.com/facebook/rocksdb/issues/8225 I found cache_bench would crash with
-use_clock_cache, as well as db_bench -use_clock_cache, but not
single-threaded. Smaller cache size hits failure much faster. ASAN
reported the failuer as calling malloc_usable_size on the `key` pointer
of a ClockCache handle after it was reportedly freed. On detailed
inspection I found this bad sequence of operations for a cache entry:

state=InCache=1,refs=1
[thread 1] Start ClockCacheShard::Unref (from Release, no mutex)
[thread 1] Decrement ref count
state=InCache=1,refs=0
[thread 1] Suspend before CalcTotalCharge (no mutex)

[thread 2] Start UnsetInCache (from Insert, mutex held)
[thread 2] clear InCache bit
state=InCache=0,refs=0
[thread 2] Calls RecycleHandle (based on pre-updated state)
[thread 2] Returns to Insert which calls Cleanup which deletes `key`

[thread 1] Resume ClockCacheShard::Unref
[thread 1] Read `key` in CalcTotalCharge

To fix this, I've added a field to the handle to store the metadata
charge so that we can efficiently remember everything we need from
the handle in Unref. We must not read from the handle again if we
decrement the count to zero with InCache=1, which means we don't own
the entry and someone else could eject/overwrite it immediately.

Note before this change, on amd64 sizeof(Handle) == 56 even though there
are only 48 bytes of data. Grouping together the uint32_t fields would
cut it down to 48, but I've added another uint32_t, which takes it
back up to 56. Not a big deal.

Also fixed DisownData to cooperate with ASAN as in LRUCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8261

Test Plan:
Manual + adding use_clock_cache to db_crashtest.py

Base performance
./cache_bench -use_clock_cache
Complete in 17.060 s; QPS = 2458513
New performance
./cache_bench -use_clock_cache
Complete in 17.052 s; QPS = 2459695

Any difference is easily buried in small noise.

Crash test shows still more bug(s) in ClockCache, so I'm expecting to
disable ClockCache from production code in a follow-up PR (if we
can't find and fix the bug(s))

Reviewed By: mrambacher

Differential Revision: D28207358

Pulled By: pdillinger

fbshipit-source-id: aa7a9322afc6f18f30e462c75dbbe4a1206eb294
2021-05-04 22:18:00 -07:00
Peter Dillinger d2ca04e3ed Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.

To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.

I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)

At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)

Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246

Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost

Reviewed By: mrambacher

Differential Revision: D28099346

Pulled By: pdillinger

fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
2021-04-30 13:50:13 -07:00
Peter Dillinger 85becd94c1 Refactor: use TableBuilderOptions to reduce parameter lists (#8240)
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.

Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.

Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).

Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240

Test Plan: ASAN make check

Reviewed By: mrambacher

Differential Revision: D28075891

Pulled By: pdillinger

fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
2021-04-29 07:00:50 -07:00
Akanksha Mahajan a0e0feca62 Improve BlockPrefetcher to prefetch only for sequential scans (#7394)
Summary:
BlockPrefetcher is used by iterators to prefetch data if they
anticipate more data to be used in future and this is valid for forward sequential
scans. But BlockPrefetcher tracks only num_file_reads_ and not if reads
are sequential. This presents problem for MultiGet with large number of
keys when it reseeks index iterator and data block. FilePrefetchBuffer
can end up doing large readahead for reseeks as readahead size
increases exponentially once readahead is enabled. Same issue is with
BlockBasedTableIterator.

Add previous length and offset read as well in BlockPrefetcher (creates
FilePrefetchBuffer) and FilePrefetchBuffer (does prefetching of data) to
determine if reads are sequential and then  prefetch.

Update the last block read after cache hit to take reads from cache also
in account.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7394

Test Plan: Add new unit test case

Reviewed By: anand1976

Differential Revision: D23737617

Pulled By: akankshamahajan15

fbshipit-source-id: 8e6917c25ed87b285ee495d1b68dc623d71205a3
2021-04-28 12:53:46 -07:00
Zhichao Cao 09a9ec3ac0 Fix the false positive alert of CF consistency check in WAL recovery (#8207)
Summary:
In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue.

Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here.

Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207

Test Plan: added unit test, make crash_test

Reviewed By: riversand963

Differential Revision: D27898839

Pulled By: zhichao-cao

fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
2021-04-22 10:28:37 -07:00
Akanksha Mahajan 596e9008e4 Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898)
Summary:
When WriteBufferManager is shared across DBs and column families
to maintain memory usage under a limit, OOMs have been observed when flush cannot
finish but writes continuously insert to memtables.
In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs tries to write,
this change will stall incoming writers until flush is completed and memory_usage
drops.

Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_
(memory_usage() >= buffer_size_) WriterBufferManager::ShouldStall() returns true.

DBImpl first block incoming/future writers by calling write_thread_.BeginWriteStall()
(which adds dummy stall object to the writer's queue).
Then DB is blocked on a state State::Blocked (current write doesn't go
through). WBStallInterface object maintained by every DB instance is added to the queue of
WriteBufferManager.

If multiple DBs tries to write during this stall, they will also be
blocked when check WriteBufferManager::ShouldStall() returns true.

End Stall condition: When flush is finished and memory usage goes down, stall will end only if memory
waiting to be flushed is less than buffer_size/2. This lower limit will give time for flush
to complete and avoid continous stalling if memory usage remains close to buffer_size.

WriterBufferManager::EndWriteStall() is called,
which removes all instances from its queue and signal them to continue.
Their state is changed to State::Running and they are unblocked. DBImpl
then signal all incoming writers of that DB to continue by calling
write_thread_.EndWriteStall() (which removes dummy stall object from the
queue).

DB instance creates WBMStallInterface which is an interface to block and
signal DBs during stall.
When DB needs to be blocked or signalled by WriteBufferManager,
state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898

Test Plan: Added a new test db/db_write_buffer_manager_test.cc

Reviewed By: anand1976

Differential Revision: D26093227

Pulled By: akankshamahajan15

fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
2021-04-21 13:54:02 -07:00
Peter Dillinger 95f6add746 Revert Ribbon starting level support from #8198 (#8212)
Summary:
This partially reverts commit 10196d7edc.

The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.

We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212

Test Plan: CI

Reviewed By: jay-zhuang

Differential Revision: D27896468

Pulled By: pdillinger

fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e
2021-04-20 19:46:40 -07:00
Andrew Kryczka 905dd17b35 Fix seqno in ingested file boundary key metadata (#8209)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/6245.

Adapted from https://github.com/facebook/rocksdb/issues/8201 and https://github.com/facebook/rocksdb/issues/8205.

Previously we were writing the ingested file's smallest/largest internal keys
with sequence number zero, or `kMaxSequenceNumber` in case of range
tombstone. The former (sequence number zero) is incorrect and can lead
to files being incorrectly ordered. The fix in this PR is to overwrite
boundary keys that have sequence number zero with the ingested file's assigned
sequence number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8209

Test Plan: repro unit test

Reviewed By: riversand963

Differential Revision: D27885678

Pulled By: ajkr

fbshipit-source-id: 4a9f2c6efdfff81c3a9923e915ea88b250ee7b6a
2021-04-20 14:00:21 -07:00
Levi Tamasi 1b99947e99 Mention PR 8206 in HISTORY.md (#8210)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8210

Reviewed By: akankshamahajan15

Differential Revision: D27887612

Pulled By: ltamasi

fbshipit-source-id: 0db8d0b6047334dc47fe30a98804449043454386
2021-04-20 12:07:40 -07:00
Yanqin Jin a376c22066 Handle rename() failure in non-local FS (#8192)
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.

This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.

As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
  MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
  code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
  POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
  new MANIFEST.) Therefore, we keep the new MANIFEST.
    - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
    - If process reopens the db immediately after the failure, then the CURRENT file can point
      to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
      succeed and ignore the other.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D27804648

Pulled By: riversand963

fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
2021-04-19 18:11:13 -07:00
Akanksha Mahajan 531a5f88a1 Update release version to 6.20 (#8199)
Summary:
Update release version to 6.20

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8199

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D27838750

Pulled By: akankshamahajan15

fbshipit-source-id: f02f722fc6bdd37d626d47a0e932bbecea3507a8
2021-04-16 20:15:36 -07:00
Peter Dillinger 10196d7edc Ribbon long-term support, starting level support (#8198)
Summary:
Since the Ribbon filter schema seems good (compatible back to
6.15.0), this change commits to long term support of the SST schema,
even though we expect the API for enabling Ribbon to change (still
called NewExperimentalRibbonFilterPolicy).

This also adds support for "hybrid" configuration in which some levels
use Bloom (higher levels, lower numbered) for speed and the rest use
Ribbon (lower levels, higher numbered) for memory space efficiency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198

Test Plan: unit test added, crash test support

Reviewed By: jay-zhuang

Differential Revision: D27831232

Pulled By: pdillinger

fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab
2021-04-16 15:43:08 -07:00
Akanksha Mahajan 296b47db25 Extend file_checksum_dump ldb command and DB::GetLiveFilesChecksumInfo to blob files (#8179)
Summary:
Extend the DB::GetLiveFilesChecksumInfo API to blob files.
This API is also used by the file_checksum_dump ldb command to dump checksum
of SST files which now also dumps blob files checksum.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8179

Test Plan: Add new unit test

Reviewed By: zhichao-cao

Differential Revision: D27714965

Pulled By: akankshamahajan15

fbshipit-source-id: d8b7343ea845a64c83800336d88cced7152a8c92
2021-04-15 09:38:13 -07:00
Justin Chapman d89483098f Assert unlimited max_open_files for FIFO compaction. (#8172)
Summary:
Resolves https://github.com/facebook/rocksdb/issues/8014

- Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used.
- This is to align with what the docs mention and to prevent premature data deletion.
- Update tests to work with this assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172

Test Plan:
```bash
$ make check -j$(nproc)

Generated TARGETS Summary:
- 6 libs
- 0 binarys
- 180 tests
```

Reviewed By: ajkr

Differential Revision: D27768792

Pulled By: thejchap

fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3
2021-04-14 12:05:47 -07:00
Yanqin Jin fd00f39f97 Disable IOStatsContext/PerfContext if no thread local (#8117)
Summary:
Before this PR, `get_iostats_context()` will silently return a nullptr if no thread_local support is detected.
This can be the result of build_detect_platform's failure to compile the simple code snippet on certain platforms, as
reported in https://github.com/facebook/mysql-5.6/issues/904.
To be safe, we should fail the compilation if user does not opt out IOStatsContext and
ROCKSDB_SUPPORT_THREAD_LOCAL is not defined.

If RocksDB relies on c++11, can we just always use thread_local? It turns out there might be
performance concerns (https://github.com/facebook/rocksdb/issues/5774),
which is beyond the scope of this PR. We can revisit this later. Here, we stick to the original impl.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8117

Reviewed By: ajkr

Differential Revision: D27356847

Pulled By: riversand963

fbshipit-source-id: f7d5776842277598d8341b955febb601946801ae
2021-04-13 07:56:59 -07:00
Peter Dillinger bb75092574 Misc Backup API enhancements (#8170)
Summary:
* CreateNewBackup(WithMetadata) returning the BackupID of new backup
through optional new output param. This is especially useful with the
new mutithreading support, so that you can transactionally determine the
ID of a backup you create.
* GetBackupInfo / GetLatestBackupInfo for individual backups, so that
you don't have to comb through a vector of backups if you don't want to.

Updated HISTORY.md (including re: BlobDB support as new feature)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8170

Test Plan:
Added test logic to existing tests, to minimize increase in
cost of running tests

Reviewed By: zhichao-cao

Differential Revision: D27680410

Pulled By: pdillinger

fbshipit-source-id: 1fc45b73d81aae293ccd4a43d9583d7fd915d3eb
2021-04-12 11:00:47 -07:00
Giuseppe Ottaviano 48cd7a3aae Fix flush reason attribution (#8150)
Summary:
Current flush reason attribution is misleading or incorrect (depending on what the original intention was):

- Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager`
- Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull`

This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150

Reviewed By: zhichao-cao

Differential Revision: D27569645

Pulled By: ot

fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc
2021-04-07 23:18:37 -07:00
Akanksha Mahajan d52b520d51 Integrated BlobDB for backup/restore support (#8129)
Summary:
Add support for blob files for backup/restore like table files.
    Since DB session ID is currently not supported for blob files (there is no place to store it in
    the header), so for blob files uses the
    kLegacyCrc32cAndFileSize naming scheme even if
    share_files_with_checksum_naming is set to kUseDbSessionId.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8129

Test Plan: Add new test units

Reviewed By: ltamasi

Differential Revision: D27408510

Pulled By: akankshamahajan15

fbshipit-source-id: b27434d189a639ef3e6ad165c61a143a2daaf06e
2021-04-07 13:38:54 -07:00
Peter Dillinger 879357fdb0 Make backups openable as read-only DBs (#8142)
Summary:
A current limitation of backups is that you don't know the
exact database state of when the backup was taken. With this new
feature, you can at least inspect the backup's DB state without
restoring it by opening it as a read-only DB.

Rather than add something like OpenAsReadOnlyDB to the BackupEngine API,
which would inhibit opening stackable DB implementations read-only
(if/when their APIs support it), we instead provide a DB name and Env
that can be used to open as a read-only DB.

Possible follow-up work:

* Add a version of GetBackupInfo for a single backup.
* Let CreateNewBackup return the BackupID of the newly-created backup.

Implementation details:

Refactored ChrootFileSystem to split off new base class RemapFileSystem,
which allows more general remapping of files. We use this base class to
implement BackupEngineImpl::RemapSharedFileSystem.

To minimize API impact, I decided to just add these fields `name_for_open`
and `env_for_open` to those set by GetBackupInfo when
include_file_details=true. Creating the RemapSharedFileSystem adds a bit
to the memory consumption, perhaps unnecessarily in some cases, but this
has been mitigated by (a) only initialize the RemapSharedFileSystem
lazily when GetBackupInfo with include_file_details=true is called, and
(b) using the existing `shared_ptr<FileInfo>` objects to hold most of the
mapping data.

To enhance API safety, RemapSharedFileSystem is wrapped by new
ReadOnlyFileSystem which rejects any attempts to write. This uncovered a
couple of places in which DB::OpenForReadOnly would write to the
filesystem, so I fixed these. Added a release note because this affects
logging.

Additional minor refactoring in backupable_db.cc to support the new
functionality.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142

Test Plan:
new test (run with ASAN and UBSAN), added to stress test and
ran it for a while with amplified backup_one_in

Reviewed By: ajkr

Differential Revision: D27535408

Pulled By: pdillinger

fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148
2021-04-06 14:37:53 -07:00
Yanqin Jin 09528f9fa1 Fix a bug for SeekForPrev with partitioned filter and prefix (#8137)
Summary:
According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last
key in the previous partition" so that SeekForPrev() in prefix mode can return correct result.
The prefix of the last key in the previous partition does not necessarily have the same prefix
as the first key in the current partition. Regardless of the first key in current partition, the
prefix of the last key in the previous partition should be added. The existing code, however,
does not follow this. Furthermore, there is another issue: when finishing current filter partition,
`FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively
overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds
to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding
the bloom of the prefix of the last key of the previous partition.

Prefix extractor is FixedLength.2.
```
[  filter part 1   ]    [  filter part 2    ]
                  abc    d
```
When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc"
but smaller than "d".
If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build
incorrect iterator tree in non-total-order mode.

Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion
error when checking on an arbitrary key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137

Test Plan:
```
make check
```

Without this fix, the following command will fail pretty soon.
```
./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \
--avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \
--batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \
--bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \
--checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \
--compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \
--compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \
--compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \
--continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \
--db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \
--enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \
--format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \
--get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \
--iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \
--log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \
--max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \
--max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \
--max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \
--nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \
--periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --read_fault_one_in=0 --read_only=0 \
--readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \
--snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \
--sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \
--target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \
--top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \
--use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \
--use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \
--verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \
--write_dbid_to_manifest=1 --writepercent=35
```

Reviewed By: pdillinger

Differential Revision: D27553054

Pulled By: riversand963

fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98
2021-04-06 12:14:08 -07:00
Akanksha Mahajan 689b13e639 Add request_id in IODebugContext. (#8045)
Summary:
Add request_id in IODebugContext which will be populated by
    underlying FileSystem for IOTracing purposes. Update IOTracer to trace
    request_id in the tracing records. Provided API
    IODebugContext::SetRequestId which will set the request_id and enable
    tracing for request_id. The API hides the implementation and underlying
    file system needs to call this API directly.

Update DB::StartIOTrace API and remove redundant Env* from the
    argument as its not used and DB already has Env that is passed down to
    IOTracer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045

Test Plan: Update unit test.

Differential Revision: D26899871

Pulled By: akankshamahajan15

fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b
2021-04-01 13:14:51 -07:00
Andrew Kryczka c43a37a922 Fix compression dictionary sampling with dedicated range tombstone SSTs (#8141)
Summary:
Return early in case there are zero data blocks when
`BlockBasedTableBuilder::EnterUnbuffered()` is called. This crash can
only be triggered by applying dictionary compression to SST files that
contain only range tombstones. It cannot be triggered by a low buffer
limit alone since we only consider entering unbuffered mode after
buffering a data block causing the limit to be breached, or `Finish()`ing the file. It also cannot
be triggered by a totally empty file because those go through
`Abandon()` rather than `Finish()` so unbuffered mode is never entered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8141

Test Plan: added a unit test that repro'd the "Floating point exception"

Reviewed By: riversand963

Differential Revision: D27495640

Pulled By: ajkr

fbshipit-source-id: a463cfba476919dc5c5c380800a75a86c31ffa23
2021-04-01 05:08:17 -07:00
Andrew Kryczka 1ba2b8a568 Add sample_for_compression results to table properties (#8139)
Summary:
Added `TableProperties::{fast,slow}_compression_estimated_data_size`.
These properties are present in block-based tables when
`ColumnFamilyOptions::sample_for_compression > 0` and the necessary
compression library is supported when the file is generated. They
contain estimates of what `TableProperties::data_size` would be if the
"fast"/"slow" compression library had been used instead. One
limitation is we do not record exactly which "fast" (ZSTD or Zlib)
or "slow" (LZ4 or Snappy) compression library produced the result.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139

Test Plan:
- new unit test
- ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression

Reviewed By: riversand963

Differential Revision: D27454338

Pulled By: ajkr

fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4
2021-03-31 18:21:50 -07:00
sherriiiliu e6534900bd Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125)
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125

Test Plan: Added new test case which hangs without the fix.

Reviewed By: jay-zhuang

Differential Revision: D27390043

Pulled By: ajkr

fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
2021-03-30 18:35:20 -07:00
Peter Dillinger ec11c23caa Add thread safety to BackupEngine, explain more (#8115)
Summary:
BackupEngine previously had unclear but strict concurrency
requirements that the API user must follow for safe use. Now we make
that clear, by separating operations into "Read," "Append," and "Write"
operations, and specifying which combinations are safe across threads on
the same BackupEngine object (previously none; now all, using a
read-write lock), and which are safe across different BackupEngine
instances open on the same backup_dir.

The changes to backupable_db.h should be backward compatible. It is
mostly about eliminating copies of what should be the same function and
(unsurprisingly) useful documentation comments were often placed on
only one of the two copies. With the re-organization, we are also
grouping different categories of operations. In the future we might add
BackupEngineReadAppendOnly, but that didn't seem necessary.

To mark API Read operations 'const', I had to mark some implementation
functions 'const' and some fields mutable.

Functional changes:
* Added RWMutex locking around public API functions to implement thread
safety on a single object. To avoid future bugs, this is another
internal class layered on top (removing many "override" in
BackupEngineImpl). It would be possible to allow more concurrency
between operations, rather than mutual exclusion, but IMHO not worth the
work.
* Fixed a race between Open() (Initialize()) and CreateNewBackup() for
different objects on the same backup_dir, where Initialize() could
delete the temporary meta file created during CreateNewBackup().
(This was found by the new test.)

Also cleaned up a couple of "status checked" TODOs, and improved a
checksum mismatch error message to include involved files.

Potential follow-up work:
* CreateNewBackup has an API wart because it doesn't tell you the
BackupID it just created, which makes it of limited use in a multithreaded
setting.
* We could also consider a Refresh() function to catch up to
changes made from another BackupEngine object to the same dir.
* Use a lock file to prevent multiple writer BackupEngines, but this
won't work on remote filesystems not supporting lock files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115

Test Plan:
new mini-stress test in backup unit tests, run with gcc,
clang, ASC, TSAN, and UBSAN, 100 iterations each.

Reviewed By: ajkr

Differential Revision: D27347589

Pulled By: pdillinger

fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15
2021-03-29 22:41:51 -07:00
Jay Zhuang a037bb35e9 Compaction should not move data to up level (#8116)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8116

Reviewed By: ajkr, mrambacher

Differential Revision: D27353828

Pulled By: jay-zhuang

fbshipit-source-id: 42703fb01b04d92cc097d7979e64798448852e88
2021-03-29 17:10:42 -07:00
wolfkdy 63748c2204 On ARM platform, use yield op to relax CPU. See issue 7376 (#7438)
Summary:
see https://github.com/facebook/rocksdb/issues/7376.
The `wfe` op on ARM platform is not suitable to relax CPU. Use `yield` op.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7438

Reviewed By: riversand963

Differential Revision: D24063427

Pulled By: jay-zhuang

fbshipit-source-id: b0ebc5590d7555bd21b30f15cd59f84dc006367a
2021-03-26 18:13:24 -07:00
Andrew Kryczka c20a7cd6c7 Apply `sample_for_compression` to all block-based tables (#8105)
Summary:
Previously it only applied to block-based tables generated by flush. This restriction
was undocumented and blocked a new use case. Now compression sampling
applies to all block-based tables we generate when it is enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105

Test Plan: new unit test

Reviewed By: riversand963

Differential Revision: D27317275

Pulled By: ajkr

fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0
2021-03-25 15:00:45 -07:00
Jay Zhuang 45c65d6dcf Use thread-safe `strerror_r()` to get error message (#8087)
Summary:
`strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from 0deef031cb/folly/String.cpp (L457)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087

Reviewed By: mrambacher

Differential Revision: D27267151

Pulled By: jay-zhuang

fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb
2021-03-24 23:07:27 -07:00
Zhichao Cao 7457c7cd00 Update release version to 6.19 (#8083)
Summary:
Update release version to 6.19

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8083

Test Plan: no code change

Reviewed By: riversand963

Differential Revision: D27222083

Pulled By: zhichao-cao

fbshipit-source-id: 94b49997019347e6e6a9e341837f4f9d3149428c
2021-03-21 18:33:46 -07:00
Zhichao Cao dd0447ae2c Add new Append API with DataVerificationInfo to Env WritableFile (#8071)
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071

Test Plan: make check, added new unit test.

Reviewed By: anand1976

Differential Revision: D27177043

Pulled By: zhichao-cao

fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
2021-03-19 11:44:13 -07:00
Zhichao Cao c810947184 Separate handling of WAL Sync io error with SST flush io error (#8049)
Summary:
In previous codebase, if WAL is used, all the retryable IO Error will be treated as hard error. So write is stalled. In this PR, the retryable IO error from WAL sync is separated from SST file flush io error. If WAL Sync is ok and retryable IO Error only happens during SST flush, the error is mapped to soft error. So user can continue insert to Memtable and append to WAL.

Resolve the bug that if WAL sync fails, the memtable status does not roll back due to calling PickMemtable early than calling and checking SyncClosedLog.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049

Test Plan: added new unit test, make check

Reviewed By: anand1976

Differential Revision: D26965529

Pulled By: zhichao-cao

fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410
2021-03-18 14:33:16 -07:00
Peter Dillinger e7a60d01b2 Revamp WriteController (#8064)
Summary:
WriteController had a number of issues:
* It could introduce a delay of 1ms even if the write rate never exceeded the
configured delayed_write_rate.
* The DB-wide delayed_write_rate could be exceeded in a number of ways
with multiple column families:
  * Wiping all pending delay "debts" when another column family joins
  the delay with GetDelayToken().
  * Resetting last_refill_time_ to (now + sleep amount) means each
  column family can write with delayed_write_rate for large writes.
  * Updating bytes_left_ for a partial refill without updating
  last_refill_time_ would essentially give out random bonuses,
  especially to medium-sized writes.

Now the code is much simpler, with these issues fixed. See comments in
the new code and new (replacement) tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064

Test Plan: new tests, better than old tests

Reviewed By: mrambacher

Differential Revision: D27064936

Pulled By: pdillinger

fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f
2021-03-18 09:47:31 -07:00
Zhichao Cao 08ec5e7321 Add the statistics and info log for Error handler (#8050)
Summary:
Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050

Test Plan: make check and add test to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26990565

Pulled By: zhichao-cao

fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
2021-03-17 22:38:13 -07:00
Akanksha Mahajan 27d57a035e Use SST file manager to track blob files as well (#8037)
Summary:
Extend support to track blob files in SST File manager.
 This PR notifies SstFileManager whenever a new blob file is created,
 via OnAddFile and  an obsolete blob file deleted via OnDeleteFile
 and delete file via ScheduleFileDeletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037

Test Plan: Add new unit tests

Reviewed By: ltamasi

Differential Revision: D26891237

Pulled By: akankshamahajan15

fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
2021-03-17 20:44:49 -07:00
Mark Callaghan 326670d265 Add new db_bench --benchmarks options for controlling compaction (#8027)
Summary:
The new options are:
* compact0 - compact L0 into L1 using one thread
* compact1 - compact L1 into L2 using one thread
* flush - flush memtable
* waitforcompaction - wait for compaction to finish

These are useful for reproducible benchmarks to help get the LSM tree shape
into a deterministic state. I wrote about this at:
http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027

Reviewed By: riversand963

Differential Revision: D27053861

Pulled By: ajkr

fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e
2021-03-17 09:12:27 -07:00
Peter Dillinger 01c2ec3fcb Add ROCKSDB_GTEST_BYPASS (#8048)
Summary:
This is for cases that do not meet the Facebook criteria for
SKIP (see new comments). Also made ROCKSDB_GTEST_{SKIP,BYPASS} print the
message because gtest doesn't ever seem to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8048

Test Plan: manual inspection of ./ribbon_test output, CI

Reviewed By: mrambacher

Differential Revision: D26953688

Pulled By: pdillinger

fbshipit-source-id: c914eaffe7d419db6ab90a193d474531e23582e5
2021-03-12 16:02:06 -08:00
Peter Dillinger 589ea6bec2 Add BackupEngine API for backup file details (#8042)
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042

Test Plan:
validation of the API added to many existing backup unit
tests

Reviewed By: mrambacher

Differential Revision: D26936577

Pulled By: pdillinger

fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a
2021-03-12 11:03:54 -08:00
Yanqin Jin 82b3888433 Enable backward iterator for keys with user-defined timestamp (#8035)
Summary:
This PR does the following:

- Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet.
- Introduces a new helper API `Comparator::EqualWithoutTimestamp()`.
- Fix a typo in `SetTimestamp()`.
- Add/update unit tests

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps.

1. Disable timestamp, run:
```
./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Results:
> Baseline
> - seekrandom [AVG    6 runs] : 96115 ops/sec;   53.2 MB/sec
> - seekrandom [MEDIAN 6 runs] : 98075 ops/sec;   54.2 MB/sec
>
> This PR
> - seekrandom [AVG    6 runs] : 95521 ops/sec;   52.8 MB/sec
> - seekrandom [MEDIAN 6 runs] : 96338 ops/sec;   53.3 MB/sec

2. Enable timestamp, run:
```
./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Result:
> Baseline: not supported
>
> This PR
> - seekrandom [AVG    6 runs] : 90514 ops/sec;   50.1 MB/sec
> - seekrandom [MEDIAN 6 runs] : 90834 ops/sec;   50.2 MB/sec

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035

Reviewed By: ltamasi

Differential Revision: D26926668

Pulled By: riversand963

fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5
2021-03-10 11:15:46 -08:00
Peter Dillinger 847ca9f964 Make default share_files_with_checksum=true (#8020)
Summary:
New comment for share_files_with_checksum:
// Only used if share_table_files is set to true. Setting to false is
// DEPRECATED and potentially dangerous because in that case BackupEngine
// can lose data if backing up databases with distinct or divergent
// history, for example if restoring from a backup other than the latest,
// writing to the DB, and creating another backup. Setting to true (default)
// prevents these issues by ensuring that different table files (SSTs) with
// the same number are treated as distinct. See
// share_files_with_checksum_naming and ShareFilesNaming.

I have also removed interim option kFlagMatchInterimNaming, which is no
longer needed and was never needed for correct+compatible operation
(just performance).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020

Test Plan:
tests updated. Backward+forward compatibility verified with
SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup
options, and I manually verified shared_checksum in
/tmp/rocksdb_format_compatible_peterd/bak/current/ after run.

Reviewed By: ajkr

Differential Revision: D26786331

Pulled By: pdillinger

fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130
2021-03-09 16:27:13 -08:00
Peter Dillinger 0028e3398b Make format_version=5 new default (#8017)
Summary:
Haven't seen any production issues with new Bloom filter and
it's now > 1 year old (added in 6.6.0).

Updated check_format_compatible.sh and HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017

Test Plan: tests updated (or prior bugs fixed)

Reviewed By: ajkr

Differential Revision: D26762197

Pulled By: pdillinger

fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
2021-03-09 12:42:53 -08:00
Peter Dillinger ce391ff84b Clarifying comments for Read() APIs (#8029)
Summary:
I recently discovered the confusing, undocumented semantics of
Read() functions in the FileSystem and Env APIs. I have added
clarification to the best of my reverse-engineered understanding, and
made a note in HISTORY.md for implementors to check their
implementations, as a subtly non-adherent implementation could lead to
RocksDB quietly ignoring some portion of a file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029

Test Plan: no code changes

Reviewed By: anand1976

Differential Revision: D26831698

Pulled By: pdillinger

fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03
2021-03-05 14:42:19 -08:00
Levi Tamasi cb25bc1128 Update compaction statistics to include the amount of data read from blob files (#8022)
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26801199

Pulled By: ltamasi

fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
2021-03-04 00:43:48 -08:00
matthewvon 4126bdc0e1 Feature: add SetBufferSize() so that managed size can be dynamic (#7961)
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object.  This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows.

The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager.

This change is useful as is.  However, this change is also essential if someone decides they wanted to enable db_write_buffer_size modifications through the DB::SetOptions() API, i.e. no waste taking this as is.

Any format / spacing changes are due to clang-format as required by check-in automation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961

Reviewed By: ajkr

Differential Revision: D26639075

Pulled By: akankshamahajan15

fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
2021-03-03 14:22:11 -08:00
Levi Tamasi a46f080cce Break down the amount of data written during flushes/compactions per file type (#8013)
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26742156

Pulled By: ltamasi

fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
2021-03-02 09:48:00 -08:00
Akanksha Mahajan f19612970d Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003)
Summary:
The checkpointing logic supports passing file level checksums
to the copy_file_cb callback function which is used by the backup code
for detecting corruption during file copies.
However, this is currently implemented only for table files.

This PR extends the checksum retrieval to blob files as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003

Test Plan: Add new test units

Reviewed By: ltamasi

Differential Revision: D26680701

Pulled By: akankshamahajan15

fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
2021-03-01 20:07:07 -08:00
Yanqin Jin cef4a6c49f Compaction filter support for (new) BlobDB (#7974)
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.

The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D26509280

Pulled By: riversand963

fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
2021-02-25 16:32:35 -08:00
Akanksha Mahajan 2772eb7735 Update History.md for VerifyFileChecksums API supporting blob file (#7995)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7995

Reviewed By: ltamasi

Differential Revision: D26625766

Pulled By: akankshamahajan15

fbshipit-source-id: d83c9e77695f4193da979b1ce7103b43bc1dd46c
2021-02-24 10:25:03 -08:00
xinyuliu b085ee13e0 Append all characters not captured by xsputn() in overflow() function (#7991)
Summary:
In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile to be used for std::ostream, previouly only `std::endl` is considered a special case because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` will call `sputc()` and if we further check the internal implementation of `sputc()`, we will see it is
```
int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) {  // put a character
    return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch));
```
As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write, not captured by xsputn, becomes an overflow.

When I run tests on Windows, I found not only `std::endl` will drop into this case, writing an unsigned long long will also call `os.put()` then followed by `sputc()` and eventually call `overflow()`. Therefore, instead of only checking `std::endl`, we should try to append other characters as well unless the appending operation fails.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991

Reviewed By: jay-zhuang

Differential Revision: D26615692

Pulled By: ajkr

fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468
2021-02-23 21:44:48 -08:00
Akanksha Mahajan cd79a00903 Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951)
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file. The readahead starts at 8KB and doubles on every
additional read upto BlockBasedTable::kMaxAutoReadAheadSize which is
256*1024.
This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which
replaces BlockBasedTable::kMaxAutoReadAheadSize and the new option can be
configured.
If max_auto_readahead_size is set 0 then no implicit auto prefetching will
be done. If max_auto_readahead_size provided is less than
8KB (which is initial readahead size used by rocksdb in case of
auto-readahead), readahead size will remain same as max_auto_readahead_size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951

Test Plan: Add new unit test case.

Reviewed By: anand1976

Differential Revision: D26568085

Pulled By: akankshamahajan15

fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
2021-02-23 16:54:08 -08:00
Yanqin Jin 7343eb4a74 Update HISTORY and bump version (#7984)
Summary:
Prepare to cut 6.18.fb branch

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7984

Reviewed By: ajkr

Differential Revision: D26557151

Pulled By: riversand963

fbshipit-source-id: 8c144c807090cdae67e6655e7a17056ce8c50bc0
2021-02-19 19:21:49 -08:00
Andrew Kryczka d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
mrambacher 4bc9df9459 Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936)
Summary:
Added a "only_mutable_options" flag to the ConfigOptions.  When set, the Configurable methods will only look at/update options that are marked as kMutable.

Fixed DB::SetOptions to allow for the update of any mutable TableFactory options.  Fixes https://github.com/facebook/rocksdb/issues/7385.

Added tests for the new flag.  Updated HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936

Reviewed By: akankshamahajan15

Differential Revision: D26389646

Pulled By: mrambacher

fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
2021-02-19 10:29:02 -08:00
Zhichao Cao b0fd1cc45a Introduce a new trace file format (v 0.2) for better extension (#7977)
Summary:
The trace file record and payload encode is fixed, which requires complex backward compatibility resolving. This PR introduce a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatible issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977

Test Plan: make check. tested with old trace file in replay and analyzing.

Reviewed By: anand1976

Differential Revision: D26529948

Pulled By: zhichao-cao

fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
2021-02-18 23:05:35 -08:00
Akanksha Mahajan eacb14a10a Update history.md for bug fix of actual error returned in DB::OpenForReadOnly (#7978)
Summary:
Update history.md for bug fix of actual error returned in DB::OpenForReadOnly

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7978

Reviewed By: jay-zhuang

Differential Revision: D26519195

Pulled By: akankshamahajan15

fbshipit-source-id: 39fd2bcc12ab92a492e8254090b742efa377ed51
2021-02-18 11:42:05 -08:00
Jay Zhuang 59ba104e4a Fix txn `MultiGet()` return un-committed data with snapshot (#7963)
Summary:
TransactionDB uses read callback to filter out un-committed data before
a snapshot. But `MultiGet()` API doesn't use that at all, which causes
returning unwanted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963

Test Plan: Added unittest to reproduce

Reviewed By: anand1976

Differential Revision: D26455851

Pulled By: jay-zhuang

fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
2021-02-18 08:49:00 -08:00
Levi Tamasi ba8008c870 Mention the new BlobDB in HISTORY.md and remove the "under construction" signs (#7969)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7969

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D26467043

Pulled By: ltamasi

fbshipit-source-id: c69a725669d18af6e911743c998e3a1db75948c0
2021-02-16 16:20:22 -08:00
Zhichao Cao d1c510baec Handoff checksum Implementation (#7523)
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: add new unit test, pass make check/ make asan_check

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
Peter Dillinger e4f1e64c30 Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889)
Summary:
Adds support for prefetching data in Ribbon queries,
which especially optimizes batched Ribbon queries for MultiGet
(~222ns/key to ~97ns/key) but also single key queries on cold memory
(~333ns to ~226ns) because many queries span more than one cache line.

This required some refactoring of the query algorithm, and there
does not appear to be a noticeable regression in "hot memory" query
times (perhaps from 48ns to 50ns).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889

Test Plan:
existing unit tests, plus performance validation with
filter_bench:

Each data point is the best of two runs. I saturated the machine
CPUs with other filter_bench runs in the background.

Before:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 125.86
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 48.0111
      Batched, prepared net ns/op: 222.384
      Batched, unprepared net ns/op: 343.908
      Skewed 50% in 1% net ns/op: 252.916
      Skewed 80% in 20% net ns/op: 320.579
      Random filter net ns/op: 332.957

After:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 128.117
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 49.8812
      Batched, prepared net ns/op: 97.1514
      Batched, unprepared net ns/op: 222.025
      Skewed 50% in 1% net ns/op: 197.48
      Skewed 80% in 20% net ns/op: 212.457
      Random filter net ns/op: 226.464

Bloom comparison, for reference:

    $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 35.3042
    Number of filters: 1993
    Total size (MB): 238.488
    Reported total allocated memory (MB): 262.875
    Reported internal fragmentation: 10.2255%
    Bits/key stored: 10.0029
    Prelim FP rate %: 0.965327
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 9.09931
      Batched, prepared net ns/op: 34.21
      Batched, unprepared net ns/op: 88.8564
      Skewed 50% in 1% net ns/op: 139.75
      Skewed 80% in 20% net ns/op: 181.264
      Random filter net ns/op: 173.88

Reviewed By: jay-zhuang

Differential Revision: D26378710

Pulled By: pdillinger

fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69
2021-02-10 21:04:56 -08:00
Andrew Kryczka c16d5a4fda Makefile support to statically link external plugin code (#7918)
Summary:
Added support for detecting plugins linked in the "plugin/" directory and building them from our Makefile in a standardized way. See "plugin/README.md" for details. An example of a plugin that can be built in this way can be found in https://github.com/ajkr/dedupfs.

There will be more to do in terms of making this process more convenient and adding support for CMake.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7918

Test Plan: my own plugin (https://github.com/ajkr/dedupfs) and also heard this patch worked with ZenFS.

Reviewed By: pdillinger

Differential Revision: D26189969

Pulled By: ajkr

fbshipit-source-id: 6624d4357d0ffbaedb42f0d12a3fcb737c78f758
2021-02-10 08:35:34 -08:00
Jay Zhuang cf160b98e1 Add full_history_ts_low option to compaction (#7884)
Summary:
The full_history_ts_low is used for user-defined timestamp GC
compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884

Reviewed By: ltamasi

Differential Revision: D25982553

Pulled By: jay-zhuang

fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee
2021-02-08 13:45:48 -08:00
Levi Tamasi 974458891c Revert "Turn on memtable bloom filter by default. (#6584)" (#7939)
Summary:
This reverts commit ee79a28963.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939

Reviewed By: siying

Differential Revision: D26298564

Pulled By: ltamasi

fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd
2021-02-06 22:34:30 -08:00
Andrew Kryczka 8d2bbdd04f Allow range deletions in `*TransactionDB` only when safe (#7929)
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.

Fixes https://github.com/facebook/rocksdb/issues/7913.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929

Test Plan: unit tests covering the cases it's permitted/rejected

Reviewed By: ltamasi

Differential Revision: D26240254

Pulled By: ajkr

fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
2021-02-05 15:57:26 -08:00
sdong ee79a28963 Turn on memtable bloom filter by default. (#6584)
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584

Test Plan: Run all existing tests.

Reviewed By: pdillinger

Differential Revision: D20626739

fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
2021-02-05 12:59:46 -08:00
Andrew Kryczka 78ee8564ad Integrity protection for live updates to WriteBatch (#7748)
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).

The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.

When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748

Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.

Reviewed By: pdillinger

Differential Revision: D25754492

Pulled By: ajkr

fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
2021-01-29 12:18:58 -08:00
mrambacher 0a9a05ae12 Make builds reproducible (#7866)
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035

Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
      - If the GIT branch is "clean" (no changes), the date of the last git commit
      - If the branch is not clean, the current date
 - Added APIs to access the "build information", rather than accessing the strings directly.

The build_version.cc file is now regenerated whenever the library objects are rebuilt.

Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866

Reviewed By: pdillinger

Differential Revision: D26086565

Pulled By: mrambacher

fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
2021-01-28 17:42:16 -08:00
Zhichao Cao 95013df278 Do not set bg error for compaction in retryable IO Error case (#7899)
Summary:
When retryable IO error occurs during compaction, it is mapped to soft error and set the BG error. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, When retryable IO error occurs during compaction, BG error is not set. User will be informed the error via EventHelper.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899

Test Plan: tested with error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26094097

Pulled By: zhichao-cao

fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f
2021-01-27 17:58:12 -08:00
mrambacher 12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Levi Tamasi 19076c95aa Update HISTORY.md for PR 7888 (#7890)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7890

Reviewed By: ajkr

Differential Revision: D26005509

Pulled By: ltamasi

fbshipit-source-id: e7eb732180d447900788d0e3a17dfd1c3f1e708a
2021-01-21 14:20:10 -08:00
Andrew Kryczka e18a4df62a workaround race conditions during `PeriodicWorkScheduler` registration (#7888)
Summary:
This provides a workaround for two race conditions that will be fixed in
a more sophisticated way later. This PR:

(1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe.
(2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888

Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881

Reviewed By: jay-zhuang

Differential Revision: D25990891

Pulled By: ajkr

fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38
2021-01-21 08:50:38 -08:00
Cheng Chang b0c43e7081 Update HISTORY.md (#7887)
Summary:
Mention the forward compatibility fix for WAL related version edits.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7887

Reviewed By: ltamasi

Differential Revision: D25982494

Pulled By: cheng-chang

fbshipit-source-id: 4be292aa4bf7fbc8a27c0bef1e7a98ad3ea8e1fa
2021-01-20 14:33:59 -08:00
Cheng Chang 928dea0e32 Update HISTORY.md (#7874)
Summary:
I find that the `track_and_verify_wals_in_manifest` option was only removed from 6.15 branch's HISTORY, but still appears under 6.15 in master branch's HISTORY. It should be moved to 6.16 since that's when the feature should be available.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7874

Reviewed By: jay-zhuang

Differential Revision: D25935971

Pulled By: cheng-chang

fbshipit-source-id: fe8bf1ec111597f9207e109aa3be65f8f919f1fd
2021-01-19 16:10:13 -08:00
Levi Tamasi ffe4906192 Update version to 6.17 (#7871)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7871

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D25932233

Pulled By: ltamasi

fbshipit-source-id: 8b80b0638a4f34f21a27ba80b3eda7d75410b2e8
2021-01-15 18:53:00 -08:00
anand76 8e7b068ecc Make ldb load column family options from OPTIONS file (#7847)
Summary:
When the --try_load_options is used in conjunction with the
--column_family option, ldb incorrectly sets the ColumnFamilyOptions for
that column family to defaults. This PR fixes that by retaining from the
OPTIONS file and applying command line overrides.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7847

Test Plan: Add a unit test in ldb_cmd_test

Reviewed By: ajkr

Differential Revision: D25874720

Pulled By: anand1976

fbshipit-source-id: 04bcf23b55e5a30b5b6a59b0e5cb4faef3da7429
2021-01-11 20:56:34 -08:00
Cheng Chang fdbebdf484 Add note for PR 7789 in history (#7855)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7855

Reviewed By: ajkr

Differential Revision: D25872797

Pulled By: cheng-chang

fbshipit-source-id: 82159a13f897aaaad5f3c70c7dfa822e073bc623
2021-01-11 13:34:15 -08:00
Adam Retter 4926b33742 Improvements to Env::GetChildren (#7819)
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html

There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.

Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819

Reviewed By: ajkr

Differential Revision: D25837394

Pulled By: jay-zhuang

fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
2021-01-09 09:44:34 -08:00
Akanksha Mahajan 8ed680bdb0 Add new API to report dummy entries size in cache in WriteBufferManager (#7837)
Summary:
Add new API WriteBufferManager::dummy_entries_in_cache_usage() which reports the dummy entries size stored in cache to account for DataBlocks in WriteBufferManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837

Test Plan: Updated test ./write_buffer_manager_test

Reviewed By: ajkr

Differential Revision: D25794312

Pulled By: akankshamahajan15

fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b
2021-01-08 13:26:24 -08:00
Zhichao Cao 48c0843e69 Treat File Scope Write IO Error the same as Retryable IO Error (#7840)
Summary:
In RocksDB, when IO error happens, the flags of IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicate that the error is constrained in the file level. Since RocksDB does not continues write data to a file when any IO Error happens, File Scope IO Error can be treated the same as Retryable IO Error. Adding the logic to ErrorHandler::SetBGError to include the file scope IO Error in its error handling logic, which is the same as retryable IO Error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840

Test Plan: added new unit tests in error_handler_fs_test. make check

Reviewed By: anand1976

Differential Revision: D25820481

Pulled By: zhichao-cao

fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806
2021-01-07 16:31:33 -08:00
Adam Retter 6e0f62f2b6 Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715)
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
2021-01-06 14:15:02 -08:00
Andrew Kryczka 225abffd8f Verify file checksum generator name (#7824)
Summary:
Previously we only had a debug assertion to check the right generator was being used for verification. However a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824

Reviewed By: zhichao-cao

Differential Revision: D25740254

Pulled By: ajkr

fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f
2021-01-04 11:51:50 -08:00
Jay Zhuang a8aeefd0fd Update release version to 6.16 (#7782)
Summary:
Update release version to 6.8

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7782

Reviewed By: siying

Differential Revision: D25648579

Pulled By: jay-zhuang

fbshipit-source-id: c536d606868b95c5fb2ae8f19c17eb259d67bc51
2020-12-19 12:39:21 -08:00
Peter Dillinger 4d1ac19e3d aggregated-table-properties with GetMapProperty (#7779)
Summary:
So that we can more easily get aggregate live table data such
as total filter, index, and data sizes.

Also adds ldb support for getting properties

Also fixed some missing/inaccurate related comments in db.h

For example:

    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
    rocksdb.aggregated-table-properties.data_size: 102871
    rocksdb.aggregated-table-properties.filter_size: 0
    rocksdb.aggregated-table-properties.index_partitions: 0
    rocksdb.aggregated-table-properties.index_size: 2232
    rocksdb.aggregated-table-properties.num_data_blocks: 100
    rocksdb.aggregated-table-properties.num_deletions: 0
    rocksdb.aggregated-table-properties.num_entries: 15000
    rocksdb.aggregated-table-properties.num_merge_operands: 0
    rocksdb.aggregated-table-properties.num_range_deletions: 0
    rocksdb.aggregated-table-properties.raw_key_size: 288890
    rocksdb.aggregated-table-properties.raw_value_size: 198890
    rocksdb.aggregated-table-properties.top_level_index_size: 0
    $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
    rocksdb.aggregated-table-properties-at-level1.data_size: 80909
    rocksdb.aggregated-table-properties-at-level1.filter_size: 0
    rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
    rocksdb.aggregated-table-properties-at-level1.index_size: 1787
    rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
    rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
    rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
    rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
    rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
    rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
    rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
    $

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779

Test Plan: Added a test to ldb_test.py

Reviewed By: jay-zhuang

Differential Revision: D25653103

Pulled By: pdillinger

fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 08:00:14 -08:00