Commit Graph

992 Commits

Author SHA1 Message Date
Hui Xiao 632746bb5b Improve DBTest.DynamicLevelCompressionPerLevel (#13044)
Summary:
**Context/Summary:**

A part of this test is to verify compression conditionally happens depending on the shape of the LSM when `options.level_compaction_dynamic_level_bytes = true;`. It uses the total file size to determine whether compression has happened or not. This involves some hard-coded math hard to understand. This PR replaces those with statistics that directly shows whether compression has happened or not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13044

Test Plan: Existing test

Reviewed By: jaykorean

Differential Revision: D63666361

Pulled By: hx235

fbshipit-source-id: 8c9b1bea9b06ff1e3ed95c576aec6705159af137
2024-10-09 12:51:19 -07:00
Peter Dillinger dd23e84cad Re-implement GetApproximateMemTableStats for skip lists (#13047)
Summary:
GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations;   11.7 MB/s
approximatememtablestats :       3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 2344.1611  StdDev: 26587.27
Min: 0  Median: 965.8555  Max: 835273
Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97
------------------------------------------------------
[       0,       1 ]   131344  13.134%  13.134% ###
(       1,       2 ]      115   0.011%  13.146%
(       2,       3 ]      106   0.011%  13.157%
(       3,       4 ]      190   0.019%  13.176%
(       4,       6 ]      214   0.021%  13.197%
(       6,      10 ]      522   0.052%  13.249%
(      10,      15 ]      748   0.075%  13.324%
(      15,      22 ]     1002   0.100%  13.424%
(      22,      34 ]     1948   0.195%  13.619%
(      34,      51 ]     3067   0.307%  13.926%
(      51,      76 ]     4213   0.421%  14.347%
(      76,     110 ]     5721   0.572%  14.919%
(     110,     170 ]    11375   1.137%  16.056%
(     170,     250 ]    17928   1.793%  17.849%
(     250,     380 ]    36597   3.660%  21.509% #
(     380,     580 ]    77882   7.788%  29.297% ##
(     580,     870 ]   160193  16.019%  45.317% ###
(     870,    1300 ]   210098  21.010%  66.326% ####
(    1300,    1900 ]   167461  16.746%  83.072% ###
(    1900,    2900 ]    78678   7.868%  90.940% ##
(    2900,    4400 ]    47743   4.774%  95.715% #
(    4400,    6600 ]    17650   1.765%  97.480%
(    6600,    9900 ]    11895   1.190%  98.669%
(    9900,   14000 ]     4993   0.499%  99.168%
(   14000,   22000 ]     2384   0.238%  99.407%
(   22000,   33000 ]     1966   0.197%  99.603%
(   50000,   75000 ]     2968   0.297%  99.900%
(  570000,  860000 ]      999   0.100% 100.000%

readrandom   :       1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations;    8.2 MB/s (1000000 of 1000000 found)
```

Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast.

I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from true count is around 20% or less, and roughly the CPU cost of two memtable point look-ups. See code comments for detail.

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations;   11.0 MB/s
approximatememtablestats :       2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 1073.5158  StdDev: 197.80
Min: 608  Median: 1079.9506  Max: 2176
Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00
------------------------------------------------------
(     580,     870 ]   134848  13.485%  13.485% ###
(     870,    1300 ]   747868  74.787%  88.272% ###############
(    1300,    1900 ]   116536  11.654%  99.925% ##
(    1900,    2900 ]      748   0.075% 100.000%

readrandom   :       1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations;    8.1 MB/s (1000000 of 1000000 found)
```

We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look how this behavior generalizes, first *much* larger range:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000
filluniquerandom :       1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations;   11.7 MB/s
approximatememtablestats :       1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations;
Reported entry count stats (expected 30000):
Count: 1000000 Average: 31098.8795  StdDev: 3601.47
Min: 21504  Median: 29333.9303  Max: 43008
Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00
------------------------------------------------------
(   14000,   22000 ]      408   0.041%   0.041%
(   22000,   33000 ]   749327  74.933%  74.974% ###############
(   33000,   50000 ]   250265  25.027% 100.000% #####

readrandom   :       1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations;    8.5 MB/s (989989 of 1000000 found)
```

This is *even faster* and relatively *more accurate*, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences:
* When actual number in range is >= 40, the minimum return value is 40.
* When the actual is <= 10, it is guaranteed to return that actual number.
```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75
...
filluniquerandom :       1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations;   11.4 MB/s
approximatememtablestats :       3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations;
Reported entry count stats (expected 75):
Count: 1000000 Average: 75.1210  StdDev: 15.02
Min: 40  Median: 71.9395  Max: 256
Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78
------------------------------------------------------
(      34,      51 ]    38867   3.887%   3.887% #
(      51,      76 ]   550554  55.055%  58.942% ###########
(      76,     110 ]   398854  39.885%  98.828% ########
(     110,     170 ]    11353   1.135%  99.963%
(     170,     250 ]      364   0.036%  99.999%
(     250,     380 ]        8   0.001% 100.000%

readrandom   :       1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations;    8.7 MB/s (999974 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25
...
filluniquerandom :       1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations;   10.8 MB/s
approximatememtablestats :       5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations;
Reported entry count stats (expected 25):
Count: 1000000 Average: 26.2392  StdDev: 4.58
Min: 25  Median: 28.4590  Max: 72
Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00
------------------------------------------------------
(      22,      34 ]   928936  92.894%  92.894% ###################
(      34,      51 ]    67960   6.796%  99.690% #
(      51,      76 ]     3104   0.310% 100.000%

readrandom   :       1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations;    8.6 MB/s (1000000 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10
...
filluniquerandom :       1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations;    9.9 MB/s
approximatememtablestats :       3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations;
Reported entry count stats (expected 10):
Count: 1000000 Average: 10.0000  StdDev: 0.00
Min: 10  Median: 10.0000  Max: 10
Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00
------------------------------------------------------
(       6,      10 ]  1000000 100.000% 100.000% ####################

readrandom   :       1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
```

Remarkably consistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047

Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated.

Reviewed By: cbi42

Differential Revision: D63722003

Pulled By: pdillinger

fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26
2024-10-02 14:25:50 -07:00
Peter Dillinger 21eb82ebec Disable "uncache" behavior in DB shutdown (#12751)
Summary:
Crash test showed a potential use-after-free where a file marked as obsolete and eligible for uncache on destruction is destroyed in the VersionSet destructor, which only happens as part of DB shutdown. At that point, the in-memory column families have already been destroyed, so attempting to uncache could use-after-free on stuff like getting the `user_comparator()` from the `internal_comparator()`.

I attempted to make it smarter, but wasn't able to untangle the destruction dependencies in a way that was safe, understandable, and maintainable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12751

Test Plan:
Reproduced by adding uncache_aggressiveness to an existing (but otherwise unrelated) test. This makes it a fair regression test.

Also added testing to ensure that trivial moves and DB close & reopen are well behaved with uncache_aggressiveness. Specifically, this issue doesn't seem to be because things are uncached inappropriately in those cases.

Reviewed By: ltamasi

Differential Revision: D58390058

Pulled By: pdillinger

fbshipit-source-id: 66ac9cb13bf02638fa80ee5b7218153d8bc7cfd3
2024-06-11 15:57:40 -07:00
Changyu Bi fecb10c2fa Improve universal compaction sorted-run trigger (#12477)
Summary:
Universal compaction currently uses `level0_file_num_compaction_trigger` for two purposes:
1. the trigger for checking if there is any compaction to do, and
2. the limit on the number of sorted runs. RocksDB will do compaction to keep the number of sorted runs no more than the value of this option.

This can make the option inflexible. A value that is too small causes higher write amp: more compactions to reduce the number of sorted runs. A value that is too big delays potential compaction work and causes worse read performance. This PR introduce an option `CompactionOptionsUniversal::max_read_amp` for only the second purpose: to specify
the hard limit on the number of sorted runs.

For backward compatibility, `max_read_amp = -1` by default, which means to fallback to the current behavior.
When `max_read_amp > 0`,`level0_file_num_compaction_trigger` will only serve as a trigger to find potential compaction.
When `max_read_amp = 0`, RocksDB will auto-tune the limit on the number of sorted runs. The estimation is based on DB size, write_buffer_size and size_ratio, so it is adaptive to the size change of the DB. See more in `UniversalCompactionBuilder::PickCompaction()`.
Alternatively, users now can configure `max_read_amp` to a very big value and keep `level0_file_num_compaction_trigger` small. This will allow `size_ratio` and `max_size_amplification_percent` to control the number of sorted runs. This essentially disables compactions with reason kUniversalSortedRunNum.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12477

Test Plan:
* new unit test
* existing unit test for default behavior
* updated crash test with the new option
* benchmark:
  * Create a DB that is roughly 24GB in the last level. When `max_read_amp = 0`, we estimate that the DB needs 9 levels to avoid excessive compactions to reduce the number of sorted runs.
  * We then run fillrandom to ingest another 24GB data to compare write amp.
     * case 1: small level0 trigger: `level0_file_num_compaction_trigger=5, max_read_amp=-1`
       * write-amp: 4.8
     * case 2: auto-tune: `level0_file_num_compaction_trigger=5, max_read_amp=0`
       *  write-amp: 3.6
     * case 3: auto-tune with minimal trigger: `level0_file_num_compaction_trigger=1, max_read_amp=0`
       *  write-amp: 3.8
     * case 4: hard-code a good value for trigger: `level0_file_num_compaction_trigger=9`
       * write-amp: 2.8
```
Case 1:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   1.0      0.0     0.0      0.0      22.6     22.6       0.0   1.0      0.0    163.2    141.94            111.10       108    1.314       0      0       0.0       0.0
 L45      8/0    1.81 GB   0.0     39.6    11.1     28.5      39.3     10.8       0.0   3.5    209.0    207.3    194.25            191.29        43    4.517    348M  2498K       0.0       0.0
 L46     13/0    3.12 GB   0.0     15.3     9.5      5.8      15.0      9.3       0.0   1.6    203.1    199.3     77.13             75.88        16    4.821    134M  2362K       0.0       0.0
 L47     19/0    4.68 GB   0.0     15.4    10.5      4.9      14.7      9.8       0.0   1.4    204.0    194.9     77.38             76.15         8    9.673    135M  5920K       0.0       0.0
 L48     38/0    9.42 GB   0.0     19.6    11.7      7.9      17.3      9.4       0.0   1.5    206.5    182.3     97.15             95.02         4   24.287    172M    20M       0.0       0.0
 L49     91/0   22.70 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum    169/0   41.74 GB   0.0     89.9    42.9     47.0     109.0     61.9       0.0   4.8    156.7    189.8    587.85            549.45       179    3.284    791M    31M       0.0       0.0

Case 2:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0   214.47 MB   1.2      0.0     0.0      0.0      22.6     22.6       0.0   1.0      0.0    164.5    140.81            109.98       108    1.304       0      0       0.0       0.0
 L44      0/0    0.00 KB   0.0      1.3     1.3      0.0       1.2      1.2       0.0   1.0    206.1    204.9      6.24              5.98         3    2.081     11M    51K       0.0       0.0
 L45      4/0   844.36 MB   0.0      7.1     5.4      1.7       7.0      5.4       0.0   1.3    194.6    192.9     37.41             36.00        13    2.878     62M   489K       0.0       0.0
 L46     11/0    2.57 GB   0.0     14.6     9.8      4.8      14.3      9.5       0.0   1.5    193.7    189.8     77.09             73.54        17    4.535    128M  2411K       0.0       0.0
 L47     24/0    5.81 GB   0.0     19.8    12.0      7.8      18.8     11.0       0.0   1.6    191.4    181.1    106.19            101.21         9   11.799    174M  9166K       0.0       0.0
 L48     38/0    9.42 GB   0.0     19.6    11.8      7.9      17.3      9.4       0.0   1.5    197.3    173.6    101.97             97.23         4   25.491    172M    20M       0.0       0.0
 L49     91/0   22.70 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum    169/0   41.54 GB   0.0     62.4    40.3     22.1      81.3     59.2       0.0   3.6    136.1    177.2    469.71            423.94       154    3.050    549M    32M       0.0       0.0

Case 3:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   5.0      0.0     0.0      0.0      22.6     22.6       0.0   1.0      0.0    163.8    141.43            111.13       108    1.310       0      0       0.0       0.0
 L44      0/0    0.00 KB   0.0      0.8     0.8      0.0       0.8      0.8       0.0   1.0    201.4    200.2      4.26              4.19         2    2.130   7360K    33K       0.0       0.0
 L45      4/0   844.38 MB   0.0      6.3     5.0      1.2       6.2      5.0       0.0   1.2    202.0    200.3     31.81             31.50        12    2.651     55M   403K       0.0       0.0
 L46      7/0    1.62 GB   0.0     13.3     8.8      4.6      13.1      8.6       0.0   1.5    198.9    195.7     68.72             67.89        17    4.042    117M  1696K       0.0       0.0
 L47     24/0    5.81 GB   0.0     21.7    12.9      8.8      20.6     11.8       0.0   1.6    198.5    188.6    112.04            109.97        12    9.336    191M  9352K       0.0       0.0
 L48     41/0   10.14 GB   0.0     24.8    13.0     11.8      21.9     10.1       0.0   1.7    198.6    175.6    127.88            125.36         6   21.313    218M    25M       0.0       0.0
 L49     91/0   22.70 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum    167/0   41.10 GB   0.0     67.0    40.5     26.4      85.4     58.9       0.0   3.8    141.1    179.8    486.13            450.04       157    3.096    589M    36M       0.0       0.0

Case 4:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.7      0.0     0.0      0.0      22.6     22.6       0.0   1.0      0.0    158.6    146.02            114.68       108    1.352       0      0       0.0       0.0
 L42      0/0    0.00 KB   0.0      1.7     1.7      0.0       1.7      1.7       0.0   1.0    185.4    184.3      9.25              8.96         4    2.314     14M    67K       0.0       0.0
 L43      0/0    0.00 KB   0.0      2.5     2.5      0.0       2.5      2.5       0.0   1.0    197.8    195.6     13.01             12.65         4    3.253     22M   202K       0.0       0.0
 L44      4/0   844.40 MB   0.0      4.2     4.2      0.0       4.1      4.1       0.0   1.0    188.1    185.1     22.81             21.89         5    4.562     36M   503K       0.0       0.0
 L45     13/0    3.12 GB   0.0      7.5     6.5      1.0       7.2      6.2       0.0   1.1    188.7    181.8     40.69             39.32         5    8.138     65M  2282K       0.0       0.0
 L46     17/0    4.18 GB   0.0      8.3     7.1      1.2       7.9      6.6       0.0   1.1    192.2    181.8     44.23             43.06         4   11.058     73M  3846K       0.0       0.0
 L47     22/0    5.34 GB   0.0      8.9     7.5      1.4       8.2      6.8       0.0   1.1    189.1    174.1     48.12             45.37         3   16.041     78M  6098K       0.0       0.0
 L48     27/0    6.58 GB   0.0      9.2     7.6      1.6       8.2      6.6       0.0   1.1    195.2    172.9     48.52             47.11         2   24.262     81M  9217K       0.0       0.0
 L49     91/0   22.70 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum    174/0   42.74 GB   0.0     42.3    37.0      5.3      62.4     57.1       0.0   2.8    116.3    171.3    372.66            333.04       135    2.760    372M    22M       0.0       0.0

setup:
./db_bench --benchmarks=fillseq,compactall,waitforcompaction --num=200000000 --compression_type=none --disable_wal=1 --compaction_style=1 --num_levels=50 --target_file_size_base=268435456 --max_compaction_bytes=6710886400 --level0_file_num_compaction_trigger=10 --write_buffer_size=268435456 --seed 1708494134896523

benchmark:
./db_bench --benchmarks=overwrite,waitforcompaction,stats --num=200000000 --compression_type=none --disable_wal=1 --compaction_style=1 --write_buffer_size=268435456 --level0_file_num_compaction_trigger=5 --target_file_size_base=268435456 --use_existing_db=1 --num_levels=50 --writes=200000000 --universal_max_read_amp=-1 --seed=1716488324800233

```

Reviewed By: ajkr

Differential Revision: D55370922

Pulled By: cbi42

fbshipit-source-id: 9be69979126b840d08e93e7059260e76a878bb2a
2024-05-24 10:10:31 -07:00
Changyu Bi ffd7930312 Add more debug print to `DBTestWithParam.ThreadStatusSingleCompaction` (#12661)
Summary:
This test is flaky and a recent failure prints the following:
```
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 1842811, thread status:
thread id: 1842803, thread status:
db/db_test.cc:4697: Failure
Expected equality of these values:
  op_count
    Which is: 0
  expected_count
    Which is: 1
[  FAILED  ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (307 ms)
```
Empty thread status implies that operation_type of the threads are all OP_UNKNOWN. From 3ed46e0668/monitoring/thread_status_updater.cc (L197), this can be due to thread_data->operation_type being OP_UNKNOWN or that thread_data->cf_key it not in `cf_info_map_`, potentially due to how cf_key_ is accessed with relaxed memory order. This PR adds some debug print to print the cf_name to check this.

This PR also prints num_running_compaction and lsm state to check if a compaction is indeed running, and removes some not needed options and ensures that exactly 4 L0 files are created.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12661

Test Plan:
- Cannot repro the failure locally: `gtest-parallel --repeat=10000 --workers=200 ./db_test --gtest_filter="*ThreadStatusSingleCompaction*"`
- New failure message will look like:
```
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
op_count: 1, expected_count 2
thread id: 6104100864, thread status: , cf_name
thread id: 6103527424, thread status: Compaction, cf_name default
running compaction: 1 lsm state: 4
db/db_test.cc:4885: Failure
Value of: match
  Actual: false
Expected: true
[  FAILED  ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (115 ms)
```

Reviewed By: hx235

Differential Revision: D57422755

Pulled By: cbi42

fbshipit-source-id: 635663f26052b20e485dfa06a7c0f1f318ac1099
2024-05-16 17:23:56 -07:00
Changyu Bi 5bf2c00a35 Clarify manual compaction and file ingestion behavior with FIFO compaction (#12618)
Summary:
For manual compaction, FIFO compaction will always skip key range overlapping checking with SST files. If CompactRange() is called with CompactionRangeOptions::change_level=true, a CF with FIFO compaction will now return Status::NotSupported.

For file ingestion, we will always ingest into L0. Previously, it's possible to ingest files into non-L0 levels with FIFO compaction.

These changes also help to fix [this](a178d15baf/db/db_impl/db_impl_compaction_flush.cc (L1269)) assertion failure in crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12618

Test Plan: added unit tests to verify the new behavior.

Reviewed By: hx235

Differential Revision: D56962401

Pulled By: cbi42

fbshipit-source-id: 19812a1509650b4162b379ca5bee02f2e9d9569d
2024-05-07 12:00:15 -07:00
Jay Huh d34712e0ac MultiCfIterator - AttributeGroupIter Impl & CoalescingIter Optimization (#12534)
Summary:
Continuing from the previous MultiCfIterator Implementations - (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480 #12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`.

Lifting `UNDER CONSTRUCTION - DO NOT USE` comment by replacing it with `EXPERIMENTAL`

Here are some implementation details:
- `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration.
- `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap.
- `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap.
- Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once.
- This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534

Test Plan:
Uncommented the assertions in `verifyAttributeGroupIterator()`

```
./multi_cf_iterator_test
```

Reviewed By: ltamasi

Differential Revision: D56089019

Pulled By: jaykorean

fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054
2024-04-16 08:45:38 -07:00
Jay Huh 58a98bded9 MultiCFIterator Refactor - CoalescingIterator & AttributeGroupIterator (#12480)
Summary:
There are a couple of reasons to modify the current implementation of the MultiCfIterator, which implements the generic `Iterator` interface.
- The default behavior of `value()`/`columns()` returning data from different Column Families for different keys can be prone to errors, even though there might be valid use cases where users do not care about the origin of the value/columns.
- The `attribute_groups()` API, which is not yet implemented, will not be useful for a single-CF iterator.

In this PR, we are implementing the following changes:
- `IteratorBase` introduced, which includes all basic iterator functions except `value()` and `columns()`.
- `Iterator`, which now inherits from `IteratorBase`, includes `value()` and `columns()`.
- New public interface `AttributeGroupIterator` inherits from `IteratorBase` and additionally includes `attribute_groups()` (to be implemented).
- Renamed former `MultiCfIterator` to `CoalescingIterator` which inherits from `Iterator`
- Existing MultiCfIteratorTest has been split into two - `CoalescingIteratorTest` and `AttributeGroupIteratorTest`.
- Moved AttributeGroup related code from `wide_columns.h` to a new file, `attribute_groups.h`.

Some Implementation Details
- `MultiCfIteratorImpl` takes two functions - `populate_func` and `reset_func` and use them to populate `value_` and `columns_` in CoalescingIterator and `attribute_groups_` in AttributeGroupIterator. In CoalescingIterator, populate_func is `Coalesce()`, in AttributeGroupIterator populate_func is `AddToAttributeGroups()`. `reset_func` clears populated value_, columns_ and attribute_groups_ accordingly.
- `Coalesce()` merge sorts columns from multiple CFs when a key exists in more than on CFs. column that appears in later CF overwrites the prior ones.

For example, if CF1 has `"key_1" ==> {"col_1": "foo",  "col_2", "baz"}` and CF2 has `"key_1" ==> {"col_2": "quux", "col_3", "bla"}`, and when the iterator is at `key_1`, `columns()` will return `{"col_1": "foo", "col_2", "quux", "col_3", "bla"}`

In this example, `value()` will be empty, because none of them have values for `kDefaultColumnName`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12480

Test Plan:
## Unit Test
```
./multi_cf_iterator_test
```

## Performance Test

To make sure this change does not impact existing `Iterator` performance

**Build**
```
$> make -j64 release
```
**Setup**
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
**Run**
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```

**Before the change**
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.519 micros/op 1927904 ops/sec 0.519 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       5.302 micros/op 188589 ops/sec 5.303 seconds 1000000 operations; (0 of 1000000 found)
```
**After the change**
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.497 micros/op 2011012 ops/sec 0.497 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       5.252 micros/op 190405 ops/sec 5.252 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: ltamasi

Differential Revision: D55353909

Pulled By: jaykorean

fbshipit-source-id: 8d7786ffee09e022261ce34aa60e8633685e1946
2024-04-11 11:34:04 -07:00
Peter Dillinger b515a5db3f Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470)
Summary:
ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want.

So this adds ScopedArenaPtr<T> to replace those uses.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470

Test Plan: CI (including ASAN/UBSAN)

Reviewed By: jowlyzhang

Differential Revision: D55254362

Pulled By: pdillinger

fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18
2024-03-22 13:40:42 -07:00
Jay Huh 3412195367 Introduce MultiCfIterator (#12153)
Summary:
This PR introduces a new implementation of `Iterator` via a new public API called `NewMultiCfIterator()`. The new API takes a vector of column family handles to build a cross-column-family iterator, which internally maintains multiple `DBIter`s as child iterators from a consistent database state. When a key exists in multiple column families, the iterator selects the value (and wide columns) from the first column family containing the key, following the order provided in the `column_families` parameter. Similar to the merging iterator, a min heap is used to iterate across the child iterators. Backward iteration and direction change functionalities will be implemented in future PRs.

The comparator used to compare keys across different column families will be derived from the iterator of the first column family specified in `column_families`. This comparator will be checked against the comparators from all other column families that the iterator will traverse. If there's a mismatch with any of the comparators, the initialization of the iterator will fail.

Please note that this PR is not enough for users to start using `MultiCfIterator`. The `MultiCfIterator` and related APIs are still marked as "**DO NOT USE - UNDER CONSTRUCTION**". This PR is just the first of many PRs that will follow soon.

This PR includes the following:
- Introduction and partial implementation of the `MultiCfIterator`, which implements the generic `Iterator` interface. The implementation includes the construction of the iterator, `SeekToFirst()`, `Next()`, `Valid()`, `key()`, `value()`, and `columns()`.
- Unit tests to verify iteration across multiple column families in two distinct scenarios: (1) keys are unique across all column families, and (2) the same keys exist in multiple column families.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12153

Reviewed By: pdillinger

Differential Revision: D52308697

Pulled By: jaykorean

fbshipit-source-id: b03e69f13b40af5a8f0598d0f43a0bec01ef8294
2024-03-05 10:22:43 -08:00
anand76 d227276147 Deprecate some variants of Get and MultiGet (#12327)
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327

Reviewed By: pdillinger

Differential Revision: D53828151

Pulled By: anand1976

fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
2024-02-16 09:21:06 -08:00
Yu Zhang 4bea83aa44 Remove the force mode for EnableFileDeletions API (#12337)
Summary:
There is no strong reason for user to need this mode while on the other hand, its behavior is destructive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337

Reviewed By: hx235

Differential Revision: D53630393

Pulled By: jowlyzhang

fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
2024-02-13 18:36:25 -08:00
Peter Dillinger 54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`.  If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Andrew Kryczka f9d45358ca Removed `check_flush_compaction_key_order` (#12311)
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311

Reviewed By: cbi42

Differential Revision: D53233379

Pulled By: ajkr

fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
2024-01-31 16:30:26 -08:00
Peter Dillinger 76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
Changyu Bi 3ef9092487 Print additional information when flaky test DBTestWithParam.ThreadStatusSingleCompaction fails (#12268)
Summary:
The test is [flaky](https://github.com/facebook/rocksdb/actions/runs/7616272304/job/20742657041?pr=12257&fbclid=IwAR1vNI1rSRVKnOsXs0WCPklqTkBXxlwS1GMJgWWe7D8dtAvh6e6wxk067FY) but I could not reproduce the test failure. Add some debug print to make the next failure more helpful

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12268

Test Plan:
```
check print works when test fails:
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 6134067200, thread status:
thread id: 6133493760, thread status: Compaction
db/db_test.cc:4680: Failure
Expected equality of these values:
  op_count
    Which is: 1
  expected_count
    Which is: 0
```

Reviewed By: hx235

Differential Revision: D52987503

Pulled By: cbi42

fbshipit-source-id: 33b369796f9b97155578b45167e722ddcde93594
2024-01-23 10:07:06 -08:00
Andrew Kryczka 5a9ecf6614 Automated modernization (#12210)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12210

Reviewed By: hx235

Differential Revision: D52559771

Pulled By: ajkr

fbshipit-source-id: 1ccdd3a0180cc02bc0441f20b0e4a1db50841b03
2024-01-05 11:53:57 -08:00
Changyu Bi d5bc30befa Enforce status checking after Valid() returns false for IteratorWrapper (#11975)
Summary:
... when compiled with ASSERT_STATUS_CHECKED = 1.

The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782.

Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975

Test Plan:
* `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check`
* I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task.

Reviewed By: ajkr

Differential Revision: D50383790

Pulled By: cbi42

fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce
2023-10-18 09:38:38 -07:00
anand76 269478ee46 Support compressed and local flash secondary cache stacking (#11812)
Summary:
This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.

The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded.

We can easily implement additional admission policies if desired.

Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812

Reviewed By: pdillinger

Differential Revision: D49461842

Pulled By: anand1976

fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-21 20:30:53 -07:00
Yu Zhang 32fc1e6cdc Add unit test for the multiget fix when ReadOptions.read_tier == kPersistedTier and disableWAL == true (#11854)
Summary:
Add unit tests for the fix in https://github.com/facebook/rocksdb/pull/11700

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11854

Reviewed By: anand1976

Differential Revision: D49392462

Pulled By: jowlyzhang

fbshipit-source-id: bd6978e4888074fa5417f3ccda7a78a2c7eee9c6
2023-09-21 14:59:58 -07:00
Changyu Bi 9d71682d1b Add statistics `COMPACTION_CPU_TOTAL_TIME` for total compaction time (#11741)
Summary:
Existing compaction statistics are `COMPACTION_TIME` and `COMPACTION_CPU_TIME` which are histogram and are logged at the end of a compaction. The new statistics `COMPACTION_CPU_TOTAL_TIME` is for cumulative total compaction time which is updated regularly during a compaction. This allows user to more closely track compaction cpu usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11741

Test Plan: * new unit test `DBTestWithParam.CompactionTotalTimeTest`

Reviewed By: ajkr

Differential Revision: D48608094

Pulled By: cbi42

fbshipit-source-id: b597109f3e4bf2237fb5a216b6fd036e5363b4c0
2023-09-12 15:48:36 -07:00
Changyu Bi 76ed9a3990 Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.

Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686

Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`

Reviewed By: hx235

Differential Revision: D48176132

Pulled By: cbi42

fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
2023-08-09 15:46:44 -07:00
Peter Dillinger e214964f40 Fix a potential memory leak on row_cache insertion failure (#11682)
Summary:
Although the built-in Cache implementations never return failure on Insert without keeping a reference (Handle), a custom implementation could. The code for inserting into row_cache does not keep a reference but does not clean up appropriately on non-OK. This is a fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11682

Test Plan: unit test added that previously fails under ASAN

Reviewed By: ajkr

Differential Revision: D48153831

Pulled By: pdillinger

fbshipit-source-id: 86eb7387915c5b38b6ff5dd8deb4e1e223b7d020
2023-08-08 11:34:41 -07:00
Andrew Kryczka 05c3b8ecac Prepare for specialized interface for row cache (#11620)
Summary:
An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema.

This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620

Reviewed By: pdillinger

Differential Revision: D47536962

Pulled By: ajkr

fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691
2023-07-18 19:12:58 -07:00
Changyu Bi bc04ec85db Make option `level_compaction_dynamic_level_bytes` true by default (#11525)
Summary:
after https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic by RocksDB and requires no manual compaction from user. Making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. Wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size.

The PR mostly contains fixes for unit tests as they assumed `level_compaction_dynamic_level_bytes=false`. Most notable change is commit f742be330c and b1928e42b3 which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps to reduce the change needed for unit tests. I think this default option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desired in unit tests as it makes it easier to create a desired LSM shape.

Comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and change made in https://github.com/facebook/rocksdb/issues/10057.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11525

Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change.

Reviewed By: ajkr

Differential Revision: D46654256

Pulled By: cbi42

fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
2023-06-15 21:12:39 -07:00
mayue.fight fa878a0107 Support to create a CF by importing multiple non-overlapping CFs (#11378)
Summary:
The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317).
Flink uses rocksdb as the state backend,  all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances.
In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one.

This PR is mainly used to quickly merge multiple DBs into one.

We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys.

The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378

Reviewed By: ajkr

Differential Revision: D46413709

Pulled By: cbi42

fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb
2023-06-15 12:25:04 -07:00
Jay Huh 81aeb15988 Add WaitForCompact with WaitForCompactOptions to public API (#11436)
Summary:
Context:

This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions`

1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing.

1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct.
2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h`
3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436

Test Plan:
Following tests added
- `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish`
- `DBCompactionTest::WaitForCompactAbortOnPauseAborted`
- `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted`
- `DBCompactionTest::WaitForCompactShutdownWhileWaiting`
- `TransactionTest::WaitForCompactAbortOnPause`

NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place.

Reviewed By: pdillinger

Differential Revision: D45799659

Pulled By: jaykorean

fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b
2023-05-25 17:25:51 -07:00
Peter Dillinger 4067acabca Compatibility step for separating BlockCache and GeneralCache APIs (#11450)
Summary:
Add two type aliases for Cache: BlockCache and GeneralCache, and add LRUCacheOptions::MakeSharedGeneralCache(). This will ease upgrade to an intended future change to separate the cache API between block cache and other (general) uses, including row cache. Separating the APIs will make it easier to expose more details of block caching for customization. For example, it would be nice to pass the file unique ID and offset as the logical cache key instead of using a Slice, which could facilitate some file-specific customizations in block cache. This would also make it clear that HyperClockCache is not usable as a general cache, because it can only deal with fixed-size block cache keys.

block_cache, row_cache, and blob_cache are the uses of Cache in the public API. blob_cache should be able to use BlockCache while row_cache is a GeneralCache user, as its keys are of arbitrary size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11450

Test Plan: see updated unit test.

Reviewed By: anand1976

Differential Revision: D45882067

Pulled By: pdillinger

fbshipit-source-id: ff5d9f0b644f87ae337a29a7728ce3ed07b2a4b2
2023-05-18 20:40:19 -07:00
mayue.fight 8d8eb0e77e Support Clip DB to KeyRange (#11379)
Summary:
This PR is part of the request https://github.com/facebook/rocksdb/issues/11317.
(Another part is https://github.com/facebook/rocksdb/pull/11378)

ClipDB() will clip the entries in the CF according to the range [begin_key, end_key). All the entries outside this range will be completely deleted (including tombstones).
 This feature is mainly used to ensure that there is no overlapping Key when calling CreateColumnFamilyWithImports() to import multiple CFs.

When Calling ClipDB [begin, end), there are the following steps

1.  Quickly and directly delete files without overlap
 DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr)
2. Delete the Key outside the range
Delete[smallest_key, begin) + Delete[end, largest_key]
3. Delete the tombstone through Manul Compact
CompactRange(option, nullptr, nullptr)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379

Reviewed By: ajkr

Differential Revision: D45840358

Pulled By: cbi42

fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4
2023-05-18 13:25:01 -07:00
Jay Huh 586d78b31e Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:

In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.

In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.

This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.

Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.

Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443

Test Plan:
Existing tests that involve a shutdown in progress:

- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle

Reviewed By: pdillinger

Differential Revision: D45923426

Pulled By: jaykorean

fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-17 18:13:50 -07:00
Peter Dillinger 206fdea3d9 Change internal headers with duplicate names (#11408)
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.

No public API changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D45456696

Pulled By: pdillinger

fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
2023-05-17 11:27:09 -07:00
clundro 50b33ebb1b remove redundant move (#11418)
Summary:
when I use g++-13 to exec the `make all` command,  the output throws the warnings.
```
db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
```

and I also add some `(void)unused_variable` statements because of the cmake argument `-Wunused-but-set-variable -Wunused-but-set-variable`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418

Reviewed By: akankshamahajan15

Differential Revision: D45528223

Pulled By: ajkr

fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
2023-05-03 09:37:21 -07:00
Changyu Bi b3c43a5b99 Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340)
Summary:
When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340

Test Plan:
- New unit tests
- `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.

Reviewed By: ajkr

Differential Revision: D44563884

Pulled By: cbi42

fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
2023-04-06 11:20:43 -07:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Hui Xiao 98d5db5c2e Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
-  File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
    - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number`  ordering check. This will result in file ordering s1 <  s2 <  s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file  is assigned with the minimum `epoch_number` among input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or  by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run 36a5686ec0 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 13:29:37 -08:00
Andrew Kryczka 5cf6ab6f31 Ran clang-format on db/ directory (#10910)
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910

Reviewed By: riversand963

Differential Revision: D40880683

Pulled By: ajkr

fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
2022-11-02 14:34:24 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Yueh-Hsuan Chiang e267909ecf Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348)
Summary:
FIFO compaction can theoretically open a DB with any compaction style.
However, the current code only allows FIFO compaction to open a DB with
a single level.

This PR relaxes the limitation of FIFO compaction and allows it to open a
DB with multiple levels.  Below is the read / write / compaction behavior:

* The read behavior is untouched, and it works like a regular rocksdb instance.
* The write behavior is untouched as well.  When a FIFO compacted DB
is opened with multiple levels, all new files will still be in level 0, and no files
will be moved to a different level.
* Compaction logic is extended.  It will first identify the bottom-most non-empty level.
Then, it will delete the oldest file in that level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348

Test Plan:
Added a new test to verify the migration from level to FIFO where the db has multiple levels.
Extended existing test cases in db_test and db_basic_test to also verify
all entries of a key after reopening the DB with FIFO compaction.

Reviewed By: jay-zhuang

Differential Revision: D40233744

fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
2022-10-18 14:38:13 -07:00
Changyu Bi a6ce1955b1 Fix flaky test ShuttingDownNotBlockStalledWrites (#10800)
Summary:
DBTest::ShuttingDownNotBlockStalledWrites is flaky, added new sync point dependency to fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10800

Test Plan: gtest-parallel --repeat=1000 ./db_test --gtest_filter="*ShuttingDownNotBlockStalledWrites"

Reviewed By: jay-zhuang

Differential Revision: D40239116

Pulled By: jay-zhuang

fbshipit-source-id: 8c2d7e7df58f202d287bd9f5c9b60b7eff270d0c
2022-10-10 13:58:55 -07:00
Qingping Wang a45e6878f3 fix issue 10751 (#10765)
Summary:
Fix https://github.com/facebook/rocksdb/issues/10751 where a stalled write could be blocked forever when DB shutdown.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10765

Reviewed By: ajkr

Differential Revision: D40110069

Pulled By: ajkr

fbshipit-source-id: 598c05777db9be85913a0a85e421b3295ecdff5e
2022-10-10 09:46:09 -07:00
Jay Zhuang edae671ce0 Re-enable SuggestCompactRangeTest and add Universal Compaction test (#10473)
Summary:
The feature `SuggestCompactRange()` is still experimental. Just
re-add the test back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10473

Test Plan: CI

Reviewed By: akankshamahajan15

Differential Revision: D38427153

Pulled By: jay-zhuang

fbshipit-source-id: 0b4491c947cbce6c18ff147b167e3c678633129a
2022-08-05 13:16:58 -07:00
Jay Zhuang fcccc412d7 Remove Travis CI (#10407)
Summary:
Travis CI is depreciated and haven't been maintained for some time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407

Reviewed By: ajkr

Differential Revision: D38078382

Pulled By: jay-zhuang

fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb
2022-07-22 20:16:45 -07:00
Yu Zhao 00540916 bfc737da21 fix typos in some code and comment (#10139)
Summary:
Minor issue, I just found a few typos on db_test and column_family while reading the code. And I have this PR opened to contribute.  :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10139

Reviewed By: ajkr

Differential Revision: D38007098

Pulled By: jay-zhuang

fbshipit-source-id: 511947b32424c34348184691216640f32c410fb1
2022-07-22 19:25:52 -07:00
Andrew Kryczka 25cc564ff7 Make RateLimiter not Customizable (#10378)
Summary:
(PR created for informational/testing purposes only.)

- Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()`
- Benefit over #10374 is eliminating race conditions with Configurable framework.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378

Reviewed By: pdillinger

Differential Revision: D37914865

fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30
2022-07-18 14:48:42 -07:00
Levi Tamasi c73d2a9d18 Add API for writing wide-column entities (#10242)
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37369748

Pulled By: ltamasi

fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
2022-06-25 15:30:47 -07:00
Andrew Kryczka 5d6005c780 Add WriteOptions::protection_bytes_per_key (#10037)
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.

Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.

There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037

Test Plan:
- Manual
  - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
  - Enabled in MyShadow for 1+ week
- Automated
  - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
  - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection

Reviewed By: cbi42

Differential Revision: D36614569

Pulled By: ajkr

fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
2022-06-16 23:10:07 -07:00
mpoeter 77f4799515 Fix potential leak when reusing PinnableSlice instances. (#10166)
Summary:
`PinnableSlice` may hold a handle to a cache value which must be released to correctly decrement the ref-counter. However, when `PinnableSlice` variables are reused, e.g. like this:
```
PinnableSlice pin_slice;
db.Get("foo", &pin_slice);
db.Get("foo", &pin_slice);
```
then the second `Get` simply overwrites the old value in `pin_slice` and the handle returned by the first `Get` is _not_ released.

This PR adds `Reset` calls to the `Get`/`MultiGet` calls that accept `PinnableSlice` arguments to ensure proper cleanup of old values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10166

Reviewed By: hx235

Differential Revision: D37151632

Pulled By: ajkr

fbshipit-source-id: 9dd3c3288300f560531b843f67db11aeb569a9ff
2022-06-14 21:29:52 -07:00
Yanqin Jin 1777e5f7e9 Snapshots with user-specified timestamps (#9879)
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.

It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.

This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.

In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.

This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```

If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.

Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();

// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
    std::shared_ptr<TransactionNotifier> notifier,
    TxnTimestamp ts,
    std::shared_ptr<const Snapshot>* ret);

// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```

The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```

We also provide two additional APIs for stats collection and reporting purposes.

```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
    TxnTimestamp ts_lb,
    TxnTimestamp ts_ub,
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```

To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```

Before shutdown, RocksDB will release all timestamped snapshots.

Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.

Timestamped snapshots can also be used with user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879

Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D35783919

Pulled By: riversand963

fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
2022-06-10 16:07:03 -07:00
Peter Dillinger d3a3b02134 Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128)
Summary:
When opening an SST file created using index_type=kHashSearch,
the *current* prefix_extractor would be saved, and used with hash index
if the *new current* prefix_extractor at query time is compatible with
the SST file. This is a problem if the prefix_extractor at SST open time
is not compatible but SetOptions later changes (back) to one that is
compatible.

This change fixes that by using the known compatible (or missing) prefix
extractor we save for use with prefix filtering. Detail: I have moved the
InternalKeySliceTransform wrapper to avoid some indirection and remove
unnecessary fields.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128

Test Plan:
expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails
before fix and probably covers some other previously uncovered cases.

Reviewed By: siying

Differential Revision: D36955738

Pulled By: pdillinger

fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10
2022-06-10 08:51:45 -07:00
Akanksha Mahajan 700d597bd8 Expose unix time in rocksdb::Snapshot (#9923)
Summary:
RocksDB snapshot already has a member unix_time_ set after
snapshot is taken. It is now exposed through GetSnapshotTime() API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9923

Test Plan: Update unit tests

Reviewed By: riversand963

Differential Revision: D36048275

Pulled By: akankshamahajan15

fbshipit-source-id: 825210ec287deb0bc3aaa9b8e1f079f07ad686fa
2022-05-23 22:31:08 -07:00