Go to file
Peter Dillinger dd23e84cad Re-implement GetApproximateMemTableStats for skip lists (#13047)
Summary:
GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations;   11.7 MB/s
approximatememtablestats :       3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 2344.1611  StdDev: 26587.27
Min: 0  Median: 965.8555  Max: 835273
Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97
------------------------------------------------------
[       0,       1 ]   131344  13.134%  13.134% ###
(       1,       2 ]      115   0.011%  13.146%
(       2,       3 ]      106   0.011%  13.157%
(       3,       4 ]      190   0.019%  13.176%
(       4,       6 ]      214   0.021%  13.197%
(       6,      10 ]      522   0.052%  13.249%
(      10,      15 ]      748   0.075%  13.324%
(      15,      22 ]     1002   0.100%  13.424%
(      22,      34 ]     1948   0.195%  13.619%
(      34,      51 ]     3067   0.307%  13.926%
(      51,      76 ]     4213   0.421%  14.347%
(      76,     110 ]     5721   0.572%  14.919%
(     110,     170 ]    11375   1.137%  16.056%
(     170,     250 ]    17928   1.793%  17.849%
(     250,     380 ]    36597   3.660%  21.509% #
(     380,     580 ]    77882   7.788%  29.297% ##
(     580,     870 ]   160193  16.019%  45.317% ###
(     870,    1300 ]   210098  21.010%  66.326% ####
(    1300,    1900 ]   167461  16.746%  83.072% ###
(    1900,    2900 ]    78678   7.868%  90.940% ##
(    2900,    4400 ]    47743   4.774%  95.715% #
(    4400,    6600 ]    17650   1.765%  97.480%
(    6600,    9900 ]    11895   1.190%  98.669%
(    9900,   14000 ]     4993   0.499%  99.168%
(   14000,   22000 ]     2384   0.238%  99.407%
(   22000,   33000 ]     1966   0.197%  99.603%
(   50000,   75000 ]     2968   0.297%  99.900%
(  570000,  860000 ]      999   0.100% 100.000%

readrandom   :       1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations;    8.2 MB/s (1000000 of 1000000 found)
```

Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast.

I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from true count is around 20% or less, and roughly the CPU cost of two memtable point look-ups. See code comments for detail.

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations;   11.0 MB/s
approximatememtablestats :       2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 1073.5158  StdDev: 197.80
Min: 608  Median: 1079.9506  Max: 2176
Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00
------------------------------------------------------
(     580,     870 ]   134848  13.485%  13.485% ###
(     870,    1300 ]   747868  74.787%  88.272% ###############
(    1300,    1900 ]   116536  11.654%  99.925% ##
(    1900,    2900 ]      748   0.075% 100.000%

readrandom   :       1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations;    8.1 MB/s (1000000 of 1000000 found)
```

We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look how this behavior generalizes, first *much* larger range:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000
filluniquerandom :       1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations;   11.7 MB/s
approximatememtablestats :       1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations;
Reported entry count stats (expected 30000):
Count: 1000000 Average: 31098.8795  StdDev: 3601.47
Min: 21504  Median: 29333.9303  Max: 43008
Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00
------------------------------------------------------
(   14000,   22000 ]      408   0.041%   0.041%
(   22000,   33000 ]   749327  74.933%  74.974% ###############
(   33000,   50000 ]   250265  25.027% 100.000% #####

readrandom   :       1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations;    8.5 MB/s (989989 of 1000000 found)
```

This is *even faster* and relatively *more accurate*, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences:
* When actual number in range is >= 40, the minimum return value is 40.
* When the actual is <= 10, it is guaranteed to return that actual number.
```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75
...
filluniquerandom :       1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations;   11.4 MB/s
approximatememtablestats :       3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations;
Reported entry count stats (expected 75):
Count: 1000000 Average: 75.1210  StdDev: 15.02
Min: 40  Median: 71.9395  Max: 256
Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78
------------------------------------------------------
(      34,      51 ]    38867   3.887%   3.887% #
(      51,      76 ]   550554  55.055%  58.942% ###########
(      76,     110 ]   398854  39.885%  98.828% ########
(     110,     170 ]    11353   1.135%  99.963%
(     170,     250 ]      364   0.036%  99.999%
(     250,     380 ]        8   0.001% 100.000%

readrandom   :       1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations;    8.7 MB/s (999974 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25
...
filluniquerandom :       1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations;   10.8 MB/s
approximatememtablestats :       5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations;
Reported entry count stats (expected 25):
Count: 1000000 Average: 26.2392  StdDev: 4.58
Min: 25  Median: 28.4590  Max: 72
Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00
------------------------------------------------------
(      22,      34 ]   928936  92.894%  92.894% ###################
(      34,      51 ]    67960   6.796%  99.690% #
(      51,      76 ]     3104   0.310% 100.000%

readrandom   :       1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations;    8.6 MB/s (1000000 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10
...
filluniquerandom :       1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations;    9.9 MB/s
approximatememtablestats :       3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations;
Reported entry count stats (expected 10):
Count: 1000000 Average: 10.0000  StdDev: 0.00
Min: 10  Median: 10.0000  Max: 10
Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00
------------------------------------------------------
(       6,      10 ]  1000000 100.000% 100.000% ####################

readrandom   :       1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
```

Remarkably consistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047

Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated.

Reviewed By: cbi42

Differential Revision: D63722003

Pulled By: pdillinger

fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26
2024-10-02 14:25:50 -07:00
.circleci Enable io_uring in stress test (#12313) 2024-01-31 12:37:42 -08:00
.github More valgrind fixes (#12990) 2024-09-06 10:11:34 -07:00
buckifier Remove last user of AutoHeaders.RECURSIVE_GLOB 2024-09-17 13:21:57 -07:00
build_tools Fix folly build (#12795) 2024-06-22 15:15:02 -07:00
cache More accurate accounting of compressed cache memory (#13032) 2024-09-25 17:47:40 -07:00
cmake Fix zstd typo in cmake (#12309) 2024-02-22 14:39:05 -08:00
coverage Remove platform009 and default to platform010 (#11333) 2023-03-30 09:56:37 -07:00
db Re-implement GetApproximateMemTableStats for skip lists (#13047) 2024-10-02 14:25:50 -07:00
db_stress_tool Re-implement GetApproximateMemTableStats for skip lists (#13047) 2024-10-02 14:25:50 -07:00
docs Java FFI blog post - Post-publication issues with images (2) (#12372) 2024-02-22 15:01:55 -08:00
env Add missing RemapFileSystem::ReopenWritableFile (#12941) 2024-09-17 13:08:25 -07:00
examples Prefer static_cast in place of most reinterpret_cast (#12308) 2024-02-07 10:44:11 -08:00
file Fix orphaned files in SstFileManager (#13015) 2024-09-18 13:27:44 -07:00
fuzz Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
include/rocksdb Add comment for memory usage in BeginTransaction() and WriteBatch::Clear() (#13042) 2024-09-30 10:27:45 -07:00
java Steps toward deprecating implicit prefix seek, related fixes (#13026) 2024-09-20 15:54:19 -07:00
logging Fix data race in AutoRollLogger (#12436) 2024-03-14 14:28:33 -07:00
memory Set optimize_filters_for_memory by default (#12377) 2024-04-30 08:33:31 -07:00
memtable Re-implement GetApproximateMemTableStats for skip lists (#13047) 2024-10-02 14:25:50 -07:00
microbench internal_repo_rocksdb (-8794174668376270091) (#12114) 2023-12-01 11:10:30 -08:00
monitoring Add ticker stats for read corruption retries (#12923) 2024-08-12 15:32:07 -07:00
options Bug fix and test BuildDBOptions (#13038) 2024-09-26 14:36:29 -07:00
plugin Add initial CMake support to plugin (#9214) 2021-11-30 17:16:53 -08:00
port Fix CondVar::TimedWait for Windows (#12815) 2024-07-08 21:38:21 -07:00
table Steps toward deprecating implicit prefix seek, related fixes (#13026) 2024-09-20 15:54:19 -07:00
test_util Steps toward making IDENTITY file obsolete (#13019) 2024-09-19 14:05:21 -07:00
third-party fix optimization-disabled test builds with platform010 (#11361) 2023-04-10 13:59:44 -07:00
tools Re-implement GetApproximateMemTableStats for skip lists (#13047) 2024-10-02 14:25:50 -07:00
trace_replay Remove 'virtual' when implied by 'override' (#12319) 2024-01-31 13:14:42 -08:00
unreleased_history Re-implement GetApproximateMemTableStats for skip lists (#13047) 2024-10-02 14:25:50 -07:00
util More info in CompactionServiceJobInfo and CompactionJobStats (#13029) 2024-09-25 10:26:15 -07:00
utilities Fix non-ASCII character (#12972) 2024-09-03 14:41:55 -07:00
.clang-format A script that automatically reformat affected lines 2014-01-14 12:21:24 -08:00
.gitignore add gtags files ignore (#12747) 2024-06-12 21:46:40 -07:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.watchmanconfig Added .watchmanconfig file to rocksdb repo (#5593) 2019-07-19 15:00:33 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Fix folly build (#12795) 2024-06-22 15:15:02 -07:00
CODE_OF_CONDUCT.md Adopt Contributor Covenant 2019-08-29 23:21:01 -07:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Update HISTORY.md, version.h, and the format compatibility check script for the 9.7 release (#13027) 2024-09-20 19:19:06 -07:00
INSTALL.md fix out of date macos instructions in INSTALL.md (#12393) 2024-02-28 12:38:15 -08:00
LANGUAGE-BINDINGS.md Add grocksdb in Go language bindings (#10498) 2022-08-23 15:02:10 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Update folly Github hash (#13017) 2024-09-17 17:47:10 -07:00
PLUGINS.md Add encfs plugin link (#12070) 2023-11-14 07:33:21 -08:00
README.md Remove deprecated integration tests from README.md (#11354) 2023-04-07 16:52:50 -07:00
TARGETS Remove last user of AutoHeaders.RECURSIVE_GLOB 2024-09-17 13:21:57 -07:00
USERS.md Add Qdrant to USERS.md (#12072) 2023-11-16 10:35:08 -08:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md Update branch name in WINDOWS_PORT.md (#8745) 2021-09-01 19:26:39 -07:00
common.mk Clean up variables for temporary directory (#9961) 2022-05-06 16:38:06 -07:00
crash_test.mk Stress/Crash Test for OptimisticTransactionDB (#11513) 2023-06-17 16:27:37 -07:00
issue_template.md Add Google Group to Issue Template 2020-01-28 14:40:37 -08:00
rocksdb.pc.in build: fix pkg-config file generation (#9953) 2022-05-30 12:46:40 -07:00
src.mk Fix folly build (#12795) 2024-06-22 15:15:02 -07:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

CircleCI Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.