rocksdb

Peter Dillinger dd23e84cad Re-implement GetApproximateMemTableStats for skip lists (#13047)

Summary: GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom : 1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations; 11.7 MB/s
approximatememtablestats : 3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 2344.1611 StdDev: 26587.27
Min: 0 Median: 965.8555 Max: 835273
Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97
------------------------------------------------------
[ 0, 1 ] 131344 13.134% 13.134% ###
( 1, 2 ] 115 0.011% 13.146%
( 2, 3 ] 106 0.011% 13.157%
( 3, 4 ] 190 0.019% 13.176%
( 4, 6 ] 214 0.021% 13.197%
( 6, 10 ] 522 0.052% 13.249%
( 10, 15 ] 748 0.075% 13.324%
( 15, 22 ] 1002 0.100% 13.424%
( 22, 34 ] 1948 0.195% 13.619%
( 34, 51 ] 3067 0.307% 13.926%
( 51, 76 ] 4213 0.421% 14.347%
( 76, 110 ] 5721 0.572% 14.919%
( 110, 170 ] 11375 1.137% 16.056%
( 170, 250 ] 17928 1.793% 17.849%
( 250, 380 ] 36597 3.660% 21.509% #
( 380, 580 ] 77882 7.788% 29.297% ##
( 580, 870 ] 160193 16.019% 45.317% ###
( 870, 1300 ] 210098 21.010% 66.326% ####
( 1300, 1900 ] 167461 16.746% 83.072% ###
( 1900, 2900 ] 78678 7.868% 90.940% ##
( 2900, 4400 ] 47743 4.774% 95.715% #
( 4400, 6600 ] 17650 1.765% 97.480%
( 6600, 9900 ] 11895 1.190% 98.669%
( 9900, 14000 ] 4993 0.499% 99.168%
( 14000, 22000 ] 2384 0.238% 99.407%
( 22000, 33000 ] 1966 0.197% 99.603%
( 50000, 75000 ] 2968 0.297% 99.900%
( 570000, 860000 ] 999 0.100% 100.000%
readrandom : 1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations; 8.2 MB/s (1000000 of 1000000 found)
```

Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast. I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from true count is around 20% or less, and roughly the CPU cost of two memtable point look-ups. See code comments for detail.

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom : 1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations; 11.0 MB/s
approximatememtablestats : 2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 1073.5158 StdDev: 197.80
Min: 608 Median: 1079.9506 Max: 2176
Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00
------------------------------------------------------
( 580, 870 ] 134848 13.485% 13.485% ###
( 870, 1300 ] 747868 74.787% 88.272% ###############
( 1300, 1900 ] 116536 11.654% 99.925% ##
( 1900, 2900 ] 748 0.075% 100.000%
readrandom : 1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations; 8.1 MB/s (1000000 of 1000000 found)
```

We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters.

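As a quick check of the "around 20%" figure, here is a minimal sketch that divides the reported StdDev of each of the two runs above by the expected count of 1000. This is only one rough reading of "relative standard deviation": db_bench reports StdDev around the sample average, not around the true count, and the helper names below are illustrative rather than anything from RocksDB.

```python
# Figures copied from the two db_bench runs above (batch_size=1000).
EXPECTED_COUNT = 1000

REPORTED = {
    "old implementation": {"average": 2344.1611, "stddev": 26587.27},
    "new implementation": {"average": 1073.5158, "stddev": 197.80},
}


def relative_stddev(stddev: float, reference: float) -> float:
    """Return stddev as a fraction of a reference value."""
    return stddev / reference


def summarize(expected: int = EXPECTED_COUNT) -> dict:
    """Map each run to (stddev / expected, average / expected)."""
    return {
        name: (relative_stddev(s["stddev"], expected), s["average"] / expected)
        for name, s in REPORTED.items()
    }


if __name__ == "__main__":
    for name, (rsd, bias) in summarize().items():
        print(f"{name}: stddev/expected = {rsd:.1%}, average/expected = {bias:.2f}")
```

Under this reading the new implementation lands just under 20%, while the old one has a spread many times larger than the expected count itself.
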
Let's look how this behavior generalizes, first much larger range:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000
filluniquerandom : 1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations; 11.7 MB/s
approximatememtablestats : 1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations;
Reported entry count stats (expected 30000):
Count: 1000000 Average: 31098.8795 StdDev: 3601.47
Min: 21504 Median: 29333.9303 Max: 43008
Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00
------------------------------------------------------
( 14000, 22000 ] 408 0.041% 0.041%
( 22000, 33000 ] 749327 74.933% 74.974% ###############
( 33000, 50000 ] 250265 25.027% 100.000% #####
readrandom : 1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations; 8.5 MB/s (989989 of 1000000 found)
```

This is even faster and relatively more accurate, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences:
* When actual number in range is >= 40, the minimum return value is 40.
* When the actual is <= 10, it is guaranteed to return that actual number.

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75
...
filluniquerandom : 1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations; 11.4 MB/s
approximatememtablestats : 3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations;
Reported entry count stats (expected 75):
Count: 1000000 Average: 75.1210 StdDev: 15.02
Min: 40 Median: 71.9395 Max: 256
Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78
------------------------------------------------------
( 34, 51 ] 38867 3.887% 3.887% #
( 51, 76 ] 550554 55.055% 58.942% ###########
( 76, 110 ] 398854 39.885% 98.828% ########
( 110, 170 ] 11353 1.135% 99.963%
( 170, 250 ] 364 0.036% 99.999%
( 250, 380 ] 8 0.001% 100.000%
readrandom : 1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations; 8.7 MB/s (999974 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25
...
filluniquerandom : 1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations; 10.8 MB/s
approximatememtablestats : 5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations;
Reported entry count stats (expected 25):
Count: 1000000 Average: 26.2392 StdDev: 4.58
Min: 25 Median: 28.4590 Max: 72
Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00
------------------------------------------------------
( 22, 34 ] 928936 92.894% 92.894% ###################
( 34, 51 ] 67960 6.796% 99.690% #
( 51, 76 ] 3104 0.310% 100.000%
readrandom : 1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations; 8.6 MB/s (1000000 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10
...
filluniquerandom : 1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations; 9.9 MB/s
approximatememtablestats : 3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations;
Reported entry count stats (expected 10):
Count: 1000000 Average: 10.0000 StdDev: 0.00
Min: 10 Median: 10.0000 Max: 10
Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00
------------------------------------------------------
( 6, 10 ] 1000000 100.000% 100.000% ####################
readrandom : 1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found)
```

Remarkably consistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047

Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated.

Reviewed By: cbi42

Differential Revision: D63722003

Pulled By: pdillinger

fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26

2024-10-02 14:25:50 -07:00

.circleci	Enable io_uring in stress test (#12313)	2024-01-31 12:37:42 -08:00
.github	More valgrind fixes (#12990)	2024-09-06 10:11:34 -07:00
buckifier	Remove last user of AutoHeaders.RECURSIVE_GLOB	2024-09-17 13:21:57 -07:00
build_tools	Fix folly build (#12795)	2024-06-22 15:15:02 -07:00
cache	More accurate accounting of compressed cache memory (#13032)	2024-09-25 17:47:40 -07:00
cmake	Fix zstd typo in cmake (#12309)	2024-02-22 14:39:05 -08:00
coverage	Remove platform009 and default to platform010 (#11333)	2023-03-30 09:56:37 -07:00
db	Re-implement GetApproximateMemTableStats for skip lists (#13047)	2024-10-02 14:25:50 -07:00
db_stress_tool	Re-implement GetApproximateMemTableStats for skip lists (#13047)	2024-10-02 14:25:50 -07:00
docs	Java FFI blog post - Post-publication issues with images (2) (#12372)	2024-02-22 15:01:55 -08:00
env	Add missing RemapFileSystem::ReopenWritableFile (#12941)	2024-09-17 13:08:25 -07:00
examples	Prefer static_cast in place of most reinterpret_cast (#12308)	2024-02-07 10:44:11 -08:00
file	Fix orphaned files in SstFileManager (#13015)	2024-09-18 13:27:44 -07:00
fuzz	Block per key-value checksum (#11287)	2023-04-25 12:08:23 -07:00
include/rocksdb	Add comment for memory usage in BeginTransaction() and WriteBatch::Clear() (#13042)	2024-09-30 10:27:45 -07:00
java	Steps toward deprecating implicit prefix seek, related fixes (#13026)	2024-09-20 15:54:19 -07:00
logging	Fix data race in AutoRollLogger (#12436)	2024-03-14 14:28:33 -07:00
memory	Set optimize_filters_for_memory by default (#12377)	2024-04-30 08:33:31 -07:00
memtable	Re-implement GetApproximateMemTableStats for skip lists (#13047)	2024-10-02 14:25:50 -07:00
microbench	internal_repo_rocksdb (-8794174668376270091) (#12114)	2023-12-01 11:10:30 -08:00
monitoring	Add ticker stats for read corruption retries (#12923)	2024-08-12 15:32:07 -07:00
options	Bug fix and test BuildDBOptions (#13038)	2024-09-26 14:36:29 -07:00
plugin	Add initial CMake support to plugin (#9214)	2021-11-30 17:16:53 -08:00
port	Fix CondVar::TimedWait for Windows (#12815)	2024-07-08 21:38:21 -07:00
table	Steps toward deprecating implicit prefix seek, related fixes (#13026)	2024-09-20 15:54:19 -07:00
test_util	Steps toward making IDENTITY file obsolete (#13019)	2024-09-19 14:05:21 -07:00
third-party	fix optimization-disabled test builds with platform010 (#11361)	2023-04-10 13:59:44 -07:00
tools	Re-implement GetApproximateMemTableStats for skip lists (#13047)	2024-10-02 14:25:50 -07:00
trace_replay	Remove 'virtual' when implied by 'override' (#12319)	2024-01-31 13:14:42 -08:00
unreleased_history	Re-implement GetApproximateMemTableStats for skip lists (#13047)	2024-10-02 14:25:50 -07:00
util	More info in CompactionServiceJobInfo and CompactionJobStats (#13029)	2024-09-25 10:26:15 -07:00
utilities	Fix non-ASCII character (#12972)	2024-09-03 14:41:55 -07:00
.clang-format	A script that automatically reformat affected lines	2014-01-14 12:21:24 -08:00
.gitignore	add gtags files ignore (#12747)	2024-06-12 21:46:40 -07:00
.lgtm.yml	Create lgtm.yml for LGTM.com C/C++ analysis (#4058)	2018-06-26 12:43:04 -07:00
.watchmanconfig	Added .watchmanconfig file to rocksdb repo (#5593)	2019-07-19 15:00:33 -07:00
AUTHORS	Update RocksDB Authors File	2017-10-18 14:42:10 -07:00
CMakeLists.txt	Fix folly build (#12795)	2024-06-22 15:15:02 -07:00
CODE_OF_CONDUCT.md	Adopt Contributor Covenant	2019-08-29 23:21:01 -07:00
CONTRIBUTING.md	Add Code of Conduct	2017-12-05 18:42:35 -08:00
COPYING	Add GPLv2 as an alternative license.	2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md	Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363)	2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md	First version of rocksdb_dump and rocksdb_undump.	2015-06-19 16:24:36 -07:00
HISTORY.md	Update HISTORY.md, version.h, and the format compatibility check script for the 9.7 release (#13027)	2024-09-20 19:19:06 -07:00
INSTALL.md	fix out of date macos instructions in INSTALL.md (#12393)	2024-02-28 12:38:15 -08:00
LANGUAGE-BINDINGS.md	Add grocksdb in Go language bindings (#10498)	2022-08-23 15:02:10 -07:00
LICENSE.Apache	Change RocksDB License	2017-07-15 16:11:23 -07:00
LICENSE.leveldb	Add back the LevelDB license file	2017-07-16 18:42:18 -07:00
Makefile	Update folly Github hash (#13017)	2024-09-17 17:47:10 -07:00
PLUGINS.md	Add encfs plugin link (#12070)	2023-11-14 07:33:21 -08:00
README.md	Remove deprecated integration tests from README.md (#11354)	2023-04-07 16:52:50 -07:00
TARGETS	Remove last user of AutoHeaders.RECURSIVE_GLOB	2024-09-17 13:21:57 -07:00
USERS.md	Add Qdrant to USERS.md (#12072)	2023-11-16 10:35:08 -08:00
Vagrantfile	Adding CentOS 7 Vagrantfile & build script	2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md	Update branch name in WINDOWS_PORT.md (#8745)	2021-09-01 19:26:39 -07:00
common.mk	Clean up variables for temporary directory (#9961)	2022-05-06 16:38:06 -07:00
crash_test.mk	Stress/Crash Test for OptimisticTransactionDB (#11513)	2023-06-17 16:27:37 -07:00
issue_template.md	Add Google Group to Issue Template	2020-01-28 14:40:37 -08:00
rocksdb.pc.in	build: fix pkg-config file generation (#9953)	2022-05-30 12:46:40 -07:00
src.mk	Fix folly build (#12795)	2024-06-22 15:15:02 -07:00
thirdparty.inc	Fix build jemalloc api (#5470)	2019-06-24 17:40:32 -07:00

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.
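To make the three amplification factors concrete, here is a small sketch using common byte-ratio definitions. The definitions are the usual ones from LSM tuning discussions rather than something this README spells out (read amplification in particular is sometimes counted in I/O operations per query instead), and every number and name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    """Hypothetical byte counters for one observation window."""
    user_bytes_written: int      # bytes the application asked to write
    storage_bytes_written: int   # bytes written to storage (WAL, flushes, compactions)
    user_bytes_read: int         # bytes returned to the application
    storage_bytes_read: int      # bytes read from storage to serve those reads
    logical_data_bytes: int      # size of the live user data
    on_disk_bytes: int           # size of the database files on disk


def write_amplification(s: WorkloadSample) -> float:
    return s.storage_bytes_written / s.user_bytes_written


def read_amplification(s: WorkloadSample) -> float:
    return s.storage_bytes_read / s.user_bytes_read


def space_amplification(s: WorkloadSample) -> float:
    return s.on_disk_bytes / s.logical_data_bytes


GIB = 1 << 30
# Made-up sample values, chosen only to illustrate the ratios.
sample = WorkloadSample(
    user_bytes_written=10 * GIB,
    storage_bytes_written=120 * GIB,
    user_bytes_read=4 * GIB,
    storage_bytes_read=20 * GIB,
    logical_data_bytes=100 * GIB,
    on_disk_bytes=115 * GIB,
)

if __name__ == "__main__":
    print(f"WAF = {write_amplification(sample):.2f}")
    print(f"RAF = {read_amplification(sample):.2f}")
    print(f"SAF = {space_amplification(sample):.2f}")
```

Lowering one of these ratios typically raises another, which is the tradeoff the paragraph above refers to.
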

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.