rocksdb

Peter Dillinger dd23e84cad Re-implement GetApproximateMemTableStats for skip lists (#13047 )

Summary: GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom : 1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations; 11.7 MB/s
approximatememtablestats : 3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations;

Reported entry count stats (expected 1000):
Count: 1000000 Average: 2344.1611 StdDev: 26587.27
Min: 0 Median: 965.8555 Max: 835273
Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97
------------------------------------------------------
[      0,      1 ] 131344  13.134%  13.134% ###
(      1,      2 ]    115   0.011%  13.146%
(      2,      3 ]    106   0.011%  13.157%
(      3,      4 ]    190   0.019%  13.176%
(      4,      6 ]    214   0.021%  13.197%
(      6,     10 ]    522   0.052%  13.249%
(     10,     15 ]    748   0.075%  13.324%
(     15,     22 ]   1002   0.100%  13.424%
(     22,     34 ]   1948   0.195%  13.619%
(     34,     51 ]   3067   0.307%  13.926%
(     51,     76 ]   4213   0.421%  14.347%
(     76,    110 ]   5721   0.572%  14.919%
(    110,    170 ]  11375   1.137%  16.056%
(    170,    250 ]  17928   1.793%  17.849%
(    250,    380 ]  36597   3.660%  21.509% #
(    380,    580 ]  77882   7.788%  29.297% ##
(    580,    870 ] 160193  16.019%  45.317% ###
(    870,   1300 ] 210098  21.010%  66.326% ####
(   1300,   1900 ] 167461  16.746%  83.072% ###
(   1900,   2900 ]  78678   7.868%  90.940% ##
(   2900,   4400 ]  47743   4.774%  95.715% #
(   4400,   6600 ]  17650   1.765%  97.480%
(   6600,   9900 ]  11895   1.190%  98.669%
(   9900,  14000 ]   4993   0.499%  99.168%
(  14000,  22000 ]   2384   0.238%  99.407%
(  22000,  33000 ]   1966   0.197%  99.603%
(  50000,  75000 ]   2968   0.297%  99.900%
( 570000, 860000 ]    999   0.100% 100.000%

readrandom : 1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations; 8.2 MB/s (1000000 of 1000000 found)
```

Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast.

I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The relative standard deviation from the true count is around 20% or less, and the CPU cost is roughly that of two memtable point look-ups. See code comments for detail.

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom : 1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations; 11.0 MB/s
approximatememtablestats : 2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations;

Reported entry count stats (expected 1000):
Count: 1000000 Average: 1073.5158 StdDev: 197.80
Min: 608 Median: 1079.9506 Max: 2176
Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00
------------------------------------------------------
(    580,    870 ] 134848  13.485%  13.485% ###
(    870,   1300 ] 747868  74.787%  88.272% ###############
(   1300,   1900 ] 116536  11.654%  99.925% ##
(   1900,   2900 ]    748   0.075% 100.000%

readrandom : 1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations; 8.1 MB/s (1000000 of 1000000 found)
```

We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters.

Let's look at how this behavior generalizes, starting with a much larger range:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000
filluniquerandom : 1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations; 11.7 MB/s
approximatememtablestats : 1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations;

Reported entry count stats (expected 30000):
Count: 1000000 Average: 31098.8795 StdDev: 3601.47
Min: 21504 Median: 29333.9303 Max: 43008
Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00
------------------------------------------------------
(  14000,  22000 ]    408   0.041%   0.041%
(  22000,  33000 ] 749327  74.933%  74.974% ###############
(  33000,  50000 ] 250265  25.027% 100.000% #####

readrandom : 1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations; 8.5 MB/s (989989 of 1000000 found)
```

This is even faster and relatively more accurate, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences:
* When the actual number in range is >= 40, the minimum return value is 40.
* When the actual number is <= 10, the function is guaranteed to return that actual number.

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75
...
filluniquerandom : 1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations; 11.4 MB/s
approximatememtablestats : 3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations;

Reported entry count stats (expected 75):
Count: 1000000 Average: 75.1210 StdDev: 15.02
Min: 40 Median: 71.9395 Max: 256
Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78
------------------------------------------------------
(     34,     51 ]  38867   3.887%   3.887% #
(     51,     76 ] 550554  55.055%  58.942% ###########
(     76,    110 ] 398854  39.885%  98.828% ########
(    110,    170 ]  11353   1.135%  99.963%
(    170,    250 ]    364   0.036%  99.999%
(    250,    380 ]      8   0.001% 100.000%

readrandom : 1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations; 8.7 MB/s (999974 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25
...
filluniquerandom : 1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations; 10.8 MB/s
approximatememtablestats : 5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations;

Reported entry count stats (expected 25):
Count: 1000000 Average: 26.2392 StdDev: 4.58
Min: 25 Median: 28.4590 Max: 72
Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00
------------------------------------------------------
(     22,     34 ] 928936  92.894%  92.894% ###################
(     34,     51 ]  67960   6.796%  99.690% #
(     51,     76 ]   3104   0.310% 100.000%

readrandom : 1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations; 8.6 MB/s (1000000 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10
...
filluniquerandom : 1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations; 9.9 MB/s
approximatememtablestats : 3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations;

Reported entry count stats (expected 10):
Count: 1000000 Average: 10.0000 StdDev: 0.00
Min: 10 Median: 10.0000 Max: 10
Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00
------------------------------------------------------
(      6,     10 ] 1000000 100.000% 100.000% ####################

readrandom : 1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found)
```

Remarkably consistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047

Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated.

Reviewed By: cbi42

Differential Revision: D63722003

Pulled By: pdillinger

fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26

2024-10-02 14:25:50 -07:00
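The "relative standard deviation" figures quoted in the commit message can be checked by hand from the Average and StdDev values in the db_bench output. The following is a minimal sketch, not part of the commit: the helper names `relative_stddev` and `relative_bias` are hypothetical, and the computation assumes the reported StdDev is measured around the sample average, so it only approximates the deviation from the true count.

```python
# (Average, StdDev) pairs copied from the new-implementation db_bench runs
# quoted in the commit message, keyed by the expected entry count
# (the --batch_size used for each run).
reported = {
    1000: (1073.5158, 197.80),
    30000: (31098.8795, 3601.47),
    75: (75.1210, 15.02),
    25: (26.2392, 4.58),
    10: (10.0000, 0.00),
}


def relative_stddev(average, stddev):
    """Standard deviation as a fraction of the average (hypothetical helper)."""
    return stddev / average


def relative_bias(expected, average):
    """How far the average estimate sits from the expected count, as a fraction."""
    return (average - expected) / expected


if __name__ == "__main__":
    for expected, (average, stddev) in sorted(reported.items()):
        rsd = relative_stddev(average, stddev)
        bias = relative_bias(expected, average)
        print(f"expected={expected:>6}  rel. stddev={rsd:6.1%}  bias={bias:+6.1%}")
```

Under these assumptions, every run stays at or below roughly 20% relative standard deviation, and the 30000-entry run comes out nearer 12%, which matches the "around 20% or less" and "closer to 10%" statements in the message.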
..
advisor	Fix lint issues after enable BLACK (#10717 )	2022-09-21 13:37:51 -07:00
block_cache_analyzer	Block cache analyzer: Calculate miss ratio for each caller (#10823 )	2024-01-10 14:02:14 -08:00
dump	internal_repo_rocksdb (435146444452818992) (#12115 )	2023-12-01 11:15:17 -08:00
CMakeLists.txt	…
Dockerfile	…
analyze_txn_stress_test.sh	…
auto_sanity_test.sh	…
backup_db.sh	Revamp check_format_compatible.sh (#8012 )	2021-03-02 11:42:27 -08:00
benchmark.sh	optimize file size statistics in benchmark script (#12363 )	2024-02-21 15:45:18 -08:00
benchmark_ci.py	Remove NUMA setting for benchmark-linux (#11180 )	2023-02-02 15:15:09 -08:00
benchmark_compare.sh	Fix file modes (#10815 )	2022-10-13 09:00:37 -07:00
benchmark_leveldb.sh	…
blob_dump.cc	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
check_all_python.py	Enable BLACK for internal_repo_rocksdb (#10710 )	2022-09-20 17:47:52 -07:00
check_format_compatible.sh	Update HISTORY.md, version.h, and the format compatibility check script for the 9.7 release (#13027 )	2024-09-20 19:19:06 -07:00
db_bench.cc	Add (& fix) some simple source code checks (#8821 )	2021-09-07 21:19:27 -07:00
db_bench_tool.cc	Re-implement GetApproximateMemTableStats for skip lists (#13047 )	2024-10-02 14:25:50 -07:00
db_bench_tool_test.cc	Group SST write in flush, compaction and db open with new stats (#11910 )	2023-12-29 15:29:23 -08:00
db_crashtest.py	Steps toward making IDENTITY file obsolete (#13019 )	2024-09-19 14:05:21 -07:00
db_repl_stress.cc	Prefer static_cast in place of most reinterpret_cast (#12308 )	2024-02-07 10:44:11 -08:00
db_sanity_test.cc	Remove 'virtual' when implied by 'override' (#12319 )	2024-01-31 13:14:42 -08:00
dbench_monitor	…
generate_random_db.sh	…
ingest_external_sst.sh	…
io_tracer_parser.cc	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
io_tracer_parser_test.cc	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
io_tracer_parser_tool.cc	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
io_tracer_parser_tool.h	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
ldb.cc	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
ldb_cmd.cc	Add an option to dump wal seqno gaps (#13014 )	2024-09-18 17:48:18 -07:00
ldb_cmd_impl.h	Add an option to dump wal seqno gaps (#13014 )	2024-09-18 17:48:18 -07:00
ldb_cmd_test.cc	Remove `bottommost_temperature` (#12389 )	2024-02-27 14:48:00 -08:00
ldb_test.py	Add an option to dump wal seqno gaps (#13014 )	2024-09-18 17:48:18 -07:00
ldb_tool.cc	Add LDB command and option for follower instances (#12682 )	2024-05-28 23:21:32 -07:00
pflag	…
reduce_levels_test.cc	Make option `level_compaction_dynamic_level_bytes` true by default (#11525 )	2023-06-15 21:12:39 -07:00
regression_test.sh	Fix regression script for async_io benchmarks (#11462 )	2023-05-22 15:32:12 -07:00
restore_db.sh	Revamp check_format_compatible.sh (#8012 )	2021-03-02 11:42:27 -08:00
rocksdb_dump_test.sh	…
run_blob_bench.sh	add exe and script path check (#11621 )	2023-07-19 12:05:24 -07:00
run_flash_bench.sh	…
run_leveldb.sh	…
sample-dump.dmp	…
simulated_hybrid_file_system.cc	Group SST write in flush, compaction and db open with new stats (#11910 )	2023-12-29 15:29:23 -08:00
simulated_hybrid_file_system.h	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
sst_dump.cc	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
sst_dump_test.cc	Allow SstFileReader to verify number of entries in SST files (#12418 )	2024-03-12 11:05:20 -07:00
sst_dump_tool.cc	Augment sst_dump tool to verify num_entries in table property (#12322 )	2024-02-01 14:35:03 -08:00
trace_analyzer.cc	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
trace_analyzer_test.cc	internal_repo_rocksdb (435146444452818992) (#12115 )	2023-12-01 11:15:17 -08:00
trace_analyzer_tool.cc	Trace analyzer: replace number with enumeration type (#10827 )	2023-12-27 10:38:53 -08:00
trace_analyzer_tool.h	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
verify_random_db.sh	Fix some bugs in verify_random_db.sh (#10112 )	2022-06-03 16:35:13 -07:00
write_external_sst.sh	Revamp check_format_compatible.sh (#8012 )	2021-03-02 11:42:27 -08:00
write_stress.cc	Remove RocksDB LITE (#11147 )	2023-01-27 13:14:19 -08:00
write_stress_runner.py	Enable BLACK for internal_repo_rocksdb (#10710 )	2022-09-20 17:47:52 -07:00