Commit Graph

91 Commits

Author SHA1 Message Date
Changyu Bi 9a136e18b3 Fix a valgrind unit test failure (#13137)
Summary:
Fix the valgrind failure from https://github.com/facebook/rocksdb/actions/runs/11813904728/job/32911902535
```
[ RUN      ] WBWIMemTableTest.ReadFromWBWIMemtable
==1150870== Conditional jump or move depends on uninitialised value(s)
==1150870==    at 0x50FE67A: rocksdb::WBWIMemTable::Get(rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::PinnableWideColumns*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, unsigned long*, rocksdb::ReadOptions const&, bool, rocksdb::ReadCallback*, bool*, bool) (wbwi_memtable.cc:60)
==1150870==    by 0x50FF92A: rocksdb::WBWIMemTable::MultiGet(rocksdb::ReadOptions const&, rocksdb::MultiGetContext::Range*, rocksdb::ReadCallback*, bool) (wbwi_memtable.cc:120)
==1150870==    by 0x1879EF: rocksdb::WBWIMemTableTest_ReadFromWBWIMemtable_Test::TestBody() (write_batch_with_index_test.cc:3580)
```
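
For context, this class of valgrind report usually comes from branching on a member that the constructor leaves uninitialized. A minimal illustration of the bug pattern and its fix (hypothetical code, not the actual `WBWIMemTable` change):

```cpp
struct LookupState {
  bool found;  // never initialized: any branch on 'found' triggers
               // "Conditional jump or move depends on uninitialised value(s)"
};

struct LookupStateFixed {
  bool found = false;  // default-initialize so every read is well-defined
};
```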

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13137

Test Plan: `valgrind ./write_batch_with_index_test --gtest_filter="*ReadFromWBWIMemtable*"`

Reviewed By: ltamasi

Differential Revision: D65892657

Pulled By: cbi42

fbshipit-source-id: 0b44a5a06b8cc64173ad36966339877e2f508d52
2024-11-13 12:41:56 -08:00
Changyu Bi 1c7652fcef Introduce a WriteBatchWithIndex-based implementation of ReadOnlyMemTable (#13123)
Summary:
Introduce the class `WBWIMemTable`, which implements the `ReadOnlyMemTable` interface with data stored in a `WriteBatchWithIndex` object.

This PR implements the main read path: Get, MultiGet and Iterator. It only supports Put, Delete and SingleDelete operations for now. All keys in a `WBWIMemTable` are assigned a global sequence number through `WBWIMemTable::SetGlobalSequenceNumber()`.

Planned follow up PRs:
- Create WBWIMemTable with a transaction's WBWI and ingest it into a DB during Transaction::Commit()
- Support for Merge. This will be more complicated since we can have multiple updates with the same user key for Merge.
- Support for other operations like WideColumn and other ReadOnlyMemTable methods.
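
A conceptual sketch of the global sequence-number stamping described above (hypothetical standalone code, not the internal `WBWIMemTable` API):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Every entry ingested from a WriteBatchWithIndex-backed memtable is stamped
// with the same global sequence number, since the batch commits as one unit.
struct Entry {
  std::string user_key;
  std::string value;
  uint64_t seqno = 0;
};

void SetGlobalSequenceNumber(std::vector<Entry>& entries, uint64_t seq) {
  for (auto& e : entries) {
    e.seqno = seq;  // all keys in the batch share the commit's sequence number
  }
}
```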

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13123

Test Plan: a mini-stress test for the read path is added as a new unit test

Reviewed By: jowlyzhang

Differential Revision: D65633419

Pulled By: cbi42

fbshipit-source-id: 0684fe47260b41f51ca39c300eb72ca5bc9c5a3b
2024-11-12 09:27:11 -08:00
Peter Dillinger dd23e84cad Re-implement GetApproximateMemTableStats for skip lists (#13047)
Summary:
GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations;   11.7 MB/s
approximatememtablestats :       3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 2344.1611  StdDev: 26587.27
Min: 0  Median: 965.8555  Max: 835273
Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97
------------------------------------------------------
[       0,       1 ]   131344  13.134%  13.134% ###
(       1,       2 ]      115   0.011%  13.146%
(       2,       3 ]      106   0.011%  13.157%
(       3,       4 ]      190   0.019%  13.176%
(       4,       6 ]      214   0.021%  13.197%
(       6,      10 ]      522   0.052%  13.249%
(      10,      15 ]      748   0.075%  13.324%
(      15,      22 ]     1002   0.100%  13.424%
(      22,      34 ]     1948   0.195%  13.619%
(      34,      51 ]     3067   0.307%  13.926%
(      51,      76 ]     4213   0.421%  14.347%
(      76,     110 ]     5721   0.572%  14.919%
(     110,     170 ]    11375   1.137%  16.056%
(     170,     250 ]    17928   1.793%  17.849%
(     250,     380 ]    36597   3.660%  21.509% #
(     380,     580 ]    77882   7.788%  29.297% ##
(     580,     870 ]   160193  16.019%  45.317% ###
(     870,    1300 ]   210098  21.010%  66.326% ####
(    1300,    1900 ]   167461  16.746%  83.072% ###
(    1900,    2900 ]    78678   7.868%  90.940% ##
(    2900,    4400 ]    47743   4.774%  95.715% #
(    4400,    6600 ]    17650   1.765%  97.480%
(    6600,    9900 ]    11895   1.190%  98.669%
(    9900,   14000 ]     4993   0.499%  99.168%
(   14000,   22000 ]     2384   0.238%  99.407%
(   22000,   33000 ]     1966   0.197%  99.603%
(   50000,   75000 ]     2968   0.297%  99.900%
(  570000,  860000 ]      999   0.100% 100.000%

readrandom   :       1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations;    8.2 MB/s (1000000 of 1000000 found)
```

Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast.

I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from the true count is around 20% or less, at roughly the CPU cost of two memtable point look-ups. See code comments for detail.
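
As a side note, a hedged usage sketch of the public API in question (the numbers it returns are estimates by design; error handling elided):

```cpp
#include <cstdint>
#include "rocksdb/db.h"

// Ask the memtables roughly how many entries/bytes fall in ["k1", "k9").
void EstimateRange(ROCKSDB_NAMESPACE::DB* db) {
  ROCKSDB_NAMESPACE::Range range("k1", "k9");
  uint64_t count = 0;
  uint64_t size = 0;
  db->GetApproximateMemTableStats(range, &count, &size);
  // 'count' and 'size' now hold the skip-list-based estimates discussed above.
}
```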

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations;   11.0 MB/s
approximatememtablestats :       2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 1073.5158  StdDev: 197.80
Min: 608  Median: 1079.9506  Max: 2176
Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00
------------------------------------------------------
(     580,     870 ]   134848  13.485%  13.485% ###
(     870,    1300 ]   747868  74.787%  88.272% ###############
(    1300,    1900 ]   116536  11.654%  99.925% ##
(    1900,    2900 ]      748   0.075% 100.000%

readrandom   :       1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations;    8.1 MB/s (1000000 of 1000000 found)
```

We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look at how this behavior generalizes, first with a *much* larger range:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000
filluniquerandom :       1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations;   11.7 MB/s
approximatememtablestats :       1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations;
Reported entry count stats (expected 30000):
Count: 1000000 Average: 31098.8795  StdDev: 3601.47
Min: 21504  Median: 29333.9303  Max: 43008
Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00
------------------------------------------------------
(   14000,   22000 ]      408   0.041%   0.041%
(   22000,   33000 ]   749327  74.933%  74.974% ###############
(   33000,   50000 ]   250265  25.027% 100.000% #####

readrandom   :       1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations;    8.5 MB/s (989989 of 1000000 found)
```

This is *even faster* and relatively *more accurate*, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences:
* When the actual number in range is >= 40, the minimum return value is 40.
* When the actual number is <= 10, it is guaranteed to return that exact number.
```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75
...
filluniquerandom :       1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations;   11.4 MB/s
approximatememtablestats :       3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations;
Reported entry count stats (expected 75):
Count: 1000000 Average: 75.1210  StdDev: 15.02
Min: 40  Median: 71.9395  Max: 256
Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78
------------------------------------------------------
(      34,      51 ]    38867   3.887%   3.887% #
(      51,      76 ]   550554  55.055%  58.942% ###########
(      76,     110 ]   398854  39.885%  98.828% ########
(     110,     170 ]    11353   1.135%  99.963%
(     170,     250 ]      364   0.036%  99.999%
(     250,     380 ]        8   0.001% 100.000%

readrandom   :       1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations;    8.7 MB/s (999974 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25
...
filluniquerandom :       1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations;   10.8 MB/s
approximatememtablestats :       5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations;
Reported entry count stats (expected 25):
Count: 1000000 Average: 26.2392  StdDev: 4.58
Min: 25  Median: 28.4590  Max: 72
Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00
------------------------------------------------------
(      22,      34 ]   928936  92.894%  92.894% ###################
(      34,      51 ]    67960   6.796%  99.690% #
(      51,      76 ]     3104   0.310% 100.000%

readrandom   :       1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations;    8.6 MB/s (1000000 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10
...
filluniquerandom :       1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations;    9.9 MB/s
approximatememtablestats :       3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations;
Reported entry count stats (expected 10):
Count: 1000000 Average: 10.0000  StdDev: 0.00
Min: 10  Median: 10.0000  Max: 10
Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00
------------------------------------------------------
(       6,      10 ]  1000000 100.000% 100.000% ####################

readrandom   :       1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
```

Remarkably consistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047

Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated.

Reviewed By: cbi42

Differential Revision: D63722003

Pulled By: pdillinger

fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26
2024-10-02 14:25:50 -07:00
Changyu Bi 92ad4a88f3 Small CPU optimization in InlineSkipList::Insert() (#12975)
Summary:
Reuse the already-decoded key in more places to avoid re-decoding the length-prefixed key `x->Key()`.
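
For background, a minimal sketch of what decoding a length-prefixed key involves (hypothetical helper, not RocksDB's actual `coding.h` routines); this per-comparison cost is what the change avoids repeating:

```cpp
#include <cstdint>
#include <string_view>

// A skip-list entry stores its key as a varint32 length followed by the key
// bytes, so every x->Key() call pays a varint parse. Decoding once and
// reusing the result is the saving exploited here.
std::string_view DecodeLengthPrefixedKey(const char* p) {
  uint32_t len = 0;
  for (int shift = 0; shift <= 28; shift += 7) {
    uint8_t byte = static_cast<uint8_t>(*p++);
    len |= static_cast<uint32_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) break;
  }
  return {p, len};
}
```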

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12975

Test Plan:
ran benchmarks simultaneously for "before" and "after"
* fillseq:
```
(for I in $(seq 1 50); do ./db_bench --benchmarks=fillseq --disable_auto_compactions=1 --min_write_buffer_number_to_merge=100 --max_write_buffer_number=1000  --write_buffer_size=268435456 --num=5000000 --seed=1723056275 --disable_wal=1 2>&1 | grep "fillseq"
done;) | awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }';

before: 1483191
after: 1490555 (+0.5%)
```

* fillrandom:
```
(for I in $(seq 1 2); do ./db_bench_imain --benchmarks=fillrandom --disable_auto_compactions=1 --min_write_buffer_number_to_merge=100 --max_write_buffer_number=1000  --write_buffer_size=268435456 --num=2500000 --seed=1723056275 --disable_wal=1 2>&1 | grep "fillrandom"
done;) | awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }';

before: 255463
after: 256128 (+0.26%)
```

Reviewed By: anand1976

Differential Revision: D61835340

Pulled By: cbi42

fbshipit-source-id: 70345510720e348bacd51269acb5d2dd5a62bf0a
2024-08-27 13:57:40 -07:00
Changyu Bi defd97bc9d Add an option to verify memtable key order during reads (#12889)
Summary:
Add a new CF option `paranoid_memory_checks` that allows additional data-integrity validation during reads/scans. Currently, the skiplist-based memtable will validate the order of keys visited. Further data validation can be added at different layers. The option is opt-in due to performance overhead.

The motivation for this feature is services where data correctness is critical and which want to detect in-memory corruption earlier. For a corrupted memtable key, this feature can help detect it during reads instead of during flush with the existing protections (OutputValidator that verifies key order, or per-KV checksums). See internally linked task for more context.
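
A minimal configuration sketch, assuming the option lands on `ColumnFamilyOptions` as a plain bool per the description above:

```cpp
#include "rocksdb/options.h"

// Opt a column family into the new validation.
ROCKSDB_NAMESPACE::ColumnFamilyOptions MakeParanoidCfOptions() {
  ROCKSDB_NAMESPACE::ColumnFamilyOptions cf_opts;
  cf_opts.paranoid_memory_checks = true;  // extra key-order checks on reads/scans
  return cf_opts;
}
```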

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12889

Test Plan:
* new unit test added for paranoid_memory_checks=true.
* existing unit test for paranoid_memory_checks=false.
* enable in stress test.

Performance Benchmark: we check for performance regression in read path where data is in memtable only. For each benchmark, the script was run at the same time for main and this PR:
* Memtable-only randomread ops/sec:
```
(for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000  --seed=1723056275 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';

Main: 608146
PR with paranoid_memory_checks=false: 607727 (- %0.07)
PR with paranoid_memory_checks=true: 521889 (-%14.2)
```

* Memtable-only sequential scan ops/sec:
```
(for I in $(seq 1 50); do ./db_bench --benchmarks=fillseq,readseq[-X10] --write_buffer_size=268435456 --num=1000000  --seed=1723056275 2>&1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';

Main: 9180077
PR with paranoid_memory_checks=false: 9536241 (+%3.8)
PR with paranoid_memory_checks=true: 7653934 (-%16.6)
```

* Memtable-only reverse scan ops/sec:
```
(for I in $(seq 1 20); do ./db_bench --benchmarks=fillseq,readreverse[-X10] --write_buffer_size=268435456 --num=1000000  --seed=1723056275 2>&1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';

Main: 1285719
PR with paranoid_memory_checks=false: 1431626 (+%11.3)
PR with paranoid_memory_checks=true: 811031 (-%36.9)
```

The `readrandom` benchmark shows no regression. The scanning benchmarks show improvement that I can't explain.

Reviewed By: pdillinger

Differential Revision: D60414267

Pulled By: cbi42

fbshipit-source-id: a70b0cbeea131f1a249a5f78f9dc3a62dacfaa91
2024-08-19 13:53:25 -07:00
Peter Dillinger 54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast of a non-standard-layout type, though in practice it would only happen in multiple-inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do use it.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived : public Base1, public Base2`.  If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technically UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
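
A small self-contained illustration of the multiple-inheritance hazard described above:

```cpp
#include <cassert>

struct Base1 { int a = 1; virtual ~Base1() = default; };
struct Base2 { int b = 2; virtual ~Base2() = default; };
struct Derived : public Base1, public Base2 {};

int main() {
  Derived d;
  // static_cast knows the layout and adjusts the pointer to the Base2 subobject.
  Base2* good = static_cast<Base2*>(&d);
  assert(good->b == 2);
  // reinterpret_cast does no adjustment: 'bad' still points at the start of d
  // (the Base1 subobject), so dereferencing it is undefined behavior.
  Base2* bad = reinterpret_cast<Base2*>(&d);
  (void)bad;
  return 0;
}
```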

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Peter Dillinger 76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```
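
The effect of the transformation, illustrated on a toy class (not code from the PR):

```cpp
struct Base {
  virtual ~Base() = default;
  virtual int Size() const = 0;
};

struct Impl : Base {
  // Before: virtual int Size() const override { return 0; }
  // 'override' already implies the function is virtual, so 'virtual' is noise:
  int Size() const override { return 0; }
};
```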

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
Andrew Kryczka be3bc36811 internal_repo_rocksdb (-8794174668376270091) (#12114)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12114

Reviewed By: jowlyzhang

Differential Revision: D51745613

Pulled By: ajkr

fbshipit-source-id: 27ca4bda275cab057d3a3ec99f0f92cdb9be5177
2023-12-01 11:10:30 -08:00
clundro 50b33ebb1b remove redundant move (#11418)
Summary:
When I use g++-13 to run the `make all` command, the build emits the following warnings.
```
db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
```

I also added some `(void)unused_variable` statements because of the CMake warning flag `-Wunused-but-set-variable`
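
The classic shape of both fixes, as an illustrative standalone sketch (not the code from compaction_job_test.cc):

```cpp
#include <string>

// The return is already a move (or elided), so std::move only gets in the
// way of copy elision; g++-13 rejects it under -Werror=redundant-move.
std::string MakeGreeting() {
  std::string s = "hello";
  // Rejected: return std::move(s);
  return s;
}

void Demo() {
  int computed = 42;
  (void)computed;  // counts as a use, silencing -Wunused-but-set-variable
}
```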

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418

Reviewed By: akankshamahajan15

Differential Revision: D45528223

Pulled By: ajkr

fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
2023-05-03 09:37:21 -07:00
Hui Xiao 39c29372bf Add `SetAllowStall()` (#11335)
Summary:
**Context/Summary:**
- Allow runtime changes to whether `WriteBufferManager` allows stalls by calling `SetAllowStall()` (usage sketched after this list)
- Misc: some clean up - see PR conversation
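
A hedged sketch of the new runtime toggle (assuming a single bool parameter):

```cpp
#include "rocksdb/write_buffer_manager.h"

// Turn stalling off at runtime; currently stalled writers should be released.
void DisableStalls(ROCKSDB_NAMESPACE::WriteBufferManager* wbm) {
  wbm->SetAllowStall(false);
}
```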

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11335

Test Plan: - New UT

Reviewed By: akankshamahajan15

Differential Revision: D44502555

Pulled By: hx235

fbshipit-source-id: 24b5cc57df7734b11d42e4870c06c87b95312b5e
2023-03-30 09:43:33 -07:00
Peter Dillinger 3cacd4b4ec Put Cache and CacheWrapper in new public header (#11192)
Summary:
The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.

In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544

Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
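
A hedged sketch of the wrapping pattern now that `Name()` has no default (assuming a constructor taking the wrapped `std::shared_ptr<Cache>`):

```cpp
#include "rocksdb/advanced_cache.h"

class LoggingCache : public ROCKSDB_NAMESPACE::CacheWrapper {
 public:
  using CacheWrapper::CacheWrapper;  // inherit the wrapping constructor
  // Derived classes must now provide a name.
  const char* Name() const override { return "LoggingCache"; }
};
```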

Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192

Test Plan: `make check`

Reviewed By: anand1976

Differential Revision: D43055487

Pulled By: pdillinger

fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
2023-02-09 12:12:02 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.

Most of the changes were done through the following command:

```
unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
```

by Peter Dillinger. Other changes were manually applied to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Hui Xiao 08a63ad10b Run clang format against files under example/, memory/ and memtable/ folders (#10893)
Summary:
**Context/Summary:**
Run the following to format
```
find ./examples -iname *.h -o -iname *.cc | xargs clang-format -i
find ./memory -iname *.h -o -iname *.cc | xargs clang-format -i
find ./memtable -iname *.h -o -iname *.cc | xargs clang-format -i
```

**Test**
- Manual inspection to ensure changes are cosmetic only
- CI

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10893

Reviewed By: jay-zhuang

Differential Revision: D40779187

Pulled By: hx235

fbshipit-source-id: 529cbb0f0fbd698d95817e8c42fe3ce32254d9b0
2022-10-28 13:16:50 -07:00
Peter Dillinger e466173d5c Print stack traces on frozen tests in CI (#10828)
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit-test-like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.

For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
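
A minimal sketch of the kind of ptrace work-around described above (Linux-only; a hypothetical standalone function, not `InstallStackTraceHandler()` itself):

```cpp
#include <sys/prctl.h>

// Allow any process of the same user to attach a debugger to this one,
// even when /proc/sys/kernel/yama/ptrace_scope is 1.
void AllowSameUserPtrace() {
#ifdef PR_SET_PTRACER
  prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0);
#endif
}
```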

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828

Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI

Reviewed By: hx235

Differential Revision: D40447634

Pulled By: pdillinger

fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
2022-10-18 00:35:35 -07:00
Jay Zhuang 5d3aefb682 Migrate to docker for CI run (#10496)
Summary:
Moved Linux builds to Docker to avoid CI instability caused by a dependency-installation site being down.
Added the `Dockerfile` which is used to build the image.
Build time is also significantly reduced, because there is no dependency installation and 2xlarge+ instances are used for slow builds (like the TSAN tests).
Also fixed a few issues detected while building this:
* `DestroyDB()` Status not checked for a few tests
* nullptr might be used in `inlineskiplist.cc`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D38554200

Pulled By: jay-zhuang

fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
2022-08-10 17:34:38 -07:00
sdong 4e00748098 Fix a bug in hash linked list (#10401)
Summary:
In hash linked list, with a bucket of only one record, the following sequence can cause users to temporarily miss a record:

Thread 1: Fetches the structure bucket x points to, which would be a node n1 for a key, with a null next pointer.
Thread 2: Inserts a key to bucket x that is larger than the existing key. This makes n1->next point to a new node n2, and updates bucket x to point to n1.
Thread 1: Sees n1->next is not null, so it thinks n1 is the header of a linked list and ignores the key of n1.

Fix it by refetching the structure that bucket x points to when n1->next is seen to be non-null. This should work because if n1->next is not null, bucket x should already point to a linked list or skip list header.

A related change is to reverse the order of testing for linked list and skip list. This is because, after refetching the bucket, it might end up with a skip list rather than a linked list.
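
A conceptual sketch of the refetch fix (standalone code, not the actual HashLinkList internals):

```cpp
#include <atomic>

struct Node { std::atomic<Node*> next{nullptr}; };

Node* ResolveBucket(std::atomic<Node*>& bucket_x) {
  Node* n1 = bucket_x.load(std::memory_order_acquire);
  if (n1 != nullptr && n1->next.load(std::memory_order_acquire) != nullptr) {
    // A concurrent insert may have just converted the single-record bucket
    // into a list with a real header; refetch so we interpret it correctly.
    n1 = bucket_x.load(std::memory_order_acquire);
  }
  return n1;
}
```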

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10401

Test Plan: Run existing tests and make sure at least it doesn't regress.

Reviewed By: jay-zhuang

Differential Revision: D38064471

fbshipit-source-id: 142bb85e1546c803f47e3357aef3e76debccd8df
2022-07-25 11:33:28 -07:00
Hui Xiao 3573558ec5 Rewrite memory-charging feature's option API (#9926)
Summary:
**Context:**
Previous PRs https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, and https://github.com/facebook/rocksdb/pull/8428 added a separate flag for each charged memory area. Such an API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache-usage-related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.

Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.

**Summary:**
- Replaced old API references with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and adding a unit test for that (see the sketch after this list)
- Added missing db bench/stress test for some memory charging features
- Renamed related test suite to indicate they are under the same theme of memory charging
- Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
- Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" as the standard description of this feature.
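
A heavily hedged sketch of the consolidated API (names taken from the summary above; the exact types and fields are assumptions): opting the dictionary-building buffer out of memory charging.

```cpp
#include "rocksdb/table.h"

ROCKSDB_NAMESPACE::BlockBasedTableOptions MakeTableOptions() {
  namespace rdb = ROCKSDB_NAMESPACE;
  rdb::BlockBasedTableOptions table_options;
  rdb::CacheEntryRoleOptions role_options;
  role_options.charged = rdb::CacheEntryRoleOptions::Decision::kDisabled;
  table_options.cache_usage_options.options_overrides.insert(
      {rdb::CacheEntryRole::kCompressionDictionaryBuildingBuffer, role_options});
  return table_options;
}
```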

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926

Test Plan:
- New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
- New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
- CI
- db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR  -charge_compression_dictionary_building_buffer=1(remove this for comparison)  -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**

- db_stress: `python3 tools/db_crashtest.py blackbox  -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal

Reviewed By: ajkr

Differential Revision: D36054712

Pulled By: hx235

fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
2022-05-17 15:01:51 -07:00
sdong 736a7b5433 Remove own ToString() (#9955)
Summary:
ToString() was created because some platforms didn't support std::to_string(). However, we've already been using std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just removes ToString().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955

Test Plan: Watch CI tests

Reviewed By: riversand963

Differential Revision: D36176799

fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
2022-05-06 13:03:58 -07:00
Hui Xiao 49623f9c8e Account memory of big memory users in BlockBasedTable in global memory limit (#9748)
Summary:
**Context:**
Through heap profiling, we discovered that `BlockBasedTableReader` objects can accumulate and lead to high memory usage (e.g., `max_open_files = -1`). This memory is currently not saved, not tracked, not constrained, and not cache-evictable. As a first step to improve this, similar to https://github.com/facebook/rocksdb/pull/8428, this PR tracks an estimate of each `BlockBasedTableReader` object's memory in the block cache and fails future creation if the memory usage exceeds the available space of the cache at the time of creation.

**Summary:**
- Approximate the memory usage of big memory users (`BlockBasedTable::Rep` and `TableProperties`) in addition to the existing estimated ones (filter block/index block/un-compression dictionary); see the sketch after this list
- Charge all of these memory usages to block cache on `BlockBasedTable::Open()` and release them on `~BlockBasedTable()` as there is no memory usage fluctuation of concern in between
- Refactor on CacheReservationManager (and its call-sites) to add concurrent support for BlockBasedTable  used in this PR.
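
A hedged configuration sketch mirroring the `-reserve_table_reader_memory` db_bench flag in the test plan below (the option name and its location on `BlockBasedTableOptions` are assumptions):

```cpp
#include "rocksdb/table.h"

ROCKSDB_NAMESPACE::BlockBasedTableOptions MakeChargedTableOptions() {
  ROCKSDB_NAMESPACE::BlockBasedTableOptions table_options;
  table_options.reserve_table_reader_memory = true;  // charge readers to block cache
  return table_options;
}
```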

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9748

Test Plan:
- New unit tests
- db bench: `OpenDb` : **-0.52% in ms**
  - Setup `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -write_buffer_size=1048576`
  - Repeated run with pre-change w/o feature and post-change with feature, benchmark `OpenDb`:  `./db_bench -benchmarks=readrandom -use_existing_db=1 -db=/dev/shm/testdb -reserve_table_reader_memory=true (remove this when running w/o feature) -file_opening_threads=3 -open_files=-1 -report_open_timing=true| egrep 'OpenDb:'`

#-run | (feature-off) avg milliseconds | std milliseconds | (feature-on) avg milliseconds | std milliseconds | change (%)
-- | -- | -- | -- | -- | --
10 | 11.4018 | 5.95173 | 9.47788 | 1.57538 | -16.87382694
20 | 9.23746 | 0.841053 | 9.32377 | 1.14074 | 0.9343477536
40 | 9.0876 | 0.671129 | 9.35053 | 1.11713 | 2.893283155
80 | 9.72514 | 2.28459 | 9.52013 | 1.0894 | -2.108041632
160 | 9.74677 | 0.991234 | 9.84743 | 1.73396 | 1.032752389
320 | 10.7297 | 5.11555 | 10.547 | 1.97692 | **-1.70275031**
640 | 11.7092 | 2.36565 | 11.7869 | 2.69377 | **0.6635807741**

-  db bench on write with cost to cache in WriteBufferManager (just in case this PR's CRM refactoring accidentally slows down anything in WBM) : `fillseq` : **+0.54% in micros/op**
`./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -cost_write_buffer_to_cache=true -write_buffer_size=10000000000 | egrep 'fillseq'`

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  avg micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 6.15 | 0.260187 | 6.289 | 0.371192 | 2.260162602
20 | 7.28025 | 0.465402 | 7.37255 | 0.451256 | 1.267813605
40 | 7.06312 | 0.490654 | 7.13803 | 0.478676 | **1.060579461**
80 | 7.14035 | 0.972831 | 7.14196 | 0.92971 | **0.02254791432**

-  filter bench: `bloom filter`: **-0.78% in ms/key**
    - ` ./filter_bench -impl=2 -quick -reserve_table_builder_memory=true | grep 'Build avg'`

#-run | (pre-PR) avg ns/key | std ns/key | (post-PR)  ns/key | std ns/key | change (%)
-- | -- | -- | -- | -- | --
10 | 26.4369 | 0.442182 | 26.3273 | 0.422919 | **-0.4145720565**
20 | 26.4451 | 0.592787 | 26.1419 | 0.62451 | **-1.1465262**

- Crash test `python3 tools/db_crashtest.py blackbox --reserve_table_reader_memory=1 --cache_size=1` killed as normal

Reviewed By: ajkr

Differential Revision: D35136549

Pulled By: hx235

fbshipit-source-id: 146978858d0f900f43f4eb09bfd3e83195e3be28
2022-04-06 10:33:00 -07:00
Yanqin Jin 0376869f05 Remove using namespace (#9369)
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.
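
The shape of the change, as a toy illustration:

```cpp
#include "rocksdb/status.h"

// Before (removed by this PR):
//   using namespace ROCKSDB_NAMESPACE;
//   Status s = Status::OK();
// After: qualify explicitly so names can't collide across build platforms.
ROCKSDB_NAMESPACE::Status MakeOk() { return ROCKSDB_NAMESPACE::Status::OK(); }
```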

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369

Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all

Reviewed By: pdillinger

Differential Revision: D33517260

Pulled By: riversand963

fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
2022-01-12 09:31:12 -08:00
Hui Xiao 3018a3e27e Minor improvement to CacheReservationManager/WriteBufferManager/CompressionDictBuilding (#9139)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9139

Reviewed By: zhichao-cao

Differential Revision: D32211415

Pulled By: hx235

fbshipit-source-id: 39ce036ba34e1fb4a1992a33ac6904a4a943301d
2021-11-05 16:13:47 -07:00
Giuseppe Ottaviano 22d4dc5066 Fix race in WriteBufferManager (#9009)
Summary:
EndWriteStall has a data race: `queue_.empty()` is checked outside of the
mutex, so once we enter the critical section another thread may already have
cleared the list, and accessing the `front()` is undefined behavior (and causes
interesting crashes under high concurrency).

This PR fixes the bug, and also rewrites the logic to make it easier to reason
about it. It also fixes another subtle bug: if some writers are stalled and
`SetBufferSize(0)` is called, which disables the WBM, the writer are not
unblocked because of an early `enabled()` check in `EndWriteStall()`.

It doesn't significantly change the locking behavior, as before writers won't
lock unless entering a stall condition, and `FreeMem` almost always locks if
stalling is allowed, but that is inevitable with the current design. Liveness is
guaranteed by the fact that if some writes are blocked, eventually all writes
will be blocked due to `stall_active_`, and eventually all memory is freed.

While at it, do a couple of optimizations:

- In `WBMStallInterface::Signal()` signal the CV only after releasing the
  lock. Signaling under the lock is a common pitfall, as it causes the woken-up
  thread to immediately go back to sleep because the mutex is still locked by
  the awaker.

- Move all allocations and deallocations outside of the lock.
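
A generic sketch of the signal-after-unlock optimization from the first item above (standalone code, not `WBMStallInterface` itself):

```cpp
#include <condition_variable>
#include <mutex>

std::mutex mu;
std::condition_variable cv;
bool ready = false;

void Signal() {
  {
    std::lock_guard<std::mutex> lock(mu);
    ready = true;
  }                 // release the mutex first...
  cv.notify_one();  // ...then signal, so the waiter doesn't wake into a held lock
}
```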

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9009

Test Plan:
```
USE_CLANG=1 make -j64 all check
```

Reviewed By: akankshamahajan15

Differential Revision: D31550668

Pulled By: ot

fbshipit-source-id: 5125387c3dc7ecaaa2b8bbc736e58c4156698580
2021-10-12 00:16:21 -07:00
mrambacher beed86473a Make MemTableRepFactory into a Customizable class (#8419)
Summary:
This PR does the following:
-> Makes MemTableRepFactory into a Customizable class that is creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API

New tests were added to validate the functionality and all existing tests pass.  db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
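
A hedged usage sketch of the new creation path (assuming the Customizable `CreateFromString` signature used elsewhere in RocksDB):

```cpp
#include "rocksdb/convenience.h"
#include "rocksdb/memtablerep.h"

std::shared_ptr<ROCKSDB_NAMESPACE::MemTableRepFactory> MakeFactory() {
  ROCKSDB_NAMESPACE::ConfigOptions config_options;
  std::shared_ptr<ROCKSDB_NAMESPACE::MemTableRepFactory> factory;
  ROCKSDB_NAMESPACE::Status s =
      ROCKSDB_NAMESPACE::MemTableRepFactory::CreateFromString(
          config_options, "skip_list", &factory);
  return s.ok() ? factory : nullptr;
}
```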

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419

Reviewed By: zhichao-cao

Differential Revision: D29558961

Pulled By: mrambacher

fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
2021-09-08 07:46:44 -07:00
Peter Dillinger 4750421ece Replace most typedef with using= (#8751)
Summary:
Old typedef syntax is confusing

Most but not all changes with

    perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
    make format
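
The transformation itself, on a toy example:

```cpp
#include <map>
#include <string>

// Before:
typedef std::map<std::string, int> CountMapOld;
// After: same meaning, reads left-to-right, and unlike typedef it also
// supports alias templates.
using CountMap = std::map<std::string, int>;
```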

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751

Test Plan: existing

Reviewed By: zhichao-cao

Differential Revision: D30745277

Pulled By: pdillinger

fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
2021-09-07 11:31:59 -07:00
Hui Xiao 74cfe7db60 Refactor WriteBufferManager::CacheRep into CacheReservationManager (#8506)
Summary:
Context:
To help cap various memory usage by a single limit of the block cache capacity, we charge the memory usage through inserting/releasing dummy entries in the block cache. CacheReservationManager is such a class (non thread-safe) responsible for  inserting/removing dummy entries to reserve cache space for memory used by the class user.

- Refactored the inner private class CacheRep of WriteBufferManager into public CacheReservationManager class for reusability such as for https://github.com/facebook/rocksdb/pull/8428

- Encapsulated implementation details of cache key generation and dummy entries insertion/release in cache reservation as discussed in https://github.com/facebook/rocksdb/pull/8506#discussion_r666550838

- Consolidated increase/decrease cache reservation into one API - UpdateCacheReservation.

- Adjusted the previous dummy entry release algorithm in decreasing cache reservation to be loop-releasing dummy entries to stay symmetric to dummy entry insertion algorithm

- Made the previous dummy entry release algorithm in delayed decrease mode more aggressive for better decreasing cache reservation when memory used is less likely to increase back.

  Previously, the algorithm only released one dummy entry when new_mem_used < 3/4 * cache_allocated_size_ and cache_allocated_size_ - kSizeDummyEntry > new_mem_used.
Now, the algorithm loop-releases as many dummy entries as possible when new_mem_used < 3/4 * cache_allocated_size_.

- Updated WriteBufferManager's test cases to adapt to changes on the release algorithm mentioned above and left comment for some test cases for clarity

- Replaced the previous cache key prefix generation (utilizing object address related to the cache client) with one that utilizes Cache->NewID() to prevent cache-key collision among dummy entry clients sharing the same cache.

  The specific collision we are preventing happens when the object address is reused for a new cache-key prefix while the old cache-key using that same object address in its prefix still exists in the cache. This could happen due to that, under LRU cache policy, there is a possible delay in releasing a cache entry after the cache client object owning that cache entry get deallocated. In this case, the object address related to the cache client object can get reused for other client object to generate a new cache-key prefix.

  This prefix generation can be made obsolete after Peter's unification of all the code generating cache key, mentioned in https://github.com/facebook/rocksdb/pull/8506#discussion_r667265255
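
A conceptual model of the consolidated UpdateCacheReservation behavior described above (standalone sketch, not RocksDB's internal class):

```cpp
#include <cstddef>

// Keep enough fixed-size dummy entries in the cache to cover new_mem_used,
// loop-releasing on decrease so it stays symmetric with insertion.
class CacheReservationModel {
 public:
  void UpdateCacheReservation(std::size_t new_mem_used) {
    while (reserved_ < new_mem_used) {
      reserved_ += kSizeDummyEntry;  // insert one dummy entry
    }
    if (new_mem_used < 3 * reserved_ / 4) {
      // loop-release as many dummy entries as possible while still covering
      // new_mem_used (previously only one entry was released per call)
      while (reserved_ >= kSizeDummyEntry &&
             reserved_ - kSizeDummyEntry >= new_mem_used) {
        reserved_ -= kSizeDummyEntry;  // release one dummy entry
      }
    }
  }

 private:
  static constexpr std::size_t kSizeDummyEntry = 256 * 1024;
  std::size_t reserved_ = 0;
};
```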

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8506

Test Plan:
- Passing the added unit tests cache_reservation_manager_test.cc
- Passing existing and adjusted write_buffer_manager_test.cc

Reviewed By: ajkr

Differential Revision: D29644135

Pulled By: hx235

fbshipit-source-id: 0fc93fbfe4a40bb41be85c314f8f2bafa8b741f7
2021-08-24 12:43:31 -07:00
Baptiste Lemaire e51be2c5a1 Improve MemPurge sampling (#8656)
Summary:
Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details).
In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID.
I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction).
Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656

Reviewed By: pdillinger

Differential Revision: D30288583

Pulled By: bjlemaire

fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3
2021-08-13 14:35:41 -07:00
Baptiste Lemaire e3a96c4823 Memtable sampling for mempurge heuristic. (#8628)
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus garbage), and then compares this useful-payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage-ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value other than 0.0 or `DBL_MAX` is only supported with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
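
A hedged configuration sketch (assuming the flag is exposed through `Options` as described):

```cpp
#include "rocksdb/options.h"

ROCKSDB_NAMESPACE::Options MakeMempurgeOptions() {
  ROCKSDB_NAMESPACE::Options options;
  // 0.0 (default) deactivates mempurge; DBL_MAX would always mempurge.
  options.experimental_mempurge_threshold = 1.0;
  return options;
}
```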

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628

Reviewed By: pdillinger

Differential Revision: D30149315

Pulled By: bjlemaire

fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
2021-08-10 18:09:03 -07:00
Andrew Kryczka ed8eb436db Move slow valgrind tests behind -DROCKSDB_FULL_VALGRIND_RUN (#8475)
Summary:
Various tests had disabled valgrind because it slowed down and timed
out the CI runs (as is the case right now). Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475

Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly

Reviewed By: jay-zhuang

Differential Revision: D29504843

Pulled By: ajkr

fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
2021-07-07 11:14:05 -07:00
Peter Dillinger 311a544c2a Use deleters to label cache entries and collect stats (#8297)
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.

The new information can be seen in info LOG, for example:

    Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
    Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)

And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):

    $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
    rocksdb.block-cache-entry-stats.bytes.data-block: 0
    rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.bytes.index-block: 178992
    rocksdb.block-cache-entry-stats.bytes.misc: 0
    rocksdb.block-cache-entry-stats.bytes.other-block: 0
    rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
    rocksdb.block-cache-entry-stats.capacity: 8388608
    rocksdb.block-cache-entry-stats.count.data-block: 0
    rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.count.index-block: 215
    rocksdb.block-cache-entry-stats.count.misc: 1
    rocksdb.block-cache-entry-stats.count.other-block: 0
    rocksdb.block-cache-entry-stats.count.write-buffer: 0
    rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
    rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
    rocksdb.block-cache-entry-stats.percent.misc: 0.000000
    rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
    rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
    rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
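
A hedged sketch of fetching the same stats programmatically via GetMapProperty:

```cpp
#include <map>
#include <string>
#include "rocksdb/db.h"

void DumpBlockCacheEntryStats(ROCKSDB_NAMESPACE::DB* db) {
  std::map<std::string, std::string> stats;
  if (db->GetMapProperty(
          ROCKSDB_NAMESPACE::DB::Properties::kBlockCacheEntryStats, &stats)) {
    for (const auto& kv : stats) {
      (void)kv;  // e.g. {"count.index-block", "215"}
    }
  }
}
```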

Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.

This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.

An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".

Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.

Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.

This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297

Test Plan: manual testing with db_bench, and a couple of basic unit tests

Reviewed By: ltamasi

Differential Revision: D28488721

Pulled By: pdillinger

fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 16:51:13 -07:00
Akanksha Mahajan 596e9008e4 Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898)
Summary:
When WriteBufferManager is shared across DBs and column families
to maintain memory usage under a limit, OOMs have been observed when flush cannot
finish but writes continuously insert into memtables.
In order to avoid OOMs, when memory usage goes beyond buffer_size_ and a DB tries to write,
this change will stall incoming writers until flush is completed and memory usage
drops.

Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_
(memory_usage() >= buffer_size_), WriteBufferManager::ShouldStall() returns true.

DBImpl first blocks incoming/future writers by calling write_thread_.BeginWriteStall()
(which adds a dummy stall object to the writer's queue).
The DB is then blocked in state State::Blocked (the current write doesn't go
through). The WBMStallInterface object maintained by every DB instance is added to the
queue of WriteBufferManager.

If multiple DBs try to write during this stall, they will also be
blocked when WriteBufferManager::ShouldStall() returns true.

End stall condition: When flush is finished and memory usage goes down, the stall ends only if the memory
waiting to be flushed is less than buffer_size/2. This lower limit gives flush time
to complete and avoids continuous stalling if memory usage remains close to buffer_size.

WriteBufferManager::EndWriteStall() is called,
which removes all instances from its queue and signals them to continue.
Their state is changed to State::Running and they are unblocked. DBImpl
then signals all incoming writers of that DB to continue by calling
write_thread_.EndWriteStall() (which removes the dummy stall object from the
queue).

Each DB instance creates a WBMStallInterface, which is an interface used to block and
signal DBs during a stall.
When a DB needs to be blocked or signalled by WriteBufferManager,
its state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).
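
A hedged construction sketch (assuming the stall knob is an `allow_stall` constructor argument on WriteBufferManager, per this design):

```cpp
#include <memory>
#include "rocksdb/options.h"
#include "rocksdb/write_buffer_manager.h"

// Share one 1 GB write-buffer budget across DBs; writers stall instead of
// letting memtables grow unbounded when flushes fall behind.
ROCKSDB_NAMESPACE::Options MakeStallingOptions() {
  auto wbm = std::make_shared<ROCKSDB_NAMESPACE::WriteBufferManager>(
      1ull << 30, /*cache=*/nullptr, /*allow_stall=*/true);
  ROCKSDB_NAMESPACE::Options options;
  options.write_buffer_manager = wbm;
  return options;
}
```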

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898

Test Plan: Added a new test db/db_write_buffer_manager_test.cc

Reviewed By: anand1976

Differential Revision: D26093227

Pulled By: akankshamahajan15

fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
2021-04-21 13:54:02 -07:00
mrambacher 3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
matthewvon 4126bdc0e1 Feature: add SetBufferSize() so that managed size can be dynamic (#7961)
Summary:
This PR adds SetBufferSize() to the WriteBufferManager object.  This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows.

The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager.

This change is useful as-is. However, it is also essential if someone decides to enable db_write_buffer_size modifications through the DB::SetOptions() API; i.e., there is no waste in taking this as-is.
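
A hedged usage sketch of the motivating scenario above, growing the shared write-buffer budget at runtime:

```cpp
#include "rocksdb/write_buffer_manager.h"

void GrowWriteBudget(ROCKSDB_NAMESPACE::WriteBufferManager* wbm) {
  wbm->SetBufferSize(wbm->buffer_size() + (256ull << 20));  // +256 MB
}
```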

Any format / spacing changes are due to clang-format as required by check-in automation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961

Reviewed By: ajkr

Differential Revision: D26639075

Pulled By: akankshamahajan15

fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033
2021-03-03 14:22:11 -08:00
mrambacher 12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Akanksha Mahajan 8ed680bdb0 Add new API to report dummy entries size in cache in WriteBufferManager (#7837)
Summary:
Add a new API, WriteBufferManager::dummy_entries_in_cache_usage(), which reports the size of the dummy entries stored in the cache to account for DataBlocks in WriteBufferManager.
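
A hedged usage sketch of the new accessor (assuming a const accessor returning bytes):

```cpp
#include <cstddef>
#include "rocksdb/write_buffer_manager.h"

size_t DummyEntryBytes(const ROCKSDB_NAMESPACE::WriteBufferManager& wbm) {
  return wbm.dummy_entries_in_cache_usage();  // bytes reserved via dummy entries
}
```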

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837

Test Plan: Updated test ./write_buffer_manager_test

Reviewed By: ajkr

Differential Revision: D25794312

Pulled By: akankshamahajan15

fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b
2021-01-08 13:26:24 -08:00
mrambacher a8c89cc969 Test for LoadLatestOptions (#7554)
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist.  Added tests for the LoadOptions related methods.
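
A hedged sketch of the new behavior (assuming the ConfigOptions overload of LoadLatestOptions):

```cpp
#include <string>
#include <vector>
#include "rocksdb/utilities/options_util.h"

void CheckOptionsFile(const std::string& dbpath) {
  ROCKSDB_NAMESPACE::ConfigOptions config_options;
  ROCKSDB_NAMESPACE::DBOptions db_options;
  std::vector<ROCKSDB_NAMESPACE::ColumnFamilyDescriptor> cf_descs;
  ROCKSDB_NAMESPACE::Status s = ROCKSDB_NAMESPACE::LoadLatestOptions(
      config_options, dbpath, &db_options, &cf_descs);
  if (s.IsPathNotFound()) {
    // no OPTIONS file exists for this DB path
  }
}
```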

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554

Reviewed By: akankshamahajan15

Differential Revision: D24298985

Pulled By: zhichao-cao

fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
2020-10-14 22:28:55 -07:00
Peter Dillinger 08552b19d3 Genericize and clean up FastRange (#7436)
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.
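
A minimal sketch of the fastrange idea being generalized (Lemire's multiply-shift; illustrative code, not the templatized version from the PR):

```cpp
#include <cstdint>

// Map a 32-bit hash onto [0, range) with a multiply and shift instead of a
// modulo; faster and avoids the bias pitfalls of masking.
inline uint32_t FastRange32Sketch(uint32_t hash, uint32_t range) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(hash) * static_cast<uint64_t>(range)) >> 32);
}
```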

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436

Test Plan: added a few more test cases

Reviewed By: jay-zhuang

Differential Revision: D23958153

Pulled By: pdillinger

fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
2020-09-28 11:35:00 -07:00
mrambacher c7c7b07f06 More Makefile Cleanup (#7097)
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env

These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.

Tested both release and debug builds via Make and CMake for both static and shared libraries.

More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097

Reviewed By: riversand963

Differential Revision: D22463160

Pulled By: pdillinger

fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
2020-07-09 14:35:17 -07:00
Peter Dillinger c7432cc3c0 Fix more defects reported by Coverity Scan (#6935)
Summary:
Mostly uninitialized values: some probably written before use, but some seem like bugs. Also, a destructor needs to be virtual, and there is a possible use-after-free in a test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6935

Test Plan: make check

Reviewed By: siying

Differential Revision: D21885484

Pulled By: pdillinger

fbshipit-source-id: e2e7cb0a0cf196f2b55edd16f0634e81f6cc8e08
2020-06-04 15:35:08 -07:00
Peter Dillinger 31da5e34c1 C++20 compatibility (#6697)
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists (see the sketch after this list)
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with a different std version through -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
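A sketch of the capture-list change from the first bullet (illustrative):

```
#include <functional>

struct Job {
  int id = 0;
  void Use(int) {}
  std::function<void()> MakeCallback() {
    // Deprecated in C++20: implicit capture of 'this' via [=], e.g.
    //   return [=] { Use(id); };
    // An explicit capture list works from C++11 through C++20:
    return [this] { Use(id); };
  }
};
```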
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697

Test Plan: make check and CI

Reviewed By: cheng-chang

Differential Revision: D21020318

Pulled By: pdillinger

fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
2020-04-20 13:24:25 -07:00
sdong 57096ab13e Fix a bug that crashes the service when write buffer manager fails to insert to block cache (#6619)
Summary:
https://github.com/facebook/rocksdb/issues/6247 reports that when the write buffer manager fails to insert the dummy entry into the block cache, a null pointer is still stored and later used to release the handle, causing corruption. Fix the bug by not releasing the handle when it is null.
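A simplified sketch of the failure mode and fix (names illustrative, not the actual WriteBufferManager code):

```
// Stand-in for a block cache handle API (illustrative only).
struct Handle {};

struct CacheSketch {
  bool full = false;
  // May fail (e.g. the cache is full) and leave *h as nullptr.
  bool Insert(Handle** h) {
    if (full) {
      *h = nullptr;
      return false;
    }
    *h = new Handle();
    return true;
  }
  void Release(Handle* h) { delete h; }
};

void InsertDummyEntry(CacheSketch* cache) {
  Handle* handle = nullptr;
  if (!cache->Insert(&handle)) {
    // Before the fix: the null handle was still stored and later
    // released, corrupting cache state. Now it is simply skipped.
    return;
  }
  cache->Release(handle);
}
```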
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6619

Test Plan: Add a unit test that fails without the fix.

Reviewed By: ajkr

Differential Revision: D20776769

fbshipit-source-id: 4127fbd9f295a0a3e45774746ffcd91f939f6287
2020-04-01 11:27:40 -07:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide users a tool to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
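A sketch of the mechanism (mirroring the usual header plumbing; details illustrative):

```
// Default to 'rocksdb' unless the build overrides it, e.g. with
//   -DROCKSDB_NAMESPACE=myrocks
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

namespace ROCKSDB_NAMESPACE {
class DB;  // public symbols now live under the configurable namespace
}  // namespace ROCKSDB_NAMESPACE
```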
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all, and jtest. Try a build with ROCKSDB_NAMESPACE overridden via a build flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Peter Dillinger ca7ccbe2ea Misc hashing updates / upgrades (#5909)
Summary:
- Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09).
- Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions.
- Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range.
- Use preview version of XXH3 instead of MurmurHash64A for NPHash64
-- Had to update cache_test to increase probability of passing for any given hash function.
- Use fastrange64 instead of % with uses of NPHash64
-- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision.
- Set default seed for NPHash64 because specifying a seed rarely makes sense for it.
- Removed unnecessary include xxhash.h in a popular .h file
- Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated.

Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909

Differential Revision: D18125196

Pulled By: pdillinger

fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d
2019-10-24 17:16:46 -07:00
Maysam Yabandeh 638d239507 Charge block cache for cache internal usage (#5797)
Summary:
For our default block cache, each additional entry has extra memory overhead. It includes LRUHandle (currently 72 bytes) and the cache key (two varint64s: file id and offset). The usage is not negligible. For example, for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charges the cache for the extra usage, reducing untracked memory usage outside the block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
This PR builds up on https://github.com/facebook/rocksdb/issues/4258
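A usage sketch for opting out (assuming the LRUCacheOptions-style constructor; field names may vary by version):

```
#include <memory>

#include "rocksdb/cache.h"

int main() {
  rocksdb::LRUCacheOptions opts;
  opts.capacity = 128 << 20;  // 128MB
  // Metadata charging is on by default; opt out explicitly:
  opts.metadata_charge_policy = rocksdb::kDontChargeCacheMetadata;
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(opts);
  (void)cache;
  return 0;
}
```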
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797

Test Plan:
- Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
- The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.

Differential Revision: D17396833

Pulled By: maysamyabandeh

fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
2019-09-16 15:26:21 -07:00
Lingjing You 1a928c22a0 Add insert hints for each writebatch (#5728)
Summary:
Add insert hints for each writebatch so that they can be used in concurrent writes, and add a write option to enable it.
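A sketch of enabling it (the option name memtable_insert_hint_per_batch is assumed from this PR):

```
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

void WriteWithHints(rocksdb::DB* db) {
  rocksdb::WriteOptions wo;
  // Assumed option from this PR: keep a per-batch insert hint so
  // mostly-sequential batches avoid redundant memtable searches.
  wo.memtable_insert_hint_per_batch = true;

  rocksdb::WriteBatch batch;
  batch.Put("key1", "value1");
  batch.Put("key2", "value2");
  db->Write(wo, &batch);  // error handling omitted in this sketch
}
```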

Bench result (qps):

`./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch_size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4`

master:

| batch size \ thread num | 1       | 2       | 4       | 8       |
| ----------------------- | ------- | ------- | ------- | ------- |
| 1                       | 387883  | 220790  | 308294  | 490998  |
| 10                      | 1397208 | 978911  | 1275684 | 1733395 |
| 100                     | 2045414 | 1589927 | 1798782 | 2681039 |
| 1000                    | 2228038 | 1698252 | 1839877 | 2863490 |

fillseq with writebatch hint:

| batch size \ thread num | 1       | 2       | 4       | 8       |
| ----------------------- | ------- | ------- | ------- | ------- |
| 1                       | 286005  | 223570  | 300024  | 466981  |
| 10                      | 970374  | 813308  | 1399299 | 1753588 |
| 100                     | 1962768 | 1983023 | 2676577 | 3086426 |
| 1000                    | 2195853 | 2676782 | 3231048 | 3638143 |
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728

Differential Revision: D17297240

fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c
2019-09-12 17:15:18 -07:00
Shylock Hg 9eb3e1f77d Use delete to disable automatic generated methods. (#5009)
Summary:
Use delete to disable automatically generated methods instead of declaring them private, and group the constructors together for clarity. This modification causes an unused-field warning, so add the unused attribute to suppress it.
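For example (illustrative):

```
class Example {
 public:
  Example() = default;
  // '= delete' documents intent better than private-and-unimplemented
  // declarations, and misuse fails at compile time, not link time.
  Example(const Example&) = delete;
  Example& operator=(const Example&) = delete;
};
```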
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009

Differential Revision: D17288733

fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
2019-09-11 18:09:00 -07:00
Zhongyi Xie d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using the `PRIu64` printf format specifier, the current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Siying Dong 8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Siying Dong e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong there or are only loosely related. Move them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Siying Dong beb44ec3eb WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175)
Summary:
Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() uses min_shard_size = 512L * 1024L to determine the number of shards, so 1MB would exceed the size of a whole shard and make the cache exceed its budget.
Change it to 256KB accordingly.
There shouldn't be an obvious performance impact, since inserting a cache entry for every 256KB of memtable inserts is still infrequent enough.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175

Differential Revision: D14954289

Pulled By: siying

fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
2019-04-16 12:03:07 -07:00
Siying Dong 0bb555630f Consolidate hash function used for non-persistent data in a new function (#5155)
Summary:
Create new functions NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct calls to murmurhash() with the new functions
wherever the hash results are not used in an on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.
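A sketch of the indirection (illustrative; the stand-in hash below is not the real murmurhash):

```
#include <cstddef>
#include <cstdint>

// Trivial stand-in for the existing murmurhash implementation.
inline uint64_t MurmurHash64Sketch(const char* data, size_t n,
                                   uint64_t seed) {
  uint64_t h = seed ^ 0x9E3779B97F4A7C15ULL;
  for (size_t i = 0; i < n; ++i) {
    h = (h ^ static_cast<uint8_t>(data[i])) * 0x100000001B3ULL;
  }
  return h;
}

// Non-persistent hash: callers must not depend on the exact values, so
// the implementation can be swapped without breaking on-disk formats.
inline uint64_t NPHash64Sketch(const char* data, size_t n,
                               uint64_t seed = 0) {
  return MurmurHash64Sketch(data, n, seed);
}
```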

Also, the sharded cache hash function is changed to the new format, because
it falls into this category. It doesn't show visible performance impact
in db_bench results. CPU shown by perf is increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no compression, everything
cached in block cache). We've known that the currently used hash function,
our own Hash(), has a serious hash quality problem. It can generate a lot of
conflicts with similar inputs. In this use case, it means extra lock contention
for reads from the same file. This slight CPU regression is worth it to me
to counter the potential bad performance with hot keys. And hopefully this
will get further improved in the future with a better hash function.

cache_test's condition is relaxed a little bit too. The new hash is slightly
more skewed in this use case, but I manually checked the data and saw that
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155

Differential Revision: D14834821

Pulled By: siying

fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
2019-04-08 13:32:06 -07:00