rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-27 11:43:49 +00:00

Author	SHA1	Message	Date
Yu Zhang	4dafa5b220	switch to use RocksDB UnorderedMap (#11507 ) Summary: Switch from std::unordered_map to RocksDB UnorderedMap for all the places that logging user-defined timestamp size in WAL used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11507 Test Plan: ``` make all check ``` Reviewed By: ltamasi Differential Revision: D46448975 Pulled By: jowlyzhang fbshipit-source-id: bdb4d56a723b697a33daaf0f856a61d49a367a99	2023-06-05 13:36:26 -07:00
Yu Zhang	56ca9e3106	Logging timestamp size record in WAL and use it during recovery (#11471 ) Summary: Start logging the timestamp size record in WAL and use the record during recovery. Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471 Test Plan: ``` make all check ./db_secondary_test ./db_wal_test --gtest_filter="WithTimestamp" ./repair_test --gtest_filter="WithTimestamp" ``` Reviewed By: ltamasi Differential Revision: D46236769 Pulled By: jowlyzhang fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0	2023-05-30 19:32:00 -07:00
Hui Xiao	dcc6fc99f9	Fix StopWatch bug; Remove setting `record_read_stats` (#11474 ) Summary: Context/Summary: - StopWatch enable stats even when `StatsLevel::kExceptTimers` is set. It's a harmless bug though since `reportTimeToHistogram()` will not report it anyway according to https://github.com/facebook/rocksdb/blob/main/include/rocksdb/statistics.h#L705 - https://github.com/facebook/rocksdb/pull/11288 should have removed logics of setting `record_read_stats = !for_compaction` as we don't differentiate `RandomAccessFileReader`'s stats behavior based on compaction or not (instead we now report stats of different IO activities including compaction to different stats). Fixing this should report more compaction related file read micros that aren't reported previously due to `for_compaction==true` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11474 Test Plan: - DB bench pre vs post fix with small max_open_files Setup command `./db_ bench -db=/dev/shm/testdb/ -statistics=true -benchmarks=fillseq -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3` Run command `./db_bench --open_files=1 -use_existing_db=true -db=/dev/shm/testdb2/ -statistics=true -benchmarks=compactall -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3` Pre-fix ``` rocksdb.sst.read.micros P50 : 2.056175 P95 : 4.647739 P99 : 8.948475 P100 : 25.000000 COUNT : 4451 SUM : 12827 rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.read.compaction.micros P50 : 2.057397 P95 : 4.625253 P99 : 8.749474 P100 : 25.000000 COUNT : 4382 SUM : 12608 rocksdb.file.read.db.open.micros P50 : 1.985294 P95 : 9.100000 P99 : 13.000000 P100 : 13.000000 COUNT : 69 SUM : 219 ``` Post-fix (with a higher `rocksdb.file.read.compaction.micros` count) ``` rocksdb.sst.read.micros P50 : 1.858968 P95 : 3.653086 P99 : 5.968000 P100 : 21.000000 COUNT : 3548 SUM : 9119 rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.read.compaction.micros P50 : 1.857027 P95 : 3.627614 P99 : 5.738621 P100 : 21.000000 COUNT : 3479 SUM : 8904 rocksdb.file.read.db.open.micros P50 : 2.000000 P95 : 6.733333 P99 : 11.000000 P100 : 11.000000 COUNT : 69 SUM : 215 ``` - CI Reviewed By: ajkr Differential Revision: D46137221 Pulled By: hx235 fbshipit-source-id: e5b4ee7001af26f2ee0377bc6334f43b2a527388	2023-05-25 10:16:58 -07:00
Peter Dillinger	17bc27741f	Improve memory efficiency of many OptimisticTransactionDBs (#11439 ) Summary: Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually! In this change: * Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway. * Allocate the mutexes in a contiguous array, for efficiency * Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose. * Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing) * Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439 Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!) Reviewed By: ltamasi Differential Revision: D45796393 Pulled By: pdillinger fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54	2023-05-24 11:57:15 -07:00
Yu Zhang	11ebddb1d4	Add utils to use for handling user defined timestamp size record in WAL (#11451 ) Summary: Add a util method `HandleWriteBatchTimestampSizeDifference` to handle a `WriteBatch` read from WAL log when user-defined timestamp size record is written and read. Two check modes are added: `kVerifyConsistency` that just verifies the recorded timestamp size are consistent with the running ones. This mode is to be used by `db_impl_secondary` for opening a DB as secondary instance. It will also be used by `db_impl_open` before the user comparator switch support is added to make a column switch between enabling/disable UDT feature. The other mode `kReconcileInconsistency` will be used by `db_impl_open` later when user comparator can be changed. Another change is to extract a method `CollectColumnFamilyIdsFromWriteBatch` in db_secondary_impl.h into its standalone util file so it can be shared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11451 Test Plan: ``` make check ./udt_util_test ``` Reviewed By: ltamasi Differential Revision: D45894386 Pulled By: jowlyzhang fbshipit-source-id: b96790777f154cddab6d45d9ba2e5d20ebc6fe9d	2023-05-22 14:28:58 -07:00
Peter Dillinger	39f5846ec7	Much better stats for seeks and prefix filtering (#11460 ) Summary: We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers. This change does several things: * Introduce new stat tickers for seeks and related filtering, \LEVEL_SEEK\ * Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access. * We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.) * For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in. * The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases. * The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix. * Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460 Test Plan: unit tests updated, including updating many to pop the stat value since last read to improve test readability and maintainability. Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED). Create DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 ``` And run simultaneous before&after with ``` TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics ``` Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec; 18.4 (± 0.0) MB/sec After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec; 19.1 (± 0.0) MB/sec Reviewed By: ajkr Differential Revision: D46029177 Pulled By: pdillinger fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822	2023-05-19 15:25:49 -07:00
Alan Paxton	e110d713e0	Minimal RocksJava compliance with Java 8 language level (EB 1046) (#10951 ) Summary: Apply a small (and automatic) set of IntelliJ Java inspections/repairs to the Java interface to RocksDB Java and its tests. Partly enabled by the fact that we now (from RocksDB7) require java 8. Explicit <p> in empty lines in javadoc comments. Parameters and variables made final where possible. Anonymous subclasses converted lambdas. Some tests which previously used other assertion models were converted to assertj, e.g. (assertThat(actual).isEqualTo(expected) In a very few cases tests were found to be inoperative or broken, and were repaired. No problems with actual RocksDB behaviour were observed. This PR is intended to replace https://github.com/facebook/rocksdb/pull/9618 - that PR was not merged, and attempts to rebase it have yielded a questionable looking diff, so we choose to go back to square 1 here, and implement a conservative set of changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10951 Reviewed By: anand1976 Differential Revision: D45057849 Pulled By: ajkr fbshipit-source-id: e4ea46bfc80518ae86f37702b03ca9352bc11c3d	2023-05-17 19:44:24 -07:00
Peter Dillinger	206fdea3d9	Change internal headers with duplicate names (#11408 ) Summary: In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement. No public API changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408 Test Plan: existing tests Reviewed By: ltamasi Differential Revision: D45456696 Pulled By: pdillinger fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373	2023-05-17 11:27:09 -07:00
Yu Zhang	47235dda9e	Add support in log writer and reader for a user-defined timestamp size record (#11433 ) Summary: This patch adds support to write and read a user-defined timestamp size record in log writer and log reader. It will be used by WAL logs to persist the user-defined timestamp format for subsequent WriteBatch records. Reading and writing UDT sizes for WAL logs are not included in this patch. It will be in a follow up. The syntax for the record is: at write time, one such record is added when log writer encountered any non-zero UDT size it hasn't recorded so far. At read time, all such records read up to a point are accumulated and applicable to all subsequent WriteBatch records. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11433 Test Plan: ``` make clean && make -j32 all ./log_test --gtest_filter="WithTimestampSize" ``` Reviewed By: ltamasi Differential Revision: D45678708 Pulled By: jowlyzhang fbshipit-source-id: b770c8f45bb7b9383b14aac9f22af781304fb41d	2023-05-11 17:26:19 -07:00
Jay Huh	7531cbda91	Clean up rate limiter refill logic (#11425 ) Summary: Context: This pull request update is in response to a comment made on https://github.com/facebook/rocksdb/pull/8596#discussion_r680264932. The current implementation of RefillBytesAndGrantRequestsLocked() drains all available_bytes, but the first request after the last wave of requesting/bytes granting is done is not being handled in the same way. This creates a scenario where if a request for a large amount of bytes is enqueued first, but there are not enough available_bytes to fulfill it, the request is put to sleep until the next refill time. Meanwhile, a later request for a smaller number of bytes comes in and is granted immediately. This behavior is not fair as the earlier request was made first. To address this issue, we have made changes to the code to exhaust the remaining available bytes from the request and queue the remaining. With this change, requests are granted in the order they are received, ensuring that earlier requests are not unfairly delayed by later, smaller requests. The specific scenario described above will no longer occur with this change. Also consolidated `granted` and `request_bytes` as part of the change since `granted` is equivalent to `request_bytes == 0` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11425 Test Plan: Added `AvailableByteSizeExhaustTest` Reviewed By: hx235 Differential Revision: D45570711 Pulled By: jaykorean fbshipit-source-id: a7117ed17bf4b8a7ae0f76124cb41902db1a2592	2023-05-10 10:18:36 -07:00
Peter Dillinger	459969e993	Simplify detection of x86 CPU features (#11419 ) Summary: Background - runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, where performance is dramatically affected by the availability of certain CPU instructions and code using intrinsics for those instructions. And Java builds with native library try to be broadly compatible but performant. What has changed is that CRC32c is no longer the most efficient cheecksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically impacted by the availability of certain CPU instructions. For example, on my Skylake system using db_bench (similar on an older Skylake system without AVX512): PORTABLE=1 empty USE_SSE : xxh3->8 GB/s crc32c->0.8 GB/s (no SSE4.2 nor AVX2 instructions) PORTABLE=1 USE_SSE=1 : xxh3->19 GB/s crc32c->16 GB/s (with SSE4.2 and AVX2) PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s crc32c->16 GB/s (also some AVX512) Testing a ~10 year old system, with SSE4.2 but without AVX2, crc32c is a similar speed to the new systems but xxh3 is only about half that speed, also 8GB/s like the non-AVX2 compile above. Given that xxh3 has specific optimization for AVX2, I think we can infer that that crc32c is only fastest for that ~2008-2013 period when SSE4.2 was included but not AVX2. And given that xxh3 is only about 2x slower on these systems (not like >10x slower for unoptimized crc32c), I don't think we need to invest too much in optimally adapting to these old cases. x86 hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is fine IMHO. This change does two related things: * Remove runtime CPU detection for optimizing CRC32c on x86. Maintaining this code is non-zero work, and compiling special code that doesn't work on the configured target instruction set for code generation is always dubious. (On the one hand we have to ensure the CRC32c code uses SSE4.2 but on the other hand we have to ensure nothing else does.) * Detect CPU features in source code, not in build scripts. Although there are some hypothetical advantages to detectiong in build scripts (compiler generality), RocksDB supports at least three build systems: make, cmake, and buck. It's not practical to support feature detection on all three, and we have suffered from missed optimization opportunities by relying on missing or incomplete detection in cmake and buck. We also depend on some components like xxhash that do source code detection anyway. In more detail: * `HAVE_SSE42`, `HAVE_AVX2`, and `HAVE_PCLMUL` replaced by standard macros `__SSE4_2__`, `__AVX2__`, and `__PCLMUL__`. * MSVC does not provide high fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from `__AVX__` or `__AVX2__` in a compatibility header. In rare cases of false negative or false positive feature detection, a build engineer should be able to set defines to work around the issue. * `__POPCNT__` is another standard define, but we happen to only need it on MSVC, where it is set by that compatibility header, or can be set by the build engineer. * `PORTABLE` can be set to a CPU type, e.g. "haswell", to compile for that CPU type. * `USE_SSE` is deprecated, now equivalent to PORTABLE=haswell, which roughly approximates its old behavior. Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11419 Test Plan: existing tests, CI Manual performance tests after the change match the before above (none expected with make build). We also see AVX2 optimized Bloom filter code enabled when expected, by injecting a compiler error. (Performance difference is not big on my current CPU.) Reviewed By: ajkr Differential Revision: D45489041 Pulled By: pdillinger fbshipit-source-id: 60ceb0dd2aa3b365c99ed08a8b2a087a9abb6a70	2023-05-09 22:25:45 -07:00
Peter Dillinger	d79be3dca2	Changes and enhancements to compression stats, thresholds (#11388 ) Summary: ## Option API updates * Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7. * Remove unnecessary constructor for CompressionOptions. * Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated). ## Stat API updates * Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed. * Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing. * New or existing tickers relevant to compression: * BYTES_COMPRESSED_FROM * BYTES_COMPRESSED_TO * BYTES_COMPRESSION_BYPASSED * BYTES_COMPRESSION_REJECTED * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing) * NUMBER_BLOCK_COMPRESSED (existing) * NUMBER_BLOCK_COMPRESSION_BYPASSED * NUMBER_BLOCK_COMPRESSION_REJECTED * BYTES_DECOMPRESSED_FROM * BYTES_DECOMPRESSED_TO We can compute a number of things with these stats: * "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO * Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED) * Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_REJECTED + NUMBER_BLOCK_COMPRESSION_REJECTED * Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression. ## BlockBasedTableBuilder Various small refactorings, optimizations, and name clean-ups. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388 Test Plan: unit tests added * `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field. Reviewed By: ajkr Differential Revision: D45128202 Pulled By: pdillinger fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803	2023-04-21 21:57:40 -07:00
Hui Xiao	151242ce46	Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288 ) Summary: Context: The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. Summary - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - Stress test - Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's. (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - Db bench 2: performance Read SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` Compaction/Flushrun till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132	2023-04-21 09:07:18 -07:00
Jeff Palm	6b67b561bc	util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (#11371 ) Summary: util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (and enable checking for the error) Code would produce errors like this, when compiled with -Wambiguous-reversed-operator under c++20. ``` util/ribbon_test.cc:695:20: error: ISO C++20 considers use of overloaded operator '!=' (with operand types 'KeyGen' (aka '(anonymous namespace)::StandardKeyGen') and 'KeyGen') to be ambiguou s despite there being a unique best viable function with non-reversed arguments [-Werror,-Wambiguous-reversed-operator] while (cur != batch_end) { ~~~ ^ ~~~~~~~~~ util/ribbon_test.cc:111:8: note: candidate function with non-reversed arguments bool operator!=(const StandardKeyGen& other) { ^ util/ribbon_test.cc:107:8: note: ambiguous candidate function with reversed arguments bool operator==(const StandardKeyGen& other) { ^ ``` This will become a hard error in future standards. Confirmed that no errors were generated when building using clang and c++20: ``` USE_CLANG=1 USE_COROUTINES=1 make ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11371 Reviewed By: meyering Differential Revision: D44921027 Pulled By: cbi42 fbshipit-source-id: ef25b78260920a4d75a718310688d3a2487ffa87	2023-04-12 13:24:34 -07:00
Andrew Kryczka	b45738622a	Use user-provided ReadOptions for metadata block reads more often (#11208 ) Summary: This is mostly taken from https://github.com/facebook/rocksdb/issues/10427 with my own comments addressed. This PR plumbs the user’s `ReadOptions` down to `GetOrReadIndexBlock()`, `GetOrReadFilterBlock()`, and `GetFilterPartitionBlock()`. Now those functions no longer have to make up a `ReadOptions` with incomplete information. I also let `PartitionIndexReader::NewIterator()` pass through its caller's `ReadOptions::verify_checksums`, which was inexplicably dropped previously. Fixes https://github.com/facebook/rocksdb/issues/10463 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11208 Test Plan: Functional: - Measured `-verify_checksum=false` applies to metadata blocks read outside of table open - setup command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56` - run command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=readrandom -use_existing_db=true -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56 -duration=10 -threads=32 -cache_size=131072 -statistics=true -verify_checksum=false -open_files=20 -cache_index_and_filter_blocks=true` - before: `rocksdb.block.checksum.compute.count COUNT : 384353` - after: `rocksdb.block.checksum.compute.count COUNT : 22` Performance: - Setup command (tmpfs, 128MB logical data size, cache indexes/filters without pinning so index/filter lookups go through table reader): `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -compression_type=none -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1` - Measured point lookup performance. Database is fully cached to emphasize any new callstack overheads - Command: `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=readrandom[-W1][-X20] -use_existing_db=true -cache_index_and_filter_blocks=true -disable_auto_compactions=true -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1 -duration=10 -cache_size=1048576000` - Before: `readrandom [AVG 20 runs] : 274848 (± 3717) ops/sec; 8.4 (± 0.1) MB/sec` - After: `readrandom [AVG 20 runs] : 277904 (± 4474) ops/sec; 8.5 (± 0.1) MB/sec` Reviewed By: hx235 Differential Revision: D43145366 Pulled By: ajkr fbshipit-source-id: 75ec062ece86a82cd788783de9de2c72df57f994	2023-04-04 16:53:14 -07:00
Peter Dillinger	b4d78189b3	Fix gflags_compat.h (#11346 ) Summary: Was getting compilation failure with old verison of gflags, examples in https://github.com/facebook/rocksdb/issues/11344. Perhaps this is new since enabling C++17. Getting rid of std::reference_wrapper from https://github.com/facebook/rocksdb/issues/10729 seems to fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11346 Test Plan: manual, CI Reviewed By: guowentian Differential Revision: D44632776 Pulled By: pdillinger fbshipit-source-id: 5c1f3f79a055698574538b6342c912a627b6d061	2023-04-03 10:41:00 -07:00
mrambacher	b6640c3117	Remove FactoryFunc from LoadXXXObject (#11203 ) Summary: The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available. With the removal of LITE mode, the function was no longer required. Note that the MergeOperator had some private classes defined in header files. To gain access to their constructors (and name methods), the class definitions were moved into header files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203 Reviewed By: cbi42 Differential Revision: D43160255 Pulled By: pdillinger fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c	2023-02-17 12:54:07 -08:00
Andrew Kryczka	25e1365227	Merge operator failed subcode (#11231 ) Summary: From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions. This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231 Test Plan: updated unit test Reviewed By: akankshamahajan15 Differential Revision: D43396607 Pulled By: ajkr fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9	2023-02-17 10:58:46 -08:00
Wentian Guo	42d6652ba2	remove dependency on options.h for port_posix.h andport_win.h (#11214 ) Summary: The files in `port/`, such as `port_posix.h`, are layering over the system libraries, so shouldn't include the DB-specific files like `options.h`. This PR remove this dependency. # How The reason that `port_posix.h` (or `port_win.h`) include `options.h` is to use `CpuPriority`, as there is a method `SetCpuPriority()` in `port_posix.h` that uses `CpuPriority.` - I think `SetCpuPriority()` make sense to exist in `port_posix.h` as it provides has platform-dependent implementation - `CpuPriority` enum is defined in `env.h`, but used in `rocksdb/include` and `port/`. Hence, let us define `CpuPriority` enum in a common file, say `port_defs.h`, such that both directories `rocksdb/include` and `port/` can include. When we remove this dependency, some other files have compile errors because they can't find definitions, so add header files to resolve # Test make all check -j Pull Request resolved: https://github.com/facebook/rocksdb/pull/11214 Reviewed By: pdillinger Differential Revision: D43196910 Pulled By: guowentian fbshipit-source-id: 70deccb72844cfb08fcc994f76c6ef6df5d55ab9	2023-02-13 02:21:38 -08:00
Peter Dillinger	34bb3ddc43	Improve SmallEnumSet (#11178 ) Summary: In anticipation of using this to represent sets of CacheEntryRole for including or excluding kinds of blocks in block cache tiers, add significant new features to SmallEnumSet, including at least: * List initialization * Applicative constexpr operations * copy/move/equality ops * begin/end/const_iterator for iteration * Better comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/11178 Test Plan: unit tests added/expanded Reviewed By: ltamasi Differential Revision: D42973723 Pulled By: pdillinger fbshipit-source-id: 40783486feda931c3f7c6fcc9a300acd6a4b0a0a	2023-02-08 20:14:57 -08:00
anand76	77b61abc7b	Fix bug in WAL streaming uncompression (#11198 ) Summary: Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug is when consecutive fragments of a compressed record are located at the same offset in the log reader buffer, the second fragment input buffer is treated as a leftover from the previous input buffer. As a result, the offset in the `ZSTD_inBuffer` is not reset. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198 Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it. Reviewed By: ajkr, cbi42 Differential Revision: D43102692 Pulled By: anand1976 fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd	2023-02-08 12:05:49 -08:00
anand76	63da9cfa26	Return any errors returned by ReadAsync to the MultiGet caller (#11171 ) Summary: Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user. Tests: Update existing unit tests and add a new one for this scenario Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171 Reviewed By: akankshamahajan15 Differential Revision: D42950057 Pulled By: anand1976 fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869	2023-02-02 16:35:27 -08:00
sdong	4720ba4391	Remove RocksDB LITE (#11147 ) Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE \| egrep '[.](cc\|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2	2023-01-27 13:14:19 -08:00
Heiko Becker	88edfbfb5e	Fix build with gcc 13 by including <cstdint> (#11118 ) Summary: Like other versions before, gcc 13 moved some includes around and as a result <cstdint> is no longer transitively included [1]. Explicitly include it for uint{32,64}_t. [1] https://gcc.gnu.org/gcc-13/porting_to.html#header-dep-changes Pull Request resolved: https://github.com/facebook/rocksdb/pull/11118 Reviewed By: cbi42 Differential Revision: D42711356 Pulled By: ajkr fbshipit-source-id: 5ea257b85b7017f40fd8fdbce965336da95c55b2	2023-01-25 14:30:32 -08:00
Peter Dillinger	fd911f9655	Upgrade xxhash.h to latest dev (#11098 ) Summary: Upgrading xxhash.h to latest dev version as of 1/17/2023, which is d7197ddea81364a539051f116ca77926100fc77f This should improve performance on some ARM machines. I allowed some of our RocksDB-specific changes to be made obsolete where it seemed appropriate, for example * xxhash.h has its own fallthrough marker (which I hope works for us) * As in https://github.com/Cyan4973/xxHash/pull/549 Merging and resolving conflicts one way or the other was all that went into this diff. Except I had to mix the two sides around `defined(__loongarch64)` How I did the upgrade (for future reference), so that I could use usual merge conflict resolution: ``` # New branch to help with merging git checkout -b xxh_merge_base # Check out RocksDB revision before last xxhash.h upgrade git reset --hard 22161b7547652af82a5dc67458de9ca8946ac83d^ # Create a commit with the raw base version from xxHash repo (from xxHash repo) git show 2c611a76f914828bed675f0f342d6c4199ffee1e:xxhash.h > ../rocksdb/util/xxhash.h # In RocksDB repo git commit -a # Merge in the last xxhash.h upgrade git merge `22161b7547` # Resolve conflict using committed version git show 22161b7547652af82a5dc67458de9ca8946ac83d:util/xxhash.h > util/xxhash.h git commit -a # Catch up to upstream git merge upstream/main # Create a different branch for applying raw upgrade git checkout -b xxh_upgrade_2023 # Find the RocksDB commit we made for the raw base version from xxHash git log main..HEAD # Rewind to it git reset --hard `2428b727a9` # Copy in latest raw version (from xxHash repo) cat xxhash.h > ../rocksdb/util/xxhash.h # Merge in RocksDB changes, use typical tools for conflict resolution git merge xxh_merge_base ``` Branch https://github.com/facebook/rocksdb/tree/xxhash_merge_base can be used as a base for future xxhash merges. Fixes https://github.com/facebook/rocksdb/issues/11073 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11098 Test Plan: existing tests (e.g. Bloom filter schema stability tests) Also seems to include a small performance boost on my Intel dev machine, using `./db_bench --benchmarks=xxh3[-X50] 2>&1 \| egrep -o 'operations;.*' \| sort` Fastest out of 50 runs, before: 15477.3 MB/s Fastest out of 50 runs, after: 15850.7 MB/s, and 11 more runs faster than the "before" number Slowest out of 50 runs, before: 12267.5 MB/s Slowest out of 50 runs, after: 13897.1 MB/s More repetitions show the distinction is repeatable Reviewed By: hx235 Differential Revision: D42560010 Pulled By: pdillinger fbshipit-source-id: c43ee52f1c5fe0ba3d6d6e4eebb22ded5f5492ea	2023-01-19 12:07:50 -08:00
Wenlong Zhang	1cfe3528a2	support loongarch64 for rocksdb (#10036 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10036 Reviewed By: hx235 Differential Revision: D42424074 Pulled By: ajkr fbshipit-source-id: 004adb75005a26bd01c5d568d1ec6ac442cd59dd	2023-01-13 08:42:44 -08:00
Peter Dillinger	9f7801c5f1	Major Cache refactoring, CPU efficiency improvement (#10975 ) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char` for serialized byte data all over RocksDB, with many advantages over `void`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void, so that we can better indicate the intent of the void when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432	2023-01-11 14:20:40 -08:00
Peter Dillinger	433d7e4594	Improve error messages for SST footer and size errors (#11009 ) Summary: Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure: * Magic number error is reported first if we attempt to open an SST file and the footer is completely bad. * Footer errors are reported with affected file. * If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009 Test Plan: unit tests added, some manual Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed). Reviewed By: siying Differential Revision: D41656272 Pulled By: pdillinger fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df	2022-12-09 10:03:47 -08:00
WLeoo	be3a62a2e7	Fix an uninitialized variable warning for g++ 12.2.0 (#10995 ) Summary: /home/wl/rocksdbtry/rocksdb-WL/util/bloom_test.cc: In constructor ‘rocksdb::RawFilterTester::RawFilterTester()’: /home/wl/rocksdbtry/rocksdb-WL/util/bloom_test.cc:813:40: error: member ‘rocksdb::RawFilterTester::data_’ is used uninitialized [-Werror=uninitialized] Pull Request resolved: https://github.com/facebook/rocksdb/pull/10995 Reviewed By: cbi42 Differential Revision: D41620186 Pulled By: ajkr fbshipit-source-id: a6ebd3820ef12e0af322cbfb7eb553de5bdfcb29	2022-11-30 19:27:28 -08:00
Daniel Engel	55d58d91e7	Fix use of crc32c 3way on portable builds using MSVC (#10667 ) Summary: Hello, As discussed previously in this [discussion](https://github.com/facebook/rocksdb/pull/9680#discussion_r853105163), the mentioned PR introduced a regression in portable versions that compile with MSVC - crc_3way optimization won't be used even in cases where it is supported. This PR aims to fix just that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10667 Reviewed By: akankshamahajan15 Differential Revision: D40644592 Pulled By: ajkr fbshipit-source-id: dadbeb10d57c19800e74288258ec3b96095557dd	2022-11-08 11:56:55 -08:00
Brad Smith	4a6906e28c	Add OpenBSD/arm64 support for detection of CRC32 and PMULL (#10902 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10902 Reviewed By: akankshamahajan15 Differential Revision: D40839659 Pulled By: ajkr fbshipit-source-id: 06be5919622f8cce1fce1097c5e654900bf7f8fb	2022-11-02 14:35:27 -07:00
Peter Dillinger	a1a1dc6659	Manual interventions for clang-format util/ (#10870 ) Summary: Complements https://github.com/facebook/rocksdb/issues/10867 with some manual edits to avoid weird formatting or to avoid massive reformatting third party code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10870 Test Plan: `make check` etc Reviewed By: riversand963 Differential Revision: D40686526 Pulled By: pdillinger fbshipit-source-id: 6af988fe4b0a8ae4a5992ec2c3c37fe67584226e	2022-10-26 12:08:20 -07:00
Peter Dillinger	7fff38b1fe	clang-format cache/ and util/ directories (#10867 ) Summary: This is purely the result of running `clang-format -i` on files, except some files have been excluded for manual intervention in a separate PR Pull Request resolved: https://github.com/facebook/rocksdb/pull/10867 Test Plan: `make check`, `make check-headers`, `make format` Reviewed By: jay-zhuang Differential Revision: D40682086 Pulled By: pdillinger fbshipit-source-id: 8673d978553ab99b516da7fb63ba0b82523337f8	2022-10-26 12:08:20 -07:00
sdong	5d17297b76	Make UserComparatorWrapper not Customizable (#10837 ) Summary: Right now UserComparatorWrapper is a Customizable object, although it is not, which introduces some intialization overhead for the object. In some benchmarks, it shows up in CPU profiling. Make it not configurable by defining most functions needed by UserComparatorWrapper to an interface and implement the interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10837 Test Plan: Make sure existing tests pass Reviewed By: pdillinger Differential Revision: D40528511 fbshipit-source-id: 70eaac89ecd55401a26e8ed32abbc413a9617c62	2022-10-21 12:27:50 -07:00
Peter Dillinger	e466173d5c	Print stack traces on frozen tests in CI (#10828 ) Summary: Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off. For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828 Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI Reviewed By: hx235 Differential Revision: D40447634 Pulled By: pdillinger fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1	2022-10-18 00:35:35 -07:00
Yanqin Jin	d2578ab195	Add DECLARE_uint32 to gflags compatibility (#10729 ) Summary: Older versions of gflags do not have `DEFINE_uint32` and `DECLARE_uint32`. In util/gflag_compat.h, we already add a hack for `DEFINE_uint32`. This PR adds a hack for `DECLARE_uint32`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10729 Test Plan: ROCKSDB_NO_FBCODE=1 make V=1 -j16 db_stress make check Resolves https://github.com/facebook/rocksdb/issues/10704 Reviewed By: pdillinger Differential Revision: D39789183 Pulled By: riversand963 fbshipit-source-id: a58747e0163dcf55dd762733aa5c40d8f0ae70a6	2022-09-27 20:12:13 -07:00
Andrew Kryczka	36dec11bc6	Disable RateLimiterTest.Rate with valgrind (#10637 ) Summary: Example valgrind flake: https://app.circleci.com/pipelines/github/facebook/rocksdb/18073/workflows/3794e569-45cb-4621-a2b4-df1dcdf5cb19/jobs/475569 ``` util/rate_limiter_test.cc:358 Expected equality of these values: samples_at_minimum Which is: 9 samples Which is: 10 ``` Some other runs of `RateLimiterTest.Rate` already skip this check due to its reliance on a minimum execution speed. We know valgrind slows execution a lot so can disable the check in that case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10637 Reviewed By: cbi42 Differential Revision: D39251350 Pulled By: ajkr fbshipit-source-id: 41ae1ea4cd91992ea57df902f9f7fd6d182a5932	2022-09-04 22:15:14 -07:00
gitbw95	6cd8133035	Fix an import issue in fbcode. (#10604 ) Summary: This should fix an import issue detected in meta internal tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10604 Test Plan: Unit Tests. Reviewed By: hx235 Differential Revision: D39120414 Pulled By: gitbw95 fbshipit-source-id: dbd016d7f47b9f54aab5ea61e8d3cd79734f46af	2022-08-29 21:09:36 -07:00
Jay Zhuang	d9e71fb2c5	Fix periodic_task unable to re-register the same task type (#10379 ) Summary: Timer has a limitation that it cannot re-register a task with the same name, because the cancel only mark the task as invalid and wait for the Timer thread to clean it up later, before the task is cleaned up, the same task name cannot be added. Which makes the task option update likely to fail, which basically cancel and re-register the same task name. Change the periodic task name to a random unique id and store it in periodic_task_scheduler. Also refactor the `periodic_work` to `periodic_task` to make each job function as a `task`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10379 Test Plan: unittests Reviewed By: ajkr Differential Revision: D38000615 Pulled By: jay-zhuang fbshipit-source-id: e4135f9422e3b53aaec8eda54f4e18ce633a279e	2022-08-25 18:52:37 -07:00
Ryan Mack	06f73d2575	Fix autovector::emplace_back return type for C++17 (#10542 ) Summary: C++17 changes emplace_back API to return the new object. Needed to compile rocksdb on recent compilers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10542 Reviewed By: hx235 Differential Revision: D38896019 Pulled By: ajkr fbshipit-source-id: cd7ddf34c0dcd449ecedc41e89a37b3a270a5603	2022-08-23 14:58:16 -07:00
Peter Dillinger	86a1e3e0e7	Derive cache keys from SST unique IDs (#10394 ) Summary: ... so that cache keys can be derived from DB manifest data before reading the file from storage--so that every part of the file can potentially go in a persistent cache. See updated comments in cache_key.cc for technical details. Importantly, the new cache key encoding uses some fancy but efficient math to pack data into the cache key without depending on the sizes of the various pieces. This simplifies some existing code creating cache keys, like cache warming before the file size is known. This should provide us an essentially permanent mapping between SST unique IDs and base cache keys, with the ability to "upgrade" SST unique IDs (and thus cache keys) with new SST format_versions. These cache keys are of similar, perhaps indistinguishable quality to the previous generation. Before this change (see "corrected" days between collision): ``` ./cache_bench -stress_cache_key -sck_keep_bits=43 18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected) ``` After this change (keep 43 bits, up through 50, to validate "trajectory" is ok on "corrected" days between collision): ``` 19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected) 16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected) 15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected) 15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected) 15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected) 15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected) 15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected) 15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected) ``` However, the change does modify (probably weaken) the "guaranteed unique" promise from this > SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86 to this (see https://github.com/facebook/rocksdb/issues/10388) > With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits I don't think this is a practical concern, though. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394 Test Plan: unit tests updated, see simulation results above Reviewed By: jay-zhuang Differential Revision: D38667529 Pulled By: pdillinger fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba	2022-08-12 13:49:49 -07:00
sdong	9277569ba3	Add some missing headers (#10519 ) Summary: Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519 Reviewed By: jay-zhuang Differential Revision: D38603291 fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1	2022-08-11 12:45:50 -07:00
sdong	911c0208b9	WritableFileWriter tries to skip operations after failure (#10489 ) Summary: A flag in WritableFileWriter is introduced to remember error has happened. Subsequent operations will fail with an assertion. Those operations, except Close() are not supposed to be called anyway. This change will help catch bug in tests and stress tests and limit damage of a potential bug of continue writing to a file after a failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489 Test Plan: Fix existing unit tests and watch crash tests for a while. Reviewed By: anand1976 Differential Revision: D38473277 fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366	2022-08-10 10:19:20 -07:00
sdong	aec28ebae6	db_bench -use_stderr_info_logger to print timestamp (#10435 ) Summary: Right now db_bench -use_stderr_info_logger would redirect RocksDB info logging to stderr but no timetamp is printed out. Add timestamp to there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10435 Test Plan: Run "db_bench -use_stderr_info_logger" Reviewed By: riversand963 Differential Revision: D38258699 fbshipit-source-id: 3fee6eb1205127b923bc6a660f86bd2742519aec	2022-07-29 11:24:52 -07:00
Jay Zhuang	fcccc412d7	Remove Travis CI (#10407 ) Summary: Travis CI is depreciated and haven't been maintained for some time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407 Reviewed By: ajkr Differential Revision: D38078382 Pulled By: jay-zhuang fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb	2022-07-22 20:16:45 -07:00
Andrew Kryczka	e576f2ab19	Fix race conditions in GenericRateLimiter (#10374 ) Summary: Made locking strict for all accesses of `GenericRateLimiter` internal state. `SetBytesPerSecond()` was the main problem since it had no locking, while the two updates it makes need to be done as one atomic operation. The test case, "ConfigOptionsTest.ConfiguringOptionsDoesNotRevertRateLimiterBandwidth", is for the issue fixed in https://github.com/facebook/rocksdb/issues/10378, but I forgot to include the test there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10374 Reviewed By: pdillinger Differential Revision: D37906367 Pulled By: ajkr fbshipit-source-id: ccde620d2a7f96d1401bdafd2bdb685cbefbafa5	2022-07-19 09:31:14 -07:00
Andrew Kryczka	25cc564ff7	Make RateLimiter not Customizable (#10378 ) Summary: (PR created for informational/testing purposes only.) - Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()` - Benefit over #10374 is eliminating race conditions with Configurable framework. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378 Reviewed By: pdillinger Differential Revision: D37914865 fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30	2022-07-18 14:48:42 -07:00
Jay Zhuang	a3acf2ef87	Add seqno to time mapping (#10338 ) Summary: Which will be used for tiered storage to preclude hot data from compacting to the cold tier (the last level). Internally, adding seqno to time mapping. A periodic_task is scheduled to record the current_seqno -> current_time in certain cadence. When memtable flush, the mapping informaiton is stored in sstable property. During compaction, the mapping information are merged and get the approximate time of sequence number, which is used to determine if a key is recently inserted or not and preclude it from the last level if it's recently inserted (within the `preclude_last_level_data_seconds`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338 Test Plan: CI Reviewed By: siying Differential Revision: D37810187 Pulled By: jay-zhuang fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f	2022-07-14 21:49:34 -07:00
sdong	c8b20d469d	Make InternalKeyComparator not configurable (#10342 ) Summary: InternalKeyComparator is an internal class which is a simple wrapper of Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizeable. As a side effect, internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator will have an std::vector attached to it, which consumes memory and possible allocation overhead too. We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342 Test Plan: Run existing CI tests and make sure it doesn't fail Reviewed By: riversand963 Differential Revision: D37771351 fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee	2022-07-14 10:09:31 -07:00
Changyu Bi	5f9fe7f21e	Added WAL compression checksum (#10319 ) Summary: Enabled zstd checksum flag in StreamingCompress so that WAL (de)compreression is protected by a checksum per compression frame. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10319 Test Plan: - `make check` - WAL perf: average ops/sec over 10 runs is 161226 pre PR and 159635 post PR (1% drop). ``` sudo TEST_TMPDIR=/dev/shm/memtable_write ./db_bench_checksum -benchmarks=fillseq -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=10 -wal_compression=zstd ``` Reviewed By: ajkr Differential Revision: D37673311 Pulled By: cbi42 fbshipit-source-id: 9f34a3bfc2a82e5c80b1ec63bb339a7465108ec9	2022-07-13 15:29:20 -07:00

1 2 3 4 5 ...

2075 commits