rocksdb

Commit Graph

Author	SHA1	Message	Date
mrambacher	6924869867	Make SystemClock into a Customizable Class (#8636 ) Summary: Made SystemClock into a Customizable class, complete with CreateFromString. Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv). Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636 Reviewed By: zhichao-cao Differential Revision: D30483360 Pulled By: mrambacher fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1	2021-09-21 09:23:48 -07:00
Hui Xiao	65411b8d4e	Improve rate_limiter_test.cc (#8904 ) Summary: - Fixed a bug in `RateLimiterTest.GeneratePriorityIterationOrder` that the callbacks in this test were not called starting from `i = 1`. Fix by increasing `rate_bytes_per_sec` and requested bytes. - The bug is due to the previous `rate_bytes_per_sec` was set too small, resulting in `refill_bytes_per_period` less than `kMinRefillBytesPerPeriod`. Hence the actual `refill_bytes_per_period` was equal to `kMinRefillBytesPerPeriod` due to the logic [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L302-L303) and it ended up being greater than the previously set requested bytes. Therefore starting from `i = 1`, `RefillBytesAndGrantRequests()` and `GeneratePriorityIterationOrder` won't be called and the test callbacks was not triggered to execute the assertion. - Added internal flag to assert callbacks are called in `RateLimiterTest.GeneratePriorityIterationOrder` to prevent any future changes defeat the purpose of the test [as suggested](https://github.com/facebook/rocksdb/pull/8890#discussion_r704915134) - Increased `rate_bytes_per_sec` and bytes of each request in `RateLimiterTest.GetTotalBytesThrough`, `RateLimiterTest.GetTotalRequests`, `RateLimiterTest.GetTotalPendingRequests` to trigger the "long path" of execution (i.e, the one trigger RefillBytesAndGrantRequests()) to increase test coverage - This increased the running time of the three tests, see test plan for time difference running locally - Cleared up sync point effects after each test by calling `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearAllCallBacks();` in `~RateLimiterTest()` [as suggested](https://github.com/facebook/rocksdb/pull/8595/files#r697534279) - It's fine to call these two methods even when `EnableProcessing()` or `SetCallBack()` is not called in the test or is already cleaned up. In those cases, calling these two functions in destructor is effectively no-op. - This will allow cleaning up sync point effects of previous test even when the previous test failed in assertion. - Added missing `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearCallBacks(..);` in existing tests for completeness - Called `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearCallBacks(..);` in loop in `RateLimiterTest.GeneratePriorityIterationOrder` for completeness Pull Request resolved: https://github.com/facebook/rocksdb/pull/8904 Test Plan: - Passing existing tests - To verify the 1st change, run `RateLimiterTest.GeneratePriorityIterationOrder` with assertions of callbacks are indeed called under original `rate_bytes_per_sec` and request byte and under updated `rate_bytes_per_sec` and request byte. The former will fail the assertion while the latter succeeds. - Here is the increased test time due to the 3rd change mentioned above in the summary. The relevant 3 tests mentioned in total increase the test time by 6s (~6000/33848 = 17.7% of the original total test time), which IMO is acceptable for better test coverage through running the "long path". - current (run on branch rate_limiter_ut_improve locally) [ RUN ] RateLimiterTest.GetTotalBytesThrough [ OK ] RateLimiterTest.GetTotalBytesThrough (3000 ms) [ RUN ] RateLimiterTest.GetTotalRequests [ OK ] RateLimiterTest.GetTotalRequests (3001 ms) [ RUN ] RateLimiterTest.GetTotalPendingRequests [ OK ] RateLimiterTest.GetTotalPendingRequests (0 ms) ... [----------] 10 tests from RateLimiterTest (43349 ms total) [----------] Global test environment tear-down [==========] 10 tests from 1 test case ran. (43349 ms total) [ PASSED ] 10 tests. - previous (run on branch main locally) [ RUN ] RateLimiterTest.GetTotalBytesThrough [ OK ] RateLimiterTest.GetTotalBytesThrough (0 ms) [ RUN ] RateLimiterTest.GetTotalRequests [ OK ] RateLimiterTest.GetTotalRequests (0 ms) [ RUN ] RateLimiterTest.GetTotalPendingRequests [ OK ] RateLimiterTest.GetTotalPendingRequests (0 ms) ... [----------] 10 tests from RateLimiterTest (33848 ms total) [----------] Global test environment tear-down [==========] 10 tests from 1 test case ran. (33848 ms total) [ PASSED ] 10 tests. Reviewed By: ajkr Differential Revision: D30872544 Pulled By: hx235 fbshipit-source-id: ff894f5c1a4bef70e8e407d53b00be45f776b3e4	2021-09-17 09:23:31 -07:00
Peter Dillinger	bda8d93ba9	Fix and detect headers with missing dependencies (#8893 ) Summary: It's always annoying to find a header does not include its own dependencies and only works when included after other includes. This change adds `make check-headers` which validates that each header can be included at the top of a file. Some headers are excluded e.g. because of platform or external dependencies. rocksdb_namespace.h had to be re-worked slightly to enable checking for failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.) Fixes mostly involve adding and cleaning up #includes, but for FileTraceWriter, a constructor was out-of-lined to make a forward declaration sufficient. This check is not currently run with `make check` but is added to CircleCI build-linux-unity since that one is already relatively fast. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893 Test Plan: existing tests and resolving issues detected by new check Reviewed By: mrambacher Differential Revision: D30823300 Pulled By: pdillinger fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572	2021-09-10 10:00:26 -07:00
Hui Xiao	12542488ef	Add public API RateLimiter::GetTotalPendingRequests() (#8890 ) Summary: Context/Summary: As users requested, a public API RateLimiter::GetTotalPendingRequests() is added to expose the total number of pending requests for bytes in the rate limiter, which is the size of the request queue of that priority (or of all priorities, if IO_TOTAL is interested) at the time when this API is called. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8890 Test Plan: - Passing added new unit tests - Passing existing unit tests Reviewed By: ajkr Differential Revision: D30815500 Pulled By: hx235 fbshipit-source-id: 2dfa990f651c1c47378b6215c751ad76a5824300	2021-09-10 08:37:04 -07:00
Peter Dillinger	0ef88538c6	Improve support for using regexes (#8740 ) Summary: * Consolidate use of std::regex for testing to testharness.cc, to minimize Facebook linters constantly flagging uses in non-production code. * Improve syntax and error messages for asserting some string matches a regex in tests. * Add a public Regex wrapper class to encapsulate existing usage in ObjectRegistry. * Remove unnecessary include <regex> * Put warnings that use of Regex in production code could cause bad performance or stack overflow. Intended follow-up work: * Replace std::regex with another underlying implementation like RE2 * Improve ObjectRegistry interface in terms of possibly confusing literal string matching vs. regex and in terms of reporting invalid regex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740 Test Plan: tests updated, basic unit test for public Regex, and some manual testing of temporary changes to see example error messages: utilities/backupable/backupable_db_test.cc:917: Failure 000010_1162373755_138626.blob (child.name) does not match regex [0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern) db/db_basic_test.cc:74: Failure R3SHSBA8C4U0CIMV2ZB0 (sid3) does not match regex [0-9A-Z]{20}HAHAHA Reviewed By: mrambacher Differential Revision: D30706246 Pulled By: pdillinger fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd	2021-09-07 13:05:23 -07:00
Peter Dillinger	4750421ece	Replace most typedef with using= (#8751 ) Summary: Old typedef syntax is confusing Most but not all changes with perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files make format Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751 Test Plan: existing Reviewed By: zhichao-cao Differential Revision: D30745277 Pulled By: pdillinger fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9	2021-09-07 11:31:59 -07:00
Peter Dillinger	c9cd5d25a8	Remove some unneeded code (#8736 ) Summary: * FullKey and ParseFullKey appear to serve no purpose in the public API (or anything else) so removed. Only use in one test updated. * NumberToString serves no purpose vs. ToString so removed, numerous calls updated * Remove unnecessary forward declarations in metadata.h by re-arranging class definitions. * Remove some unneeded semicolons Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736 Test Plan: existing tests Reviewed By: mrambacher Differential Revision: D30700039 Pulled By: pdillinger fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2	2021-09-01 14:28:58 -07:00
Hui Xiao	240c4126fd	Implement superior user & mid IO priority level in GenericRateLimiter (#8595 ) Summary: Context: An extra IO_USER priority in rate limiter allows users to optionally charge WAL writes / SST reads to rate limiter at this priority level, which then has higher priority than IO_HIGH and IO_LOW. With an extra IO_USER priority, it allows users to better specify the relative urgency/importance among different requests in rate limiter. As a consequence, IO resource management can better prioritize and limit resource based on user's need. The IO_USER is implemented as superior priority in GenericRateLimiter, in the sense that its request queue will always be iterated first without being constrained to fairness. The reason is that the notion of fairness is only meaningful in helping lower priorities in background IO (i.e, IO_HIGH/MID/LOW) to gain some fair chance to run so that it does not block foreground IO (i.e, the ones that are charged at the level of IO_USER). As we can see, the ultimate goal here is to not blocking foreground IO at IO_USER level, which justifies the superiority of IO_USER. Similar benefits exist for IO_MID priority. - Rewrote the logic of deciding the order of iterating request queues of high/low priorities to include the extra user/mid priority w/o affecting the existing behavior (see PR's [comment](https://github.com/facebook/rocksdb/pull/8595/files#r678749331)) - Included the request queue of user-pri/mid-pri in the code path of next-leader-candidate signaling and GenericRateLimiter's destructor - Included the extra user/mid-pri in bookkeeping data structures: total_bytes_through_ and total_requests_ - Re-written the previous impl of explicitly iterating priorities with a loop from Env::IO_LOW to Env::IO_TOTAL Pull Request resolved: https://github.com/facebook/rocksdb/pull/8595 Test Plan: - passed existing rate_limiter_test.cc - passed added unit tests in rate_limiter_test.cc - run performance test to verify performance with only high/low requests is not affected by this change - Set-up command: `TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --duration=5 --compression_type=none --num=100000000 --disable_auto_compactions=true --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1))` - Test command: `TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=overwrite --use_existing_db=true --disable_wal=true --duration=30 --compression_type=none --num=100000000 --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1)) --statistics=true --rate_limiter_bytes_per_sec=1048576 --rate_limiter_refill_period_us=1000 --threads=32 \|& grep -E '(flush\|compact)\.write\.bytes'` - Before (on branch upstream/master): `rocksdb.compact.write.bytes COUNT : 4014162` `rocksdb.flush.write.bytes COUNT : 26715832` rocksdb.flush.write.bytes/rocksdb.compact.write.bytes ~= 6.66 - After (on branch rate_limiter_user_pri): `rocksdb.compact.write.bytes COUNT : 3807822` `rocksdb.flush.write.bytes COUNT : 26098659` rocksdb.flush.write.bytes/rocksdb.compact.write.bytes ~= 6.85 Reviewed By: ajkr Differential Revision: D30577783 Pulled By: hx235 fbshipit-source-id: 0881f2705ffd13ecd331256bde7e8ec874a353f4	2021-08-31 11:24:27 -07:00
Peter Dillinger	13ded69484	Built-in support for generating unique IDs, bug fix (#8708 ) Summary: Env::GenerateUniqueId() works fine on Windows and on POSIX where /proc/sys/kernel/random/uuid exists. Our other implementation is flawed and easily produces collision in a new multi-threaded test. As we rely more heavily on DB session ID uniqueness, this becomes a serious issue. This change combines several individually suitable entropy sources for reliable generation of random unique IDs, with goal of uniqueness and portability, not cryptographic strength nor maximum speed. Specifically: * Moves code for getting UUIDs from the OS to port::GenerateRfcUuid rather than in Env implementation details. Callers are now told whether the operation fails or succeeds. * Adds an internal API GenerateRawUniqueId for generating high-quality 128-bit unique identifiers, by combining entropy from three "tracks": * Lots of info from default Env like time, process id, and hostname. * std::random_device * port::GenerateRfcUuid (when working) * Built-in implementations of Env::GenerateUniqueId() will now always produce an RFC 4122 UUID string, either from platform-specific API or by converting the output of GenerateRawUniqueId. DB session IDs now use GenerateRawUniqueId while DB IDs (not as critical) try to use port::GenerateRfcUuid but fall back on GenerateRawUniqueId with conversion to an RFC 4122 UUID. GenerateRawUniqueId is declared and defined under env/ rather than util/ or even port/ because of the Env dependency. Likely follow-up: enhance GenerateRawUniqueId to be faster after the first call and to guarantee uniqueness within the lifetime of a single process (imparting the same property onto DB session IDs). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708 Test Plan: A new mini-stress test in env_test checks the various public and internal APIs for uniqueness, including each track of GenerateRawUniqueId individually. We can't hope to verify anywhere close to 128 bits of entropy, but it can at least detect flaws as bad as the old code. Serial execution of the new tests takes about 350 ms on my machine. Reviewed By: zhichao-cao, mrambacher Differential Revision: D30563780 Pulled By: pdillinger fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364	2021-08-30 15:20:41 -07:00
Peter Dillinger	22161b7547	Upgrade xxhash, add Hash128 (#8634 ) Summary: With expected use for a 128-bit hash, xxhash library is upgraded to current dev (2c611a76f914828bed675f0f342d6c4199ffee1e) as of Aug 6 so that we can use production version of XXH3_128bits as new Hash128 function (added in hash128.h). To make this work, however, we have to carve out the "preview" version of XXH3 that is used in new SST Bloom and Ribbon filters, since that will not get maintenance in xxhash releases. I have consolidated all the relevant code into xxph3.h and made it "inline only" (no .cc file). The working name for this hash function is changed from XXH3p to XXPH3 (XX Preview Hash) because the latter is easier to get working with no symbol name conflicts between the headers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8634 Test Plan: no expected change in existing functionality. For Hash128, added some unit tests based on those for Hash64 to ensure some basic properties and that the values do not change accidentally. Reviewed By: zhichao-cao Differential Revision: D30173490 Pulled By: pdillinger fbshipit-source-id: 06aa542a7a28b353bc2c865b9b2f8bdfe44158e4	2021-08-20 18:41:51 -07:00
Peter Dillinger	2a383f21f4	Add Bloom/Ribbon hybrid API support (#8679 ) Summary: This is essentially resurrection and fixing of the part of https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically, when configuring Ribbon filter, you can specify an LSM level before which Bloom will be used instead of Ribbon. But Bloom is only considered for Leveled and Universal compaction styles and file going into a known LSM level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as you would expect with NewRibbonFilterPolicy. So that this can be controlled with a single int value and so that flushes can be distinguished from intra-L0, we consider flush to go to level -1 for the purposes of this option. (Explained in API comment.) I also expect the most common and recommended Ribbon configuration to use Bloom during flush, to minimize slowing down writes and because according to my estimates, Ribbon only pays off if the structure lives in memory for more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy to be this mild hybrid configuration. I don't really want to add something like NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for flush, Ribbon otherwise) should be considered a natural choice. C APIs also updated, but because they don't support overloading, rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid configuration. While touching C API, I changed bits per key options from int to double. BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit unused fields from BloomFilterPolicy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679 Test Plan: new + updated tests, including crash test Reviewed By: jay-zhuang Differential Revision: D30445797 Pulled By: pdillinger fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1	2021-08-20 18:00:16 -07:00
Peter Dillinger	b6269b078a	Stable cache keys on ingested SST files (#8669 ) Summary: Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even the same file ingested into different DBs sharing a block cache. Note: These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669 Test Plan: Extended unit test Reviewed By: zhichao-cao Differential Revision: D30398532 Pulled By: pdillinger fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7	2021-08-18 11:33:03 -07:00
Andrew Kryczka	82b81dc8b5	Simplify GenericRateLimiter algorithm (#8602 ) Summary: `GenericRateLimiter` slow path handles requests that cannot be satisfied immediately. Such requests enter a queue, and their thread stays in `Request()` until they are granted or the rate limiter is stopped. These threads are responsible for unblocking themselves. The work to do so is split into two main duties. (1) Waiting for the next refill time. (2) Refilling the bytes and granting requests. Prior to this PR, the slow path logic involved a leader election algorithm to pick one thread to perform (1) followed by (2). It elected the thread whose request was at the front of the highest priority non-empty queue since that request was most likely to be granted. This algorithm was efficient in terms of reducing intermediate wakeups, which is a thread waking up only to resume waiting after finding its request is not granted. However, the conceptual complexity of this algorithm was too high. It took me a long time to draw a timeline to understand how it works for just one edge case yet there were so many. This PR drops the leader election to reduce conceptual complexity. Now, the two duties can be performed by whichever thread acquires the lock first. The risk of this change is increasing the number of intermediate wakeups, however, we took steps to mitigate that. - `wait_until_refill_pending_` flag ensures only one thread performs (1). This\ prevents the thundering herd problem at the next refill time. The remaining\ threads wait on their condition variable with an unbounded duration -- thus we\ must remember to notify them to ensure forward progress. - (1) is typically done by a thread at the front of a queue. This is trivial\ when the queues are initially empty as the first choice that arrives must be\ the only entry in its queue. When queues are initially non-empty, we achieve\ this by having (2) notify a thread at the front of a queue (preferring higher\ priority) to perform the next duty. - We do not require any additional wakeup for (2). Typically it will just be\ done by the thread that finished (1). Combined, the second and third bullet points above suggest the refill/granting will typically be done by a request at the front of its queue. This is important because one wakeup is saved when a granted request happens to be in an already running thread. Note there are a few cases that still lead to intermediate wakeup, however. The first two are existing issues that also apply to the old algorithm, however, the third (including both subpoints) is new. - No request may be granted (only possible when rate limit dynamically\ decreases). - Requests from a different queue may be granted. - (2) may be run by a non-front request thread causing it to not be granted even\ if some requests in that same queue are granted. It can happen for a couple\ (unlikely) reasons. - A new request may sneak in and grab the lock at the refill time, before the\ thread finishing (1) can wake up and grab it. - A new request may sneak in and grab the lock and execute (1) before (2)'s\ chosen candidate can wake up and grab the lock. Then that non-front request\ thread performing (1) can carry over to perform (2). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602 Test Plan: - Use existing tests. The edge cases listed in the comment are all performance\ related; I could not really think of any related to correctness. The logic\ looks the same whether a thread wakes up/finishes its work early/on-time/late,\ or whether the thread is chosen vs. "steals" the work. - Verified write throughput and CPU overhead are basically the same with and\ without this change, even in a rate limiter heavy workload: Test command: ``` $ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000 ``` Results before this PR: ``` fillrandom : 108.463 micros/op 9219 ops/sec; 9.0 MB/s 7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k ``` Results after this PR: ``` fillrandom : 108.108 micros/op 9250 ops/sec; 9.0 MB/s 7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k ``` Reviewed By: hx235 Differential Revision: D30048013 Pulled By: ajkr fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b	2021-08-09 16:47:15 -07:00
Lucian Grijincu	a756fb9c85	rocksdb: don't call LZ4_loadDictHC with null dictionary Summary: UBSAN revealed a pointer underflow when `LZ4HC_init_internal` is called with a null `start`. Reviewed By: ajkr Differential Revision: D30181874 fbshipit-source-id: ca9bbac1a85c58782871d7f153af733b000cc66c	2021-08-09 16:05:46 -07:00
Andrew Kryczka	23ffed9cb7	Prevent joining detached thread in ThreadPoolImpl (#8635 ) Summary: This draining mechanism should not be run during `JoinThreads()` because it can detach threads that will be joined. Joining detached threads would throw an exception. With this PR, we skip draining when `JoinThreads()` has already decided what threads to `join()`, so the threads will exit naturally once the work queue empties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8635 Test Plan: verified it unblocked using `WaitForJobsAndJoinAllThreads()` in https://github.com/facebook/rocksdb/issues/8611. Reviewed By: riversand963 Differential Revision: D30174587 Pulled By: ajkr fbshipit-source-id: 144966398a607987e0763c7152a0f653fdbf3c8b	2021-08-06 19:06:02 -07:00
hx235	dbe3810c74	Improve rate limiter implementation's readability (#8596 ) Summary: Context: As need for new feature of resource management using RocksDB's rate limiter like [https://github.com/facebook/rocksdb/issues/8595](https://github.com/facebook/rocksdb/pull/8595) arises, it is about time to re-learn our rate limiter and make this learning process easier for others by improving its readability. The comment/assertion/one extra else-branch are added based on my best understanding toward the rate_limiter.cc and rate_limiter_test.cc up to date after giving it a hard read. - Add code comments/assertion/one extra else-branch (that is not affecting existing behavior, see PR comment) to describe how leader-election works under multi-thread settings in GenericRateLimiter::Request() - Add code comments to describe a non-obvious trick during clean-up of rate limiter destructor - Add code comments to explain more about the starvation being fixed in GenericRateLimiter::Refill() through partial byte-granting - Add code comments to the rate limiter's setup in a complicated unit test in rate_limiter_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8596 Test Plan: - passed existing rate_limiter_test.cc Reviewed By: ajkr Differential Revision: D29982590 Pulled By: hx235 fbshipit-source-id: c3592986bb5b0c90d8229fe44f425251ec7e8a0a	2021-08-04 10:43:47 -07:00
Peter Dillinger	1d34cd797e	Fix insecure internal API for GetImpl (#8590 ) Summary: Calling the GetImpl function could leave reference to a local callback function in a field of a parameter struct. As this is performance-critical code, I'm not going to attempt to sanitize this code too much, but make the existing hack a bit cleaner by reverting what it overwrites in the input struct. Added SaveAndRestore utility class to make that easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590 Test Plan: added unit test for SaveAndRestore; existing tests for GetImpl Reviewed By: riversand963 Differential Revision: D29947983 Pulled By: pdillinger fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb	2021-07-29 17:23:01 -07:00
Drewryz	3b27725245	Fix a minor issue with initializing the test path (#8555 ) Summary: The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555 Reviewed By: ajkr Differential Revision: D29758399 Pulled By: jay-zhuang fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011	2021-07-23 08:38:45 -07:00
Peter Dillinger	84eef260de	Remove TaskLimiterToken::ReleaseOnce for fix (#8567 ) Summary: Rare TSAN and valgrind failures are caused by unnecessary reading of a field on the TaskLimiterToken::limiter_ for an assertion after the token has been released and the limiter destroyed. To simplify we can simply destroy the token before triggering DB shutdown (potentially destroying the limiter). This makes the ReleaseOnce logic unnecessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567 Test Plan: watch for more failures in CI Reviewed By: ajkr Differential Revision: D29811795 Pulled By: pdillinger fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3	2021-07-21 17:37:53 -07:00
sherriiiliu	7b9ecd4067	fix several MSVC build errors (#8519 ) Summary: Fixed a few MSVC (VCToolsVersion=14.0) build errors and warnings * `DEFINE_string` is a macro and VC compiler complains that it cannot put [ifdef-inside-define](https://stackoverflow.com/questions/5586429/ifdef-inside-define) * `sleep()` is not a recognizable function. Use `FLAGS_env->SleepForMicroseconds` instead * Define precise type in comparison to avoid mismatch warning Pull Request resolved: https://github.com/facebook/rocksdb/pull/8519 Reviewed By: jay-zhuang Differential Revision: D29683086 fbshipit-source-id: 8c80941472089f8daba84ae29597e75e603850e4	2021-07-13 12:40:43 -07:00
mrambacher	89f66d4484	Add customizable_util.h to the public API (#8301 ) Summary: Useful for allowing new classes to create and manage Customizable objects without using internal APIs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8301 Reviewed By: zhichao-cao Differential Revision: D29428303 Pulled By: mrambacher fbshipit-source-id: 3d33d5197cc8379fe35b54d3d169f91f0dfe7a47	2021-06-29 09:08:57 -07:00
Zhichao Cao	a904c62d28	Using existing crc32c checksum in checksum handoff for Manifest and WAL (#8412 ) Summary: In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead. In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412 Test Plan: make check, add new testing cases. Reviewed By: anand1976 Differential Revision: D29151545 Pulled By: zhichao-cao fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772	2021-06-25 00:47:17 -07:00
Zhichao Cao	ecccc63179	Implementation of Crc32c combine function (#8305 ) Summary: Implement a function to generate the crc32c of two combined strings. Suppose we have the string 1 (s1) with crc32c checksum crc32c_1 and string 2 (s2) with crc32c checksum crc32c_2, the new string is s1+s2 and its checksum is crc32c_new=Crc32cCombine(crc32c_1, crc32c_2, s2.size). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8305 Test Plan: make check, added new testing case Reviewed By: pdillinger Differential Revision: D28651665 Pulled By: zhichao-cao fbshipit-source-id: c84116108388f11a81f6a217b49f99c70d4ffacf	2021-06-16 18:30:34 -07:00
mrambacher	6ad0810393	Make Comparator into a Customizable Object (#8336 ) Summary: Makes the Comparator class into a Customizable object. Added/Updated the CreateFromString method to create Comparators. Added test for using the ObjectRegistry to create one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8336 Reviewed By: jay-zhuang Differential Revision: D28999612 Pulled By: mrambacher fbshipit-source-id: bff2cb2814eeb9fef6a00fddc61d6e34b6fbcf2e	2021-06-11 06:22:59 -07:00
Jay Zhuang	3786181a90	Add remote compaction public API (#8300 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8300 Reviewed By: ajkr Differential Revision: D28464726 Pulled By: jay-zhuang fbshipit-source-id: 49e9f4fb791808a6cbf39a7b1a331373f645fc5e	2021-05-19 21:41:31 -07:00
Glebanister	748e3acc11	Add StartThread type checking wrapper (#8303 ) Summary: - Add class `FunctorWrapper` to invoke the function with given parameters - Implement `StartThreadTyped` which wraps `StartThread` with type checking cover - Demonstrate `StartThreadTyped` in test `util/thread_local_test.cc` https://github.com/facebook/rocksdb/issues/8285 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8303 Reviewed By: ajkr Differential Revision: D28539318 Pulled By: pdillinger fbshipit-source-id: 624789c236bde31163deda95c1e1471aee68933e	2021-05-19 16:51:13 -07:00
Jay Zhuang	a79b46c503	Add De/Serialization for CompactionInput/Result (#8247 ) Summary: The functions will be used for remote compaction parameter input and result. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8247 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D28104680 Pulled By: jay-zhuang fbshipit-source-id: c0a5178e6277125118384278efea2acbf90aa6cb	2021-05-12 12:36:43 -07:00
Andrew Kryczka	c70bae1b05	Fix ConcurrentTaskLimiter token release for shutdown (#8253 ) Summary: Previously the shutdown process did not properly wait for all `compaction_thread_limiter` tokens to be released before proceeding to delete the DB's C++ objects. When this happened, we saw tests like "DBCompactionTest.CompactionLimiter" flake with the following error: ``` virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed. ``` There is a case where a token can still be alive even after the shutdown process has waited for BG work to complete. In particular, this happens because the shutdown process only waits for flush/compaction scheduled/unscheduled counters to all reach zero. These counters are decremented in `BackgroundCallCompaction()` functions. However, tokens are released in `BGWorkCompaction()` functions, which actually wrap the `BackgroundCallCompaction()` function. A simple sleep could repro the race condition: ``` $ diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 806bc548a..ba59efa89 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -2442,6 +2442,7 @@ void DBImpl::BGWorkCompaction(void arg) { static_cast<PrepickedCompaction*>(ca.prepicked_compaction); static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction( prepicked_compaction, Env::Priority::LOW); + sleep(1); delete prepicked_compaction; } $ ./db_compaction_test --gtest_filter=DBCompactionTest.CompactionLimiter db_compaction_test: util/concurrent_task_limiter_impl.cc:24: virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed. Received signal 6 (Aborted) #0 /usr/local/fbcode/platform007/lib/libc.so.6(gsignal+0xcf) [0x7f02673c30ff] ?? ??:0 https://github.com/facebook/rocksdb/issues/1 /usr/local/fbcode/platform007/lib/libc.so.6(abort+0x134) [0x7f02673ac934] ?? ??:0 ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8253 Test Plan: sleeps to expose race conditions Reviewed By: akankshamahajan15 Differential Revision: D28168064 Pulled By: ajkr fbshipit-source-id: 9e5167c74398d323e7975980c5cc00f450631160	2021-05-04 17:27:24 -07:00
mrambacher	0ca6d6297f	Rename variables in ImmutableCFOptions to avoid conflicts with ImmutableDBOptions (#8227 ) Summary: Renaming ImmutableCFOptions::info_log and statistics to logger and stats. This is stage 2 in creating an ImmutableOptions class. It is necessary because the names match those in ImmutableOptions and have different types. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8227 Reviewed By: jay-zhuang Differential Revision: D28000967 Pulled By: mrambacher fbshipit-source-id: 3bf2aa04e8f1e8724d825b7deacf41080c14420b	2021-04-26 12:43:45 -07:00
mrambacher	01e460d538	Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176 ) Summary: This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively. readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release). There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176 Reviewed By: pdillinger Differential Revision: D27954339 Pulled By: mrambacher fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad	2021-04-22 20:43:54 -07:00
Peter Dillinger	95f6add746	Revert Ribbon starting level support from #8198 (#8212 ) Summary: This partially reverts commit `10196d7edc`. The problem with this change is because of important filter use cases: FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so would only use Ribbon filters if specifically including level 0 for the Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate unknown level, and this would be treated the same as level 0 unless fixed. We are keeping the part about committing to permanent schema, which is only changes to API comments and HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212 Test Plan: CI Reviewed By: jay-zhuang Differential Revision: D27896468 Pulled By: pdillinger fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e	2021-04-20 19:46:40 -07:00
Peter Dillinger	10196d7edc	Ribbon long-term support, starting level support (#8198 ) Summary: Since the Ribbon filter schema seems good (compatible back to 6.15.0), this change commits to long term support of the SST schema, even though we expect the API for enabling Ribbon to change (still called NewExperimentalRibbonFilterPolicy). This also adds support for "hybrid" configuration in which some levels use Bloom (higher levels, lower numbered) for speed and the rest use Ribbon (lower levels, higher numbered) for memory space efficiency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198 Test Plan: unit test added, crash test support Reviewed By: jay-zhuang Differential Revision: D27831232 Pulled By: pdillinger fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab	2021-04-16 15:43:08 -07:00
Levi Tamasi	303cb23a0f	Introduce a ThreadGuard class and use it in ExternalSSTFileTest.PickedLevelBug (#8112 ) Summary: The patch adds a resource management/RAII class called `ThreadGuard`, which can be used to ensure that the managed thread is joined when the `ThreadGuard` is destroyed, regardless of whether it is due to the object going out of scope, an early return, an exception etc. This is important because if an `std::thread` object is destroyed without having been joined (or detached) first, the process is aborted (via `std::terminate`). For now, `ThreadGuard` is only used in the test case `ExternalSSTFileTest.PickedLevelBug`; however, it could come in handy elsewhere in the codebase as well (both in test code and "real" code). Case in point: in the `PickedLevelBug` test case, with the earlier code we could end up in the above situation when the following assertion (which is before the threads are joined) is triggered: ``` ASSERT_FALSE(bg_compact_started.load()); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8112 Test Plan: ``` make check gtest-parallel --repeat=10000 ./external_sst_file_test --gtest_filter="*PickedLevelBug" ``` Reviewed By: riversand963 Differential Revision: D27343185 Pulled By: ltamasi fbshipit-source-id: 2a8c3aa68bc78cc03ec0dbae909fb25c2cd15c69	2021-03-25 22:08:58 -07:00
Jay Zhuang	45c65d6dcf	Use thread-safe `strerror_r()` to get error message (#8087 ) Summary: `strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from `0deef031cb/folly/String.cpp (L457)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087 Reviewed By: mrambacher Differential Revision: D27267151 Pulled By: jay-zhuang fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb	2021-03-24 23:07:27 -07:00
Peter Dillinger	da6b90ab48	Improve bloom_test bits_per_key flag (#8093 ) Summary: Improved handling of -bits_per_key other than 10, but at least the OptimizeForMemory test is simply not designed for generally handling other settings. (ribbon_test does have a statistical framework for this kind of testing, but it's not important to do that same for Bloom right now.) Closes https://github.com/facebook/rocksdb/issues/7019 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8093 Test Plan: for I in `seq 1 20`; do ./bloom_test --gtest_filter=-OptimizeForMemory --bits_per_key=$I &> /dev/null \|\| echo FAILED; done Reviewed By: mrambacher Differential Revision: D27275875 Pulled By: pdillinger fbshipit-source-id: 7362e8ac2c41ea11f639412e4f30c8b375f04388	2021-03-23 21:42:40 -07:00
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	2021-03-15 04:34:11 -07:00
Peter Dillinger	01c2ec3fcb	Add ROCKSDB_GTEST_BYPASS (#8048 ) Summary: This is for cases that do not meet the Facebook criteria for SKIP (see new comments). Also made ROCKSDB_GTEST_{SKIP,BYPASS} print the message because gtest doesn't ever seem to. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8048 Test Plan: manual inspection of ./ribbon_test output, CI Reviewed By: mrambacher Differential Revision: D26953688 Pulled By: pdillinger fbshipit-source-id: c914eaffe7d419db6ab90a193d474531e23582e5	2021-03-12 16:02:06 -08:00
Yanqin Jin	82b3888433	Enable backward iterator for keys with user-defined timestamp (#8035 ) Summary: This PR does the following: - Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet. - Introduces a new helper API `Comparator::EqualWithoutTimestamp()`. - Fix a typo in `SetTimestamp()`. - Add/update unit tests Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps. 1. Disable timestamp, run: ``` ./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5 ``` Results: > Baseline > - seekrandom [AVG 6 runs] : 96115 ops/sec; 53.2 MB/sec > - seekrandom [MEDIAN 6 runs] : 98075 ops/sec; 54.2 MB/sec > > This PR > - seekrandom [AVG 6 runs] : 95521 ops/sec; 52.8 MB/sec > - seekrandom [MEDIAN 6 runs] : 96338 ops/sec; 53.3 MB/sec 2. Enable timestamp, run: ``` ./db_bench -user_timestamp_size=8 -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5 ``` Result: > Baseline: not supported > > This PR > - seekrandom [AVG 6 runs] : 90514 ops/sec; 50.1 MB/sec > - seekrandom [MEDIAN 6 runs] : 90834 ops/sec; 50.2 MB/sec Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035 Reviewed By: ltamasi Differential Revision: D26926668 Pulled By: riversand963 fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5	2021-03-10 11:15:46 -08:00
David CARLIER	7a3444bf1f	Mac M1 crc32 intrinsics ARM64 check support proposal (#7893 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7893 Reviewed By: ajkr Differential Revision: D26050966 Pulled By: jay-zhuang fbshipit-source-id: 9df2bb65d82defd7fad49d5369979b03e22d39c2	2021-03-10 09:05:56 -08:00
Peter Dillinger	4b18c46d10	Refactor: add LineFileReader and Status::MustCheck (#8026 ) Summary: Removed confusing, awkward, and undocumented internal API ReadOneLine and replaced with very simple LineFileReader. In refactoring backupable_db.cc, this has the side benefit of removing the arbitrary cap on the size of backup metadata files. Also added Status::MustCheck to make it easy to mark a Status as "must check." Using this, I can ensure that after LineFileReader::ReadLine returns false the caller checks GetStatus(). Also removed some excessive conditional compilation in status.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026 Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED Reviewed By: mrambacher Differential Revision: D26831687 Pulled By: pdillinger fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f	2021-03-09 20:12:38 -08:00
Levi Tamasi	cb25bc1128	Update compaction statistics to include the amount of data read from blob files (#8022 ) Summary: The patch does the following: 1) Exposes the amount of data (number of bytes) read from blob files from `BlobFileReader::GetBlob` / `Version::GetBlob`. 2) Tracks the total number and size of blobs read from blob files during a compaction (due to garbage collection or compaction filter usage) in `CompactionIterationStats` and propagates this data to `InternalStats::CompactionStats` / `CompactionJobStats`. 3) Updates the formulae for write amplification calculations to include the amount of data read from blob files. 4) Extends the compaction stats dump with a new column `Rblob(GB)` and a new line containing the total number and size of blob files in the current `Version` to complement the information about the shape and size of the LSM tree that's already there. 5) Updates `CompactionJobStats` so that the number of files and amount of data written by a compaction are broken down per file type (i.e. table/blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26801199 Pulled By: ltamasi fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2	2021-03-04 00:43:48 -08:00
Peter Dillinger	a8b3b9a20c	Refine Ribbon configuration, improve testing, add Homogeneous (#7879 ) Summary: This change only affects non-schema-critical aspects of the production candidate Ribbon filter. Specifically, it refines choice of internal configuration parameters based on inputs. The changes are minor enough that the schema tests in bloom_test, some of which depend on this, are unaffected. There are also some minor optimizations and refactorings. This would be a schema change for "smash" Ribbon, to fix some known issues with small filters, but "smash" Ribbon is not accessible in public APIs. Unit test CompactnessAndBacktrackAndFpRate updated to test small and medium-large filters. Run with --thoroughness=100 or so for much better detection power (not appropriate for continuous regression testing). Homogenous Ribbon: This change adds internally a Ribbon filter variant we call Homogeneous Ribbon, in collaboration with Stefan Walzer. The expected "result" value for every key is zero, instead of computed from a hash. Entropy for queries not to be false positives comes from free variables ("overhead") in the solution structure, which are populated pseudorandomly. Construction is slightly faster for not tracking result values, and never fails. Instead, FP rate can jump up whenever and whereever entries are packed too tightly. For small structures, we can choose overhead to make this FP rate jump unlikely, as seen in updated unit test CompactnessAndBacktrackAndFpRate. Unlike standard Ribbon, Homogeneous Ribbon seems to scale to arbitrary number of keys when accepting an FP rate penalty for small pockets of high FP rate in the structure. For example, 64-bit ribbon with 8 solution columns and 10% allocated space overhead for slots seems to achieve about 10.5% space overhead vs. information-theoretic minimum based on its observed FP rate with expected pockets of degradation. (FP rate is close to 1/256.) If targeting a higher FP rate with fewer solution columns, Homogeneous Ribbon can be even more space efficient, because the penalty from degradation is relatively smaller. If targeting a lower FP rate, Homogeneous Ribbon is less space efficient, as more allocated overhead is needed to keep the FP rate impact of degradation relatively under control. The new OptimizeHomogAtScale tool in ribbon_test helps to find these optimal allocation overheads for different numbers of solution columns. And Ribbon widths, with 128-bit Ribbon apparently cutting space overheads in half vs. 64-bit. Other misc item specifics: * Ribbon APIs in util/ribbon_config.h now provide configuration data for not just 5% construction failure rate (95% success), but also 50% and 0.1%. * Note that the Ribbon structure does not exhibit "threshold" behavior as standard Xor filter does, so there is a roughly fixed space penalty to cut construction failure rate in half. Thus, there isn't really an "almost sure" setting. * Although we can extrapolate settings for large filters, we don't have a good formula for configuring smaller filters (< 2^17 slots or so), and efforts to summarize with a formula have failed. Thus, small data is hard-coded from updated FindOccupancy tool. * Enhances ApproximateNumEntries for public API Ribbon using more precise data (new API GetNumToAdd), thus a more accurate but not perfect reversal of CalculateSpace. (bloom_test updated to expect the greater precision) * Move EndianSwapValue from coding.h to coding_lean.h to keep Ribbon code easily transferable from RocksDB * Add some missing 'const' to member functions * Small optimization to 128-bit BitParity * Small refactoring of BandingStorage in ribbon_alg.h to support Homogeneous Ribbon * CompactnessAndBacktrackAndFpRate now has an "expand" test: on construction failure, a possible alternative to re-seeding hash functions is simply to increase the number of slots (allocated space overhead) and try again with essentially the same hash values. (Start locations will be different roundings of the same scaled hash values--because fastrange not mod.) This seems to be as effective or more effective than re-seeding, as long as we increase the number of slots (m) by roughly m += m/w where w is the Ribbon width. This way, there is effectively an expansion by one slot for each ribbon-width window in the banding. (This approach assumes that getting "bad data" from your hash function is as unlikely as it naturally should be, e.g. no adversary.) * 32-bit and 16-bit Ribbon configurations are added to ribbon_test for understanding their behavior, e.g. with FindOccupancy. They are not considered useful at this time and not tested with CompactnessAndBacktrackAndFpRate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7879 Test Plan: unit test updates included Reviewed By: jay-zhuang Differential Revision: D26371245 Pulled By: pdillinger fbshipit-source-id: da6600d90a3785b99ad17a88b2a3027710b4ea3a	2021-02-26 08:50:42 -08:00
sherriiiliu	e017af15c1	Fix testcase failures on windows (#7992 ) Summary: Fixed 5 test case failures found on Windows 10/Windows Server 2016 1. In `flush_job_test`, the DestroyDir function fails in deconstructor because some file handles are still being held by VersionSet. This happens on Windows Server 2016, so need to manually reset versions_ pointer to release all file handles. 2. In `StatsHistoryTest.InMemoryStatsHistoryPurging` test, the capping memory cost of stats_history_size on Windows becomes 14000 bytes with latest changes, not just 13000 bytes. 3. In `SSTDumpToolTest.RawOutput` test, the output file handle is not closed at the end. 4. In `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on windows so `total_mem` is always equal to `total_size`. The internal memory fragmentation assertion does not apply in this case. 5. In `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach 87.5% compression ratio with original CreateTable method, so I append extra zeros to the string value to enhance compression ratio. Beside, since XPRESS allocates memory internally, thus does not support for custom allocator verification, we will skip the allocator verification for XPRESS Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992 Reviewed By: jay-zhuang Differential Revision: D26615283 Pulled By: ajkr fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d	2021-02-23 14:35:06 -08:00
Andrew Kryczka	d904233d2f	Limit buffering for collecting samples for compression dictionary (#7970 ) Summary: For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file. However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage. Related changes include: - Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks - Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary - Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970 Test Plan: - updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level - looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set. Reviewed By: pdillinger Differential Revision: D26467994 Pulled By: ajkr fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465	2021-02-19 14:09:54 -08:00
Max Neunhoeffer	cf14cb3e29	Avoid self-move-assign in pop operation of binary heap. (#7942 ) Summary: The current implementation of a binary heap in `util/heap.h` does a move-assign in the `pop` method. In the case that there is exactly one element stored in the heap, this ends up being a self-move-assign. This can cause trouble with certain classes, which are not prepared for this. Furthermore, it trips up the glibc STL debugger (`-D_GLIBCXX_DEBUG`), which produces an assertion failure in this case. This PR addresses this problem by not doing the (unnecessary in this case) move-assign if there is only one element in the heap. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7942 Reviewed By: jay-zhuang Differential Revision: D26528739 Pulled By: ajkr fbshipit-source-id: 5ca570e0c4168f086b10308ad766dff84e6e2d03	2021-02-19 13:47:25 -08:00
Wilfried Goesgens	8a05c21e32	add string separation while composing error message (#7919 ) Summary: This will fix a missing string separation between `msg[n]` and `state_`. Example of an error message how its looking now: ``` IO error: No space left on deviceWhile appending to file: /home/willi/src/stable-3.7/tmp/arangosh_CL6EFQ/shell_client/single1/data/engine-rocksdb/126426.sst: No space left on device ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7919 Reviewed By: ajkr Differential Revision: D26242246 Pulled By: jay-zhuang fbshipit-source-id: 5d9a0997a410aecfb3781478e57395d3d937bb84	2021-02-18 12:25:35 -08:00
Zhichao Cao	d1c510baec	Handoff checksum Implementation (#7523 ) Summary: in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0	2021-02-10 22:20:32 -08:00
Peter Dillinger	e4f1e64c30	Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889 ) Summary: Adds support for prefetching data in Ribbon queries, which especially optimizes batched Ribbon queries for MultiGet (~222ns/key to ~97ns/key) but also single key queries on cold memory (~333ns to ~226ns) because many queries span more than one cache line. This required some refactoring of the query algorithm, and there does not appear to be a noticeable regression in "hot memory" query times (perhaps from 48ns to 50ns). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889 Test Plan: existing unit tests, plus performance validation with filter_bench: Each data point is the best of two runs. I saturated the machine CPUs with other filter_bench runs in the background. Before: $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 125.86 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 Prelim FP rate %: 0.951827 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 48.0111 Batched, prepared net ns/op: 222.384 Batched, unprepared net ns/op: 343.908 Skewed 50% in 1% net ns/op: 252.916 Skewed 80% in 20% net ns/op: 320.579 Random filter net ns/op: 332.957 After: $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 128.117 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 Prelim FP rate %: 0.951827 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 49.8812 Batched, prepared net ns/op: 97.1514 Batched, unprepared net ns/op: 222.025 Skewed 50% in 1% net ns/op: 197.48 Skewed 80% in 20% net ns/op: 212.457 Random filter net ns/op: 226.464 Bloom comparison, for reference: $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 35.3042 Number of filters: 1993 Total size (MB): 238.488 Reported total allocated memory (MB): 262.875 Reported internal fragmentation: 10.2255% Bits/key stored: 10.0029 Prelim FP rate %: 0.965327 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 9.09931 Batched, prepared net ns/op: 34.21 Batched, unprepared net ns/op: 88.8564 Skewed 50% in 1% net ns/op: 139.75 Skewed 80% in 20% net ns/op: 181.264 Random filter net ns/op: 173.88 Reviewed By: jay-zhuang Differential Revision: D26378710 Pulled By: pdillinger fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69	2021-02-10 21:04:56 -08:00
Andrew Kryczka	78ee8564ad	Integrity protection for live updates to WriteBatch (#7748 ) Summary: This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.). The foundation classes (`ProtectionInfo.`) embed the coverage info in their type, and provide `Protect.()` and `Strip.()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer. When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748 Test Plan: - an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught - add to stress/crash test to verify it works in variety of configs/operations without intentional corruption - [deferred] unit tests for `ProtectionInfo.` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc. Reviewed By: pdillinger Differential Revision: D25754492 Pulled By: ajkr fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866	2021-01-29 12:18:58 -08:00
mrambacher	4a09d632c4	Remove Legacy and Custom FileWrapper classes from header files (#7851 ) Summary: Removed the uses of the Legacy FileWrapper classes from the source code. The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem. Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851 Reviewed By: anand1976 Differential Revision: D26114816 Pulled By: mrambacher fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150	2021-01-28 22:10:32 -08:00
mrambacher	0a9a05ae12	Make builds reproducible (#7866 ) Summary: Closes https://github.com/facebook/rocksdb/issues/7035 Changed how build_version.cc was generated: - Included the GIT tag/branch in the build_version file - Changed the "Build Date" to be: - If the GIT branch is "clean" (no changes), the date of the last git commit - If the branch is not clean, the current date - Added APIs to access the "build information", rather than accessing the strings directly. The build_version.cc file is now regenerated whenever the library objects are rebuilt. Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866 Reviewed By: pdillinger Differential Revision: D26086565 Pulled By: mrambacher fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f	2021-01-28 17:42:16 -08:00
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	2021-01-25 22:09:11 -08:00
Andrew Kryczka	e18a4df62a	workaround race conditions during `PeriodicWorkScheduler` registration (#7888 ) Summary: This provides a workaround for two race conditions that will be fixed in a more sophisticated way later. This PR: (1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe. (2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888 Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881 Reviewed By: jay-zhuang Differential Revision: D25990891 Pulled By: ajkr fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38	2021-01-21 08:50:38 -08:00
Andrew Kryczka	5b748b9e68	Cover all status codes in `Status::ToString()` (#7872 ) Summary: - Completed the switch statement for all possible `Code` values (the only one missing was `kCompactionTooLarge`). - Removed the default case so compiler can alert us if a new value is added to `Code` without handling it in `Status::ToString()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7872 Test Plan: verified the log message for this scenario looks right ``` 2021/01/15-17:26:34.564450 7fa6845fe700 [ERROR] [/db_impl/db_impl_compaction_flush.cc:2621] Waiting after background compaction error: Compaction too large: , Accumulated background error counts: 1 ``` Reviewed By: ramvadiv Differential Revision: D25934539 Pulled By: ajkr fbshipit-source-id: 2e0b3c0d993e356a4987276d6f8a163f0ee8be7a	2021-01-16 04:28:50 -08:00
mrambacher	c1a65a4de4	Make StringEnv, StringSink, StringSource use FS classes (#7786 ) Summary: Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes. Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786 Reviewed By: jay-zhuang Differential Revision: D25761460 Pulled By: anand1976 fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f	2021-01-04 16:01:01 -08:00
Andrew Kryczka	61e324422e	fix thread status synchronization in thread_list_test (#7825 ) Summary: The test was flaky because the BG threads could increase `running_count_` up to `job_count_` before applying their thread status updates. Then the test thread would see non-deterministic results when counting threads with each status. The fix is to acquire mutex in test thread so it sees `running_count_` and thread status updated atomically. I think simply reordering the two updates would have been insufficient since the thread status update uses `memory_order_relaxed`. This change happens to also eliminate an undesirable sleep loop. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7825 Test Plan: injected sleeps to verify the failure repros before this PR and does not repro after. Reviewed By: jay-zhuang Differential Revision: D25742409 Pulled By: ajkr fbshipit-source-id: 926a2223fe856e20bc4c0c27df6736ee5cb02c97	2021-01-04 10:46:24 -08:00
mrambacher	55e99688cc	No elide constructors (#7798 ) Summary: Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798 Reviewed By: jay-zhuang Differential Revision: D25680451 Pulled By: pdillinger fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917	2020-12-23 16:55:53 -08:00
mrambacher	02418194d7	Add more tests for assert status checked (#7524 ) Summary: Added 10 more tests that pass the ASSERT_STATUS_CHECKED test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7524 Reviewed By: akankshamahajan15 Differential Revision: D24323093 Pulled By: ajkr fbshipit-source-id: 28d4106d0ca1740c3b896c755edf82d504b74801	2020-12-22 23:45:58 -08:00
Peter Dillinger	239d17a19c	Support optimize_filters_for_memory for Ribbon filter (#7774 ) Summary: Primarily this change refactors the optimize_filters_for_memory code for Bloom filters, based on malloc_usable_size, to also work for Ribbon filters. This change also replaces the somewhat slow but general BuiltinFilterBitsBuilder::ApproximateNumEntries with implementation-specific versions for Ribbon (new) and Legacy Bloom (based on a recently deleted version). The reason is to emphasize speed in ApproximateNumEntries rather than 100% accuracy. Justification: ApproximateNumEntries (formerly CalculateNumEntry) is only used by RocksDB for range-partitioned filters, called each time we start to construct one. (In theory, it should be possible to reuse the estimate, but the abstractions provided by FilterPolicy don't really make that workable.) But this is only used as a heuristic estimate for hitting a desired partitioned filter size because of alignment to data blocks, which have various numbers of unique keys or prefixes. The two factors lead us to prioritize reasonable speed over 100% accuracy. optimize_filters_for_memory adds extra complication, because precisely calculating num_entries for some allowed number of bytes depends on state with optimize_filters_for_memory enabled. And the allocator-agnostic implementation of optimize_filters_for_memory, using malloc_usable_size, means we would have to actually allocate memory, many times, just to precisely determine how many entries (keys) could be added and stay below some size budget, for the current state. (In a draft, I got this working, and then realized the balance of speed vs. accuracy was all wrong.) So related to that, I have made CalculateSpace, an internal-only API only used for testing, non-authoritative also if optimize_filters_for_memory is enabled. This simplifies some code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774 Test Plan: unit test updated, and for FilterSize test, range of tested values is greatly expanded (still super fast) Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes. Bloom+optimize_filters_for_memory: 1 Filter size: 197 (224 in memory) 134 Filter size: 3525 (3584 in memory) 107 Filter size: 4037 (4096 in memory) Total on disk: 904,506 Total in memory: 918,752 Ribbon+optimize_filters_for_memory: 1 Filter size: 3061 (3072 in memory) 110 Filter size: 3573 (3584 in memory) 58 Filter size: 4085 (4096 in memory) Total on disk: 633,021 (-30.0%) Total in memory: 634,880 (-30.9%) Bloom (no offm): 1 Filter size: 261 (320 in memory) 1 Filter size: 3333 (3584 in memory) 240 Filter size: 3717 (4096 in memory) Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm) Total in memory: 986,944 (+7.4% vs. +offm) Ribbon (no offm): 1 Filter size: 2949 (3072 in memory) 1 Filter size: 3381 (3584 in memory) 167 Filter size: 3701 (4096 in memory) Total on disk: 624,397 (-30.3% vs. Bloom) Total in memory: 690,688 (-30.0% vs. Bloom) Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom. Reviewed By: jay-zhuang Differential Revision: D25592970 Pulled By: pdillinger fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912	2020-12-18 14:31:03 -08:00
Peter Dillinger	003e72b201	Use size_t for filter APIs, protect against overflow (#7726 ) Summary: Deprecate CalculateNumEntry and replace with ApproximateNumEntries (better name) using size_t instead of int and uint32_t, to minimize confusing casts and bad overflow behavior (possible though probably not realistic). Bloom sizes are now explicitly capped at max size supported by implementations: just under 4GiB for fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This hardening could help to set up for fuzzing. Also, since RocksDB only uses this information as an approximation for trying to hit certain sizes for partitioned filters, it's more important that the function be reasonably fast than for it to be completely accurate. It's hard enough to be 100% accurate for Ribbon (currently reversing CalculateSpace) that adding optimize_filters_for_memory into the mix is just not worth trying to be 100% accurate for num entries for bytes. Also: - Cleaned up filter_policy.h to remove MSVC warning handling and potentially unsafe use of exception for "not implemented" - Correct the number of entries limit beyond which current Ribbon implementation falls back on Bloom instead. - Consistently use "num_entries" rather than "num_entry" - Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially obsolete from general implementation BuiltinFilterBitsBuilder::CalculateNumEntries. - Fix filter_bench to skip some tests that don't make sense when only one or a small number of filters has been generated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726 Test Plan: expanded existing unit tests for CalculateSpace / ApproximateNumEntries. Also manually used filter_bench to verify Legacy and fv=5 Bloom size caps work (much too expensive for unit test). Note that the actual bits per key is below requested due to space cap. $ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate ... Total size (MB): 511.992 Bits/key stored: 16.777 ... $ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000 ... Total size (MB): 4096 Bits/key stored: 17.1799 ... $ Reviewed By: jay-zhuang Differential Revision: D25239800 Pulled By: pdillinger fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4	2020-12-11 22:18:12 -08:00
pkubaj	66e54c5984	Fix build on FreeBSD/powerpc64(le) (#7732 ) Summary: To build on FreeBSD, arch_ppc_probe needs to be adapted to FreeBSD. Since FreeBSD uses elf_aux_info as an getauxval equivalent, use it and include necessary headers: - machine/cpu.h for PPC_FEATURE2_HAS_VEC_CRYPTO, - sys/auxv.h for elf_aux_info, - sys/elf_common.h for AT_HWCAP2. elf_aux_info isn't checked for being available, because it's available since FreeBSD 12.0. rocksdb assumes using Clang on FreeBSD, but powerpc* platforms switch to Clang only since 13.0. This patch makes rocksdb build on FreeBSD on powerpc64 and powerpc64le platforms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7732 Reviewed By: ltamasi Differential Revision: D25399194 Pulled By: pdillinger fbshipit-source-id: 9c905147d75f98cd2557dd2f86a940b8e6c5afcd	2020-12-08 15:31:56 -08:00
Vincent Milum Jr	93c6c18cf9	Adding ARM AT_HWCAP support for FreeBSD (#7750 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7750 Reviewed By: ltamasi Differential Revision: D25400609 Pulled By: pdillinger fbshipit-source-id: 13b15e2f490acc011b648fbd9615ea8e580cccc7	2020-12-08 13:33:21 -08:00
Adam Retter	ee4bd4780b	Fix compilation on Apple Silicon (#7714 ) Summary: Closes - https://github.com/facebook/rocksdb/issues/7710 I tested this on an Apple DTK (Developer Transition Kit) with an Apple A12Z Bionic CPU and macOS Big Sur (11.0.1). Previously the arm64 specific CRC optimisations were limited to Linux only OS... Well now Apple Silicon is also arm64 but runs macOS ;-) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7714 Reviewed By: ltamasi Differential Revision: D25287349 Pulled By: pdillinger fbshipit-source-id: 639b168bf0ac2652907531e9604936ac4974b577	2020-12-04 15:22:33 -08:00
Peter Dillinger	3b9bfe8f14	Skip minimum rate check in Sandcastle (#7728 ) Summary: The minimum rate check in RateLimiterTest.Rate can fail in Facebook's CI system Sandcastle, presumably due to heavily loaded machines. This change disables the minimum rate check for Sandcastle runs, and cleans up the code disabling it on other CI environments. (The amount of conditionally compiled code shall be minimized.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7728 Test Plan: try new test with and without setting envvar SANDCASTLE=1 Reviewed By: ltamasi Differential Revision: D25247642 Pulled By: pdillinger fbshipit-source-id: d786233af37af9a874adbb3a9e2707ec52c27a5a	2020-12-02 15:16:49 -08:00
Adam Retter	b937be3779	Fix Compilation on ppc64le using Clang 11 (#7713 ) Summary: Closes https://github.com/facebook/rocksdb/issues/7691 The optimised CRC code for PPC64le which was originally imported in https://github.com/facebook/rocksdb/pull/2353 is not compatible with Clang 11. It looks like the code most likely originated from https://github.com/antonblanchard/crc32-vpmsum. The code relied on a GCC header file `ppc-asm.h` which is not available in Clang. To solve this, I have taken the same approach as the the upstream project from which the CRC code came `ffc8018efc (diff-ec3e62c56fbcddeb07230f2a4673c1abd7f0f1cc8e48a2aa560056cfc1b25d60)` and simply imported a copy of the GCC header file into our code-base which will be used when Clang is the compiler on pcc64le. NOTE: The new file `util/ppc-asm.h` may have licensing implications which I guess need to be approved by RocksDB/Facebook before this is merged Pull Request resolved: https://github.com/facebook/rocksdb/pull/7713 Reviewed By: jay-zhuang Differential Revision: D25222645 Pulled By: pdillinger fbshipit-source-id: e3fec9136f26ce1eb7a027048bcf77a6cb3c769c	2020-12-01 11:21:44 -08:00
Peter Dillinger	0baa5055f1	Add Ribbon schema test to bloom_test (#7696 ) Summary: These new unit tests should ensure that we don't accidentally change the interpretation of bits for what I call Standard128Ribbon filter internally, available publicly as NewExperimentalRibbonFilterPolicy. There is very little intuitive reason for the values we check against in these tests; I just plug in the right expected values upon watching the test fail initially. Most (but not all) of the tests are essentially "whitebox" "round-trip." We create a filter from fixed keys, and first compare the checksum of those filter bytes against a saved value. We also run queries against other fixed keys, comparing which return false positives against a saved set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7696 Test Plan: test addition and refactoring only Reviewed By: jay-zhuang Differential Revision: D25082289 Pulled By: pdillinger fbshipit-source-id: b5ca646fdcb5a1c2ad2085eda4a1fd44c4287f67	2020-11-22 19:52:04 -08:00
Peter Dillinger	60af964372	Experimental (production candidate) SST schema for Ribbon filter (#7658 ) Summary: Added experimental public API for Ribbon filter: NewExperimentalRibbonFilterPolicy(). This experimental API will take a "Bloom equivalent" bits per key, and configure the Ribbon filter for the same FP rate as Bloom would have but ~30% space savings. (Note: optimize_filters_for_memory is not yet implemented for Ribbon filter. That can be added with no effect on schema.) Internally, the Ribbon filter is configured using a "one_in_fp_rate" value, which is 1 over desired FP rate. For example, use 100 for 1% FP rate. I'm expecting this will be used in the future for configuring Bloom-like filters, as I expect people to more commonly hold constant the filter accuracy and change the space vs. time trade-off, rather than hold constant the space (per key) and change the accuracy vs. time trade-off, though we might make that available. ### Benchmarking ``` $ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 34.1341 Number of filters: 1993 Total size (MB): 238.488 Reported total allocated memory (MB): 262.875 Reported internal fragmentation: 10.2255% Bits/key stored: 10.0029 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 18.7508 Random filter net ns/op: 258.246 Average FP rate %: 0.968672 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 130.851 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 58.4523 Random filter net ns/op: 363.717 Average FP rate %: 0.952978 ---------------------------- Done. (For more info, run with -legend or -help.) ``` 168.166 / 238.488 = 0.705 -> 29.5% space reduction 130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction) ### Working around a hashing "flaw" bloom_test discovered a flaw in the simple hashing applied in StandardHasher when num_starts == 1 (num_slots == 128), showing an excessively high FP rate. The problem is that when many entries, on the order of number of hash bits or kCoeffBits, are associated with the same start location, the correlation between the CoeffRow and ResultRow (for efficiency) can lead to a solution that is "universal," or nearly so, for entries mapping to that start location. (Normally, variance in start location breaks the effective association between CoeffRow and ResultRow; the same value for CoeffRow is effectively different if start locations are different.) Without kUseSmash and with num_starts > 1 (thus num_starts ~= num_slots), this flaw should be completely irrelevant. Even with 10M slots, the chances of a single slot having just 16 (or more) entries map to it--not enough to cause an FP problem, which would be local to that slot if it happened--is 1 in millions. This spreadsheet formula shows that: =1/(10000000(1 - POISSON(15, 1, TRUE))) As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is intended for CPU efficiency of filters with many more entries/slots than kCoeffBits, a very reasonable work-around is to disallow num_starts==1 when !kUseSmash, by making the minimum non-zero number of slots 2kCoeffBits. This is the work-around I've applied. This also means that the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not space-efficient for less than a few hundred entries. Because of this, I have made it fall back on constructing a Bloom filter, under existing schema, when that is more space efficient for small filters. (We can change this in the future if we want.) TODO: better unit tests for this case in ribbon_test, and probably update StandardHasher for kUseSmash case so that it can scale nicely to small filters. ### Other related changes * Add Ribbon filter to stress/crash test * Add Ribbon filter to filter_bench as -impl=3 * Add option string support, as in "filter_policy=experimental_ribbon:5.678;" where 5.678 is the Bloom equivalent bits per key. * Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom * Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on binary searching CalculateSpace (inefficient), so that subclasses (especially experimental ones) don't have to provide an efficient implementation inverting CalculateSpace. * Minor refactor FastLocalBloomBitsBuilder for new base class XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder, which allows the latter to fall back on Bloom construction in some extreme cases. * Mostly updated bloom_test for Ribbon filter, though a test like FullBloomTest::Schema is a next TODO to ensure schema stability (in case this becomes production-ready schema as it is). * Add some APIs to ribbon_impl.h for configuring Ribbon filters. Although these are reasonably covered by bloom_test, TODO more unit tests in ribbon_test * Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data for constructing the linear approximations in GetNumSlotsFor95PctSuccess. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658 Test Plan: Some unit tests updated but other testing is left TODO. This is considered experimental but laying down schema compatibility as early as possible in case it proves production-quality. Also tested in stress/crash test. Reviewed By: jay-zhuang Differential Revision: D24899349 Pulled By: pdillinger fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8	2020-11-12 20:46:14 -08:00
Yanqin Jin	8b6b6aeb1a	Refactor with VersionEditHandler (#6581 ) Summary: Added a few classes in the same class hierarchy to remove code duplication and refactor the logic of reading and processing MANIFEST files. New classes are as follows. ``` class VersionEditHandlerBase; class ListColumnFamiliesHandler : VersionEditHandlerBase; class FileChecksumRetriever : VersionEditHandlerBase; class DumpManifestHandler : VersionEditHandler; ``` Classes that already existed before this PR are as follows. ``` class VersionEditHandler : VersionEditHandlerBase; ``` With these classes, refactored functions: `VersionSet::Recover()`, `VersionSet::ListColumnFamilies()`, `VersionSet::DumpManifest()`, `GetFileChecksumFromManifest()`. Test Plan (devserver): ``` make check COMPILE_WITH_ASAN=1 make check ``` These refactored code, especially recovery-related logic, will be tested intensively by all existing unit tests and stress tests. For example, run ``` make crash_test ``` Verified 3 successful runs on devserver. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6581 Reviewed By: ajkr Differential Revision: D20616217 Pulled By: riversand963 fbshipit-source-id: 048c7743aa4be2623ccd0cc3e61c0027e604e78b	2020-11-11 08:00:14 -08:00
Peter Dillinger	c57f914482	Use NPHash64 in more places (#7632 ) Summary: Since the hashes should not be persisted in output_validator nor mock_env. Also updated NPHash64 to use 64-bit seed, and comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7632 Test Plan: make check, and new build setting that enables modification to NPHash64, to check for behavior depending on specific values. Added that setting to one of the CircleCI configurations. Reviewed By: jay-zhuang Differential Revision: D24833780 Pulled By: pdillinger fbshipit-source-id: 02a57652ccf1ac105fbca79e77875bb7bf7c071f	2020-11-10 23:42:13 -08:00
Peter Dillinger	8b8a2e9f05	Ribbon: major re-work of hashing, seeds, and more (#7635 ) Summary: * Fully optimized StandardHasher, in terms of efficiently generating Start, CoeffRow, and ResultRow from a stock hash value, with sufficient independence between them to have no measurably degraded behavior. (Degraded behavior would be an FP rate higher than explainable by 2^-b and, if using a 32-bit stock hash function, expected stock hash collisions.) Details in code comments. * Our standard 64-bit and 32-bit hash functions do not exhibit sufficient independence on sequential seeds (for one Ribbon construction attempt to have independent probability from the next). I have worked around this in the Ribbon code by "pre-mixing" "ordinal seeds," sequentially tried and appropriate for storage in persisted metadata, into "raw seeds," ready for application and appropriate for in-memory storage. This way the pre-mixing step (though fast) is only applied on loading or configuring the structure, not on each query or banding add. * Fix a subtle flaw in which backtracking not clearing ResultRow data could lead to elevated FP rate on keys that were backtracked on and should (for generality) exhibit the same FP rate as novel keys. * Added a basic test for PhsfQuery and construction algorithms (map or "retrieval structure" rather than set or filter), and made a few trivial related fixes. * Better random configuration generation in unit tests * Some other minor cleanup / clarification / etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7635 Test Plan: unit tests included Reviewed By: jay-zhuang Differential Revision: D24738978 Pulled By: pdillinger fbshipit-source-id: f9d03599d9e2ca3e30e9d3e7d81cd936b56f76f0	2020-11-07 17:22:54 -08:00
Peter Dillinger	746909ceda	Ribbon: InterleavedSolutionStorage (#7598 ) Summary: The core algorithms for InterleavedSolutionStorage and the implementation SerializableInterleavedSolution make Ribbon fast for filter queries. Example output from new unit test: Simple outside query, hot, incl hashing, ns/key: 117.796 Interleaved outside query, hot, incl hashing, ns/key: 42.2655 Bloom outside query, hot, incl hashing, ns/key: 24.0071 Also includes misc cleanup of previous Ribbon code and comments. Some TODOs and FIXMEs remain for futher work / investigation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7598 Test Plan: unit tests included (integration work and tests coming later) Reviewed By: jay-zhuang Differential Revision: D24559209 Pulled By: pdillinger fbshipit-source-id: fea483cd354ba782aea3e806f2bc96e183d59441	2020-11-03 12:46:36 -08:00
Andrew Kryczka	1adbceb581	Expand effect of dictionary settings in `ColumnFamilyOptions::compression_opts` (#7619 ) Summary: In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it for bottom level under the assumption that the vast majority of data is stored there. At that time, there was no such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just hardcoded disabling dictionary compression in flush and compactions to non-bottommost level. Now, we have users who generate all their files through flush and are considering using dictionary compression. To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to additionally include flushed files and files generated by compaction to a non-bottommost level. Users can still get the old behavior by moving their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts` and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619 Reviewed By: ltamasi Differential Revision: D24665610 Pulled By: ajkr fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c	2020-11-02 19:21:11 -08:00
Yanqin Jin	394210f280	Remove unused includes (#7604 ) Summary: This is a PR generated semi-automatically by an internal tool to remove unused includes and `using` statements. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604 Test Plan: make check Reviewed By: ajkr Differential Revision: D24579392 Pulled By: riversand963 fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd	2020-10-28 23:22:27 -07:00
mrambacher	f35f7f2704	Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566 ) Summary: This PR does a few things: 1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation. 2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed: - The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated - The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory). 3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10). I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged. Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently. Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566 Reviewed By: zhichao-cao Differential Revision: D24408980 Pulled By: jay-zhuang fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f	2020-10-27 10:33:09 -07:00
Peter Dillinger	25d54c799c	Ribbon: initial (general) algorithms and basic unit test (#7491 ) Summary: This is intended as the first commit toward a near-optimal alternative to static Bloom filters for SSTs. Stephan Walzer and I have agreed upon the name "Ribbon" for a PHSF based on his linear system construction in "Efficient Gauss Elimination for Near-Quadratic Matrices with One Short Random Block per Row, with Applications" ("SGauss") and my much faster "on the fly" algorithm for gaussian elimination (or for this linear system, "banding"), which can be faster than peeling while also more compact and flexible. See util/ribbon_alg.h for more detailed introduction and background. RIBBON = Rapid Incremental Boolean Banding ON-the-fly This commit just adds generic (templatized) core algorithms and a basic unit test showing some features, including the ability to construct structures within 2.5% space overhead vs. information theoretic lower bound. (Compare to cache-local Bloom filter's ~50% space overhead -> ~30% reduction anticipated.) This commit does not include the storage scheme necessary to make queries fast, especially for filter queries, nor fractional "result bits", but there is some description already and those implementations will come soon. Nor does this commit add FilterPolicy support, for use in SST files, but that will also come soon. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7491 Reviewed By: jay-zhuang Differential Revision: D24517954 Pulled By: pdillinger fbshipit-source-id: 0119ee597e250d7e0edd38ada2ba50d755606fa7	2020-10-25 20:44:49 -07:00
Peter Dillinger	a16d1b2fd3	Add Encode/DecodeFixedGeneric, coding_lean.h (#7587 ) Summary: To minimize dependencies for Ribbon filter code in progress, core part of coding.h for fixed sizes has been moved to coding_lean.h. Also, generic versions of these functions have been added to math128.h (since the generic versions are likely only to be used along with Unsigned128). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7587 Test Plan: Unit tests added for new functions Reviewed By: jay-zhuang Differential Revision: D24486718 Pulled By: pdillinger fbshipit-source-id: a69768f742379689442135fa52237c01dfe2647e	2020-10-23 14:11:15 -07:00
Cheng Chang	cb2581031a	Add macos tests to circleci (#7536 ) Summary: as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/7536 Test Plan: see the new `build-macos` tests pass in circleci Reviewed By: jay-zhuang Differential Revision: D24243218 Pulled By: cheng-chang fbshipit-source-id: 9b5f8a859e54c99a9ebe7efff6f336458a5d42de	2020-10-12 10:46:40 -07:00
peterpaule	60649aa01b	Fix wrong comments about function TruncateToPageBoundary. (#6975 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6975 Reviewed By: ajkr Differential Revision: D22914797 Pulled By: riversand963 fbshipit-source-id: 5be2cd322d41447f638dba1336e96dcdc090f9dd	2020-10-07 12:34:34 -07:00
Peter Dillinger	9082771b86	Add is_full_compaction to CompactionJobStats, cleanup (#7451 ) Summary: This exposes to the listener interface whether a compaction was full or not. Also cleaned up API comment for CompactionJobInfo::stats, which is not of a nullable type. And since CompactionJob is always created with non-null CompactionJobStats, removed conditionals on it being nullptr and instead assert non-null. TODO later: update C and Java interfaces Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451 Test Plan: updated existing unit tests to check new field, make check Reviewed By: ltamasi Differential Revision: D23977796 Pulled By: pdillinger fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d	2020-10-01 12:52:58 -07:00
Koby Kahane	3e745053b7	Fix MSVC-related build issues (#7439 ) Summary: This PR addresses some build and functional issues on MSVC targets, as a step towards an eventual goal of having RocksDB build successfully for Windows on ARM64. Addressed issues include: - BitsSetToOne and CountTrailingZeroBits do not compile on non-x64 MSVC targets. A fallback implementation of BitsSetToOne when Intel intrinsics are not available is added, based on the C++20 `<bit>` popcount implementation in Microsoft's STL. - The implementation of FloorLog2 for MSVC targets (including x64) gives incorrect results. The unit test easily detects this, but CircleCI is currently configured to only run a specific set of tests for Windows CMake builds, so this seems to have been unnoticed. - AsmVolatilePause does not use YieldProcessor on Windows ARM64 targets, even though it is available. - When CondVar::TimedWait calls Microsoft STL's condition_variable::wait_for, it can potentially trigger a bug (just recently fixed in the upcoming VS 16.8's STL) that deadlocks various tests that wait for a timer to execute, since `Timer::Run` doesn't get a chance to execute before being blocked by the test function acquiring the mutex. - In c_test, `GetTempDir` assumes a POSIX-style temp path. - `NormalizePath` did not eliminate consecutive POSIX-style path separators on Windows, resulting in test failures in e.g., wal_manager_test. - Various other test failures. In a followup PR I hope to modify CircleCI's config.yml to invoke all RocksDB unit tests in Windows CMake builds with CTest, instead of the current use of `run_ci_db_test.ps1` which requires individual tests to be specified and is missing many of the existing tests. Notes from peterd: FloorLog2 is not yet used in production code (it's for something in progress). I also added a few more inexpensive platform-dependent tests to Windows CircleCI runs. And included facebook/folly#1461 as requested Pull Request resolved: https://github.com/facebook/rocksdb/pull/7439 Reviewed By: jay-zhuang Differential Revision: D24021563 Pulled By: pdillinger fbshipit-source-id: 0ec2027c0d6a494d8a0fe38d9667fc2f7e29f7e7	2020-10-01 09:23:04 -07:00
Yanqin Jin	8c7bac6491	Check status for file_reader_writer_test (#7449 ) Summary: Check the status for file_reader_writer_test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7449 Test Plan: ``` ASSERT_STATUS_CHECKED=1 make -j20 file_reader_writer_test ./file_reader_writer_test ``` Reviewed By: zhichao-cao Differential Revision: D23975609 Pulled By: riversand963 fbshipit-source-id: a468eb04b386967fcc0478a56e4f0a19bdf81cdf	2020-09-28 16:05:11 -07:00
Peter Dillinger	08552b19d3	Genericize and clean up FastRange (#7436 ) Summary: A generic algorithm in progress depends on a templatized version of fastrange, so this change generalizes it and renames it to fit our style guidelines, FastRange32, FastRange64, and now FastRangeGeneric. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436 Test Plan: added a few more test cases Reviewed By: jay-zhuang Differential Revision: D23958153 Pulled By: pdillinger fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8	2020-09-28 11:35:00 -07:00
Levi Tamasi	30fb9dd50f	Introduce a helper method UncompressData (#7434 ) Summary: The patch introduces a helper method in `util/compression.h` called `UncompressData` that dispatches calls to the correct uncompression method based on type, and changes `UncompressBlockContentsForCompressionType` and `Benchmark::Uncompress` in `db_bench` so they are implemented in terms of the new method. This eliminates some code duplication. (`Benchmark::Compress` is also updated to use the previously introduced `CompressData` helper.) In addition, the patch brings the implementation of `Snappy_Uncompress` into sync with the other uncompression methods by making the method compute the buffer size and allocate the buffer itself. Finally, the patch eliminates some potentially risky back-and-forth conversions between various unsigned and signed integer types by exposing the size of the allocated buffer as a `size_t` instead of an `int`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7434 Test Plan: `make check` `./db_bench -benchmarks=compress,uncompress --compression_type ...` Reviewed By: riversand963 Differential Revision: D23900011 Pulled By: ltamasi fbshipit-source-id: b25df63ceec4639889be94acb22eb53e530c54e0	2020-09-25 09:01:45 -07:00
Yuqi Gu	29f7bbef99	Fix RocksDB SIGILL error on Raspberry PI 4 (#7233 ) Summary: Issue：https://github.com/facebook/rocksdb/issues/7042 No PMULL runtime check will lead to SIGILL on a Raspberry pi 4. Leverage 'getauxval' to get Hardware-Cap to detect whether target platform does support PMULL or not in runtime. Consider the condition that the target platform does support crc32 but not support PMULL. In this condition, the code should leverage the crc32 instruction rather than skip all hardware crc32 instruction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7233 Reviewed By: jay-zhuang Differential Revision: D23790116 fbshipit-source-id: a3ebd821fbd4a38dd2f59064adbb7c3013ee8140	2020-09-22 10:41:19 -07:00
mrambacher	7d472accdc	Bring the Configurable options together (#5753 ) Summary: This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753 Reviewed By: ajkr Differential Revision: D23385030 Pulled By: zhichao-cao fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a	2020-09-14 17:01:01 -07:00
anand76	18a3227b12	Add a new IOStatus subcode to indicate that writes are fenced off (#7374 ) Summary: In a distributed file system, directory ownership is enforced by fencing off the previous owner once they've been preempted by a new owner. This PR adds a IOStatus subcode for ```StatusCode::IOError``` to indicate this. Once this error is returned for a file write, the DB is put in read-only mode and not allowed to resume in read-write mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374 Test Plan: Add new unit tests in ```error_handler_fs_test``` Reviewed By: riversand963 Differential Revision: D23687777 Pulled By: anand1976 fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f	2020-09-14 16:04:47 -07:00
Peter Dillinger	c4d8838a2b	New bit manipulation functions and 128-bit value library (#7338 ) Summary: These new functions and 128-bit value bit operations are expected to be used in a forthcoming Bloom filter alternative. No functional changes to production code, just new code only called by unit tests, cosmetic changes to existing headers, and fix an existing function for a yet-unused template instantiation (BitsSetToOne on something signed and smaller than 32 bits). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7338 Test Plan: Unit tests included. Works with and without TEST_UINT128_COMPAT=1 to check compatibility with and without __uint128_t. Also added that parameter to the CircleCI build build-linux-shared_lib-alt_namespace-status_checked. Reviewed By: jay-zhuang Differential Revision: D23494945 Pulled By: pdillinger fbshipit-source-id: 5c0dc419100d9df5d4d9abb153b2855d5aea39e8	2020-09-03 09:32:59 -07:00
Peter Dillinger	9aad24da55	Real fix for race in backup custom checksum checking (#7309 ) Summary: This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294. To get DB checksum info for live files, we now read the manifest file that will become part of the checkpoint/backup. This requires a little extra handling in taking a custom checkpoint, including only reading the manifest file up to the size prescribed by the checkpoint. This moves GetFileChecksumsFromManifest from backup code to file_checksum_helper.{h,cc} and removes apparently unnecessary checking related to column families. Updated HISTORY.md and warned potential future users of DB::GetLiveFilesChecksumInfo() Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309 Test Plan: updated unit test, before and after Reviewed By: ajkr Differential Revision: D23311994 Pulled By: pdillinger fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5	2020-08-26 10:39:20 -07:00
Jay Zhuang	e500c730cf	Shutdown timer in destructor (#7292 ) Summary: Make sure deleting a running timer works fine. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7292 Test Plan: unittest and an invalid benchmark command: `./db_bench --db=/tmp --use_existing_db=false --benchmarks=fred --compression_type=none` Reviewed By: riversand963 Differential Revision: D23248500 Pulled By: jay-zhuang fbshipit-source-id: 04111681b389a9aa23a439db4568d5ca351f1144	2020-08-21 15:48:52 -07:00
Jay Zhuang	187964a039	Add test function MockTimeEnv.SleepForMicroseconds() (#7293 ) Summary: And change the internal time value from seconds to microseconds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7293 Reviewed By: pdillinger Differential Revision: D23253751 Pulled By: jay-zhuang fbshipit-source-id: 36aa9376b8801b85bd10163173590a17cf4f3a3a	2020-08-21 11:34:37 -07:00
Jay Zhuang	3e422ce0ca	Fix a timer_test deadlock (#7277 ) Summary: There's a potential deadlock caused by MockTimeEnv time value get to a large number, which causes TimedWait() wait forever. The test misuses the microseconds as seconds, making it more likely to happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7277 Reviewed By: pdillinger Differential Revision: D23183873 Pulled By: jay-zhuang fbshipit-source-id: 6fc38ebd40b4125a99551204b271f91a27e70086	2020-08-20 08:43:13 -07:00
Jay Zhuang	69760b4d05	Introduce a global StatsDumpScheduler for stats dumping (#7223 ) Summary: Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223 Reviewed By: riversand963 Differential Revision: D23056737 Pulled By: jay-zhuang fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba	2020-08-14 20:12:44 -07:00
Levi Tamasi	9d6f48ec1d	Clean up CompressBlock/CompressBlockInternal a bit (#7249 ) Summary: The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit. In particular, it does the following: * It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`, where other general compression-related utilities are located. This will facilitate reuse in the BlobDB write path. * The signature of the method is changed so it now takes `compression_format_version` (similarly to the compression library specific methods) instead of `format_version` (which is specific to the block based table). * `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter. This parameter was only used in a (not entirely up-to-date) assertion; also, removing it eliminates the need to ensure this precondition holds at all call sites. * Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass only one of `sampled_output_fast` and `sampled_output_slow`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23087278 Pulled By: ltamasi fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3	2020-08-12 18:25:48 -07:00
Zitan Chen	b578ca2e4d	BackupEngine supports custom file checksums (#7085 ) Summary: A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups. Tests are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085 Test Plan: Passed make check Reviewed By: pdillinger Differential Revision: D22390756 Pulled By: gg814 fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f	2020-08-12 13:31:09 -07:00
Yanqin Jin	76609cd38a	Fix potential memory leak (#7245 ) Summary: ``` int* value = new int; ASSERT_NE(nullptr, value); ``` `ASSERT_NE` can expand the expression such that a memory leak is reported by clang analyzer. We can remove this ASSERT_NE since we can assume the memory allocation must succeed. Otherwise a bad alloc exception will be thrown and the process will be killed anyway. Test plan (dev server): ``` USE_CLANG=1 make analyze ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7245 Reviewed By: jay-zhuang Differential Revision: D23079641 Pulled By: riversand963 fbshipit-source-id: a6739a903f90f8715f6f1ef3e5c8a329245b8e78	2020-08-12 12:03:22 -07:00
Yanqin Jin	f15414b656	Timer should run scheduled function without mutex (#7228 ) Summary: Timer (defined in timer.h) schedules and runs user-specified fuctions regularly. Current implementation holds the mutex while running user function, which will lead to contention and waiting. To fix, Timer::Run releases mutex before running user function, and re-acquires it afterwards. This fix will impact how we can cancel a task. If the task is running, it is not holding the mutex. The thread calling Cancel() should wait until the current task finishes. Test Plan (devserver): make check COMPILE_WITH_ASAN=1 make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/7228 Reviewed By: jay-zhuang Differential Revision: D23065487 Pulled By: riversand963 fbshipit-source-id: 07cb59741f506d3eb875c8ab90f73437568d3724	2020-08-11 22:37:39 -07:00
Peter Dillinger	6ac1d25fd0	Fix+clean up handling of mock sleeps (#7101 ) Summary: We have a number of tests hanging on MacOS and windows due to mishandling of code for mock sleeps. In addition, the code was in terrible shape because the same variable (addon_time_) would sometimes refer to microseconds and sometimes to seconds. One test even assumed it was nanoseconds but was written to pass anyway. This has been cleaned up so that DB tests generally use a SpecialEnv function to mock sleep, for either some number of microseconds or seconds depending on the function called. But to call one of these, the test must first call SetMockSleep (precondition enforced with assertion), which also turns sleeps in RocksDB into mock sleeps. To also removes accounting for actual clock time, call SetTimeElapseOnlySleepOnReopen, which implies SetMockSleep (on DB re-open). This latter setting only works by applying on DB re-open, otherwise havoc can ensue if Env goes back in time with DB open. More specifics: Removed some unused test classes, and updated comments on the general problem. Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead of mock time. For this we have the only modification to production code, inserting a sync point callback in flush_job.cc, which is not a change to production behavior. Removed unnecessary resetting of mock times to 0 in many tests. RocksDB deals in relative time. Any behaviors relying on absolute date/time are likely a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one clearly injecting a specific absolute time for actual testing convenience.) Just in case I misunderstood some test, I put this note in each replacement: // NOTE: Presumed unnecessary and removed: resetting mock time in env Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and FilterCompactionTimeTest in db_test.cc stats_history_test and blob_db_test are each their own beast, rather deeply dependent on MockTimeEnv. Each gets its own variant of a work-around for TimedWait in a mock time environment. (Reduces redundancy and inconsistency in stats_history_test.) Intended follow-up: Remove TimedWait from the public API of InstrumentedCondVar, and only make that accessible through Env by passing in an InstrumentedCondVar and a deadline. Then the Env implementations mocking time can fix this problem without using sync points. (Test infrastructure using sync points interferes with individual tests' control over sync points.) With that change, we can simplify/consolidate the scattered work-arounds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101 Test Plan: make check on Linux and MacOS Reviewed By: zhichao-cao Differential Revision: D23032815 Pulled By: pdillinger fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b	2020-08-11 12:41:30 -07:00
Jay Zhuang	fea286d914	Fix Timer unable to schedule new added job (#7216 ) Summary: And added test to reproduce the problem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7216 Reviewed By: riversand963 Differential Revision: D22905193 Pulled By: jay-zhuang fbshipit-source-id: 8ca1435c91bf829f9076c743bdd66861364ff68c	2020-08-04 09:20:28 -07:00
Jay Zhuang	d941b89ddd	Fix hang timer tests on macos (#7208 ) Summary: And re-enable disabled tests. The issue is caused by `CondVar.TimedWait()` doesn't use `MockTimeEnv`. Issue: https://github.com/facebook/rocksdb/issues/6698 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7208 Test Plan: `./timer_test --gtest_repeat=1000` Reviewed By: riversand963 Differential Revision: D22857855 Pulled By: jay-zhuang fbshipit-source-id: 6d15f65f6ae58b75b76cb132815c16ad81ffd12f	2020-08-03 10:17:01 -07:00
yxj25245	e8d5a24815	Fix typo in ThreadData comment (#7131 ) Summary: Fix typo in ThreadData comment Pull Request resolved: https://github.com/facebook/rocksdb/pull/7131 Reviewed By: riversand963 Differential Revision: D22543135 Pulled By: jay-zhuang fbshipit-source-id: 39c9d0e8cd5a364af9a2f05fd3783e8482dea976	2020-07-15 09:23:23 -07:00
Zhichao Cao	a9a973869a	Fix status message size assert (#7045 ) Summary: In status.cc, the assert is `assert(sizeof(msgs) > index)`; msgs is a const char* array, sizeof(msgs) is the array sizechar size, which will make the assert pass all the time. Change it to sizeof(msgs)/sizeof(char*) > index. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7045 Test Plan: pass make check Reviewed By: cheng-chang Differential Revision: D22291337 Pulled By: zhichao-cao fbshipit-source-id: 4ba8ebbb8da80ace7ca6adcdb0c66726f993659d	2020-07-09 18:12:55 -07:00
mrambacher	c7c7b07f06	More Makefile Cleanup (#7097 ) Summary: Cleans up some of the dependencies on test code in the Makefile while building tools: - Moves the test::RandomString, DBBaseTest::RandomString into Random - Moves the test::RandomHumanReadableString into Random - Moves the DestroyDir method into file_utils - Moves the SetupSyncPointsToMockDirectIO into sync_point. - Moves the FaultInjection Env and FS classes under env These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies. By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated. Tested both release and debug builds via Make and CMake for both static and shared libraries. More work remains to clean up how the tools are built and remove some unnecessary dependencies. There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097 Reviewed By: riversand963 Differential Revision: D22463160 Pulled By: pdillinger fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2	2020-07-09 14:35:17 -07:00
Andrew Kryczka	dd29ad4223	Separate internal and user key comparators in `BlockIter` (#6944 ) Summary: Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944 Test Plan: benchmark setup -- single file DBs, in-memory, no compression. "normal_db" created by regular flush; "ingestion_db" created by ingesting a file. Both DBs have same contents. ``` $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000 $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst \| awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}') $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ ``` benchmark run command: ``` $ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false ``` results: perf improved marginally for ingestion_db and did not change significantly for normal_db: SEEK_NEXT \| DB \| code \| ops/sec \| % change -- \| -- \| -- \| -- \| -- 0 \| normal_db \| master \| 350880 \| 0 \| normal_db \| PR6944 \| 351040 \| 0.0 0 \| ingestion_db \| master \| 343255 \| 0 \| ingestion_db \| PR6944 \| 349424 \| 1.8 10 \| normal_db \| master \| 218711 \| 10 \| normal_db \| PR6944 \| 217892 \| -0.4 10 \| ingestion_db \| master \| 220334 \| 10 \| ingestion_db \| PR6944 \| 226437 \| 2.8 Reviewed By: pdillinger Differential Revision: D21924676 Pulled By: ajkr fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538	2020-07-07 17:26:16 -07:00
Jay Zhuang	00de699096	Replace reinterpret_cast with static_cast_with_check (#7067 ) Summary: Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067 Reviewed By: siying Differential Revision: D22361587 Pulled By: jay-zhuang fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1	2020-07-02 19:25:41 -07:00
Daniel Black	c2b0b696c4	filelock_test: add freebsd headers for waitpid (#7010 ) Summary: Per manual https://www.unix.com/man-page/FreeBSD/2/waitpid Pull Request resolved: https://github.com/facebook/rocksdb/pull/7010 Reviewed By: siying Differential Revision: D22176164 Pulled By: ajkr fbshipit-source-id: 0a850ae6f1791d10951d5e4a79cfee01a3981d5a	2020-06-25 17:25:42 -07:00
Peter Dillinger	5b2bbacb6f	Minimize memory internal fragmentation for Bloom filters (#6427 ) Summary: New experimental option BBTO::optimize_filters_for_memory builds filters that maximize their use of "usable size" from malloc_usable_size, which is also used to compute block cache charges. Rather than always "rounding up," we track state in the BloomFilterPolicy object to mix essentially "rounding down" and "rounding up" so that the average FP rate of all generated filters is the same as without the option. (YMMV as heavily accessed filters might be unluckily lower accuracy.) Thus, the option near-minimizes what the block cache considers as "memory used" for a given target Bloom filter false positive rate and Bloom filter implementation. There are no forward or backward compatibility issues with this change, though it only works on the format_version=5 Bloom filter. With Jemalloc, we see about 10% reduction in memory footprint (and block cache charge) for Bloom filters, but 1-2% increase in storage footprint, due to encoding efficiency losses (FP rate is non-linear with bits/key). Why not weighted random round up/down rather than state tracking? By only requiring malloc_usable_size, we don't actually know what the next larger and next smaller usable sizes for the allocator are. We pick a requested size, accept and use whatever usable size it has, and use the difference to inform our next choice. This allows us to narrow in on the right balance without tracking/predicting usable sizes. Why not weight history of generated filter false positive rates by number of keys? This could lead to excess skew in small filters after generating a large filter. Results from filter_bench with jemalloc (irrelevant details omitted): (normal keys/filter, but high variance) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.6278 Number of filters: 5516 Total size (MB): 200.046 Reported total allocated memory (MB): 220.597 Reported internal fragmentation: 10.2732% Bits/key stored: 10.0097 Average FP rate %: 0.965228 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.5104 Number of filters: 5464 Total size (MB): 200.015 Reported total allocated memory (MB): 200.322 Reported internal fragmentation: 0.153709% Bits/key stored: 10.1011 Average FP rate %: 0.966313 (very few keys / filter, optimization not as effective due to ~59 byte internal fragmentation in blocked Bloom filter representation) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.5649 Number of filters: 162950 Total size (MB): 200.001 Reported total allocated memory (MB): 224.624 Reported internal fragmentation: 12.3117% Bits/key stored: 10.2951 Average FP rate %: 0.821534 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 31.8057 Number of filters: 159849 Total size (MB): 200 Reported total allocated memory (MB): 208.846 Reported internal fragmentation: 4.42297% Bits/key stored: 10.4948 Average FP rate %: 0.811006 (high keys/filter) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.7017 Number of filters: 164 Total size (MB): 200.352 Reported total allocated memory (MB): 221.5 Reported internal fragmentation: 10.5552% Bits/key stored: 10.0003 Average FP rate %: 0.969358 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.7131 Number of filters: 160 Total size (MB): 200.928 Reported total allocated memory (MB): 200.938 Reported internal fragmentation: 0.00448054% Bits/key stored: 10.1852 Average FP rate %: 0.963387 And from db_bench (block cache) with jemalloc: $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ (for FILE in /dev/shm/dbbench.no_optimize/.sst; do ./sst_dump --file=$FILE --show_properties \| grep 'filter block' ; done) \| awk '{ t += $4; } END { print t; }' 17063835 $ (for FILE in /dev/shm/dbbench/.sst; do ./sst_dump --file=$FILE --show_properties \| grep 'filter block' ; done) \| awk '{ t += $4; } END { print t; }' 17430747 $ #^ 2.1% additional filter storage $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8440400 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 21087528 rocksdb.bloom.filter.useful COUNT : 4963889 rocksdb.bloom.filter.full.positive COUNT : 1214081 rocksdb.bloom.filter.full.true.positive COUNT : 1161999 $ #^ 1.04 % observed FP rate $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8448592 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 18220328 rocksdb.bloom.filter.useful COUNT : 5360933 rocksdb.bloom.filter.full.positive COUNT : 1321315 rocksdb.bloom.filter.full.true.positive COUNT : 1262999 $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters (Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427 Test Plan: unit test added, 'make check' with gcc, clang and valgrind Reviewed By: siying Differential Revision: D22124374 Pulled By: pdillinger fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e	2020-06-22 13:32:07 -07:00
Cheng Chang	f7613e2a9e	Make it able to lower cpu priority to specific level in threadpool (#6969 ) Summary: `Env::LowerThreadPoolCPUPriority` takes a new parameter `CpuPriority` to be able to lower to a specific priority such as `CpuPriority::kIdle`, previously, the priority is always lowered to `CpuPriority::kLow`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6969 Test Plan: unit test `EnvPosixTest::LowerThreadPoolCpuPriority` added to `env_test.cc`. Reviewed By: siying Differential Revision: D22011169 Pulled By: cheng-chang fbshipit-source-id: 568878c24a924912e35cef00c552d4a63431cdf4	2020-06-13 13:25:20 -07:00
sdong	31bd2d790e	Fix ThreadLocalTest.SequentialReadWriteTest failure when running individually (#6929 ) Summary: When running ThreadLocalTest.SequentialReadWriteTest individually, the test fails with: ] ./thread_local_test --gtest_filter="SequentialReadWriteTest" Note: Google Test filter = SequentialReadWriteTest [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from ThreadLocalTest [ RUN ] ThreadLocalTest.SequentialReadWriteTest internal_repo_rocksdb/repo/util/thread_local_test.cc:144: Failure Expected: IDChecker::PeekId() Which is: 3 To be equal to: base_id + 1u Which is: 2 [ FAILED ] ThreadLocalTest.SequentialReadWriteTest (1 ms) [----------] 1 test from ThreadLocalTest (1 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (1 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] ThreadLocalTest.SequentialReadWriteTest 1 FAILED TEST It appears that when running as the first test, PeakId() was updated twice. I didn't dig into it why but it doesn't seem to break the contract. Relax the assertion to make it pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6929 Test Plan: Run the test individually and as the whole thread_local_test Reviewed By: riversand963 Differential Revision: D21873999 fbshipit-source-id: 1dcb6a2e9c38b6afd848027308bfe633342b7548	2020-06-04 11:44:09 -07:00
sdong	afa3518839	Revert "Update googletest from 1.8.1 to 1.10.0 (#6808 )" (#6923 ) Summary: This reverts commit `8d87e9cea1`. Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923 Reviewed By: pdillinger Differential Revision: D21864799 fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc	2020-06-03 15:55:03 -07:00
Zhichao Cao	2adb7e3768	Fix potential overflow of unsigned type in for loop (#6902 ) Summary: x.size() -1 or y - 1 can overflow to an extremely large value when x.size() pr y is 0 when they are unsigned type. The end condition of i in the for loop will be extremely large, potentially causes segment fault. Fix them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902 Test Plan: pass make asan_check Reviewed By: ajkr Differential Revision: D21843767 Pulled By: zhichao-cao fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1	2020-06-02 15:05:07 -07:00
Adam Retter	8d87e9cea1	Update googletest from 1.8.1 to 1.10.0 (#6808 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808 Reviewed By: anand1976 Differential Revision: D21483984 Pulled By: pdillinger fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598	2020-06-01 20:33:42 -07:00
mrambacher	826295a5e9	Change autovector to have a reserved size in LITE mode (#6868 ) Summary: Previously in LITE mode, an autovector did not have a reserved size. When elements were added to the vector, the underlying array could be reallocated. There was a set of code that never expands the autovector and was doing &autovector::back(). When the vector is resized, the old addresses may become invalid, causing a later exception to be thrown. By reserving space in the autovector up front, this problem is eliminated for those uses where the vector will never exceed the initial size. the resize happens, these pointers become invalid, leading to SEGV or other exceptions. This change allows the autovector to be fully populated before we take the address of any of its elements, thereby elminating the potential for a resize. There is comparable code to this change in Version::MultiGet for dealing with the context objects. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6868 Reviewed By: ajkr Differential Revision: D21693505 Pulled By: cheng-chang fbshipit-source-id: e71d516b15e08f202593cb80f2a42f048fc95768	2020-05-21 14:48:10 -07:00
mrambacher	38be686160	Add Struct Type to OptionsTypeInfo (#6425 ) Summary: Added code for generically handing structs to OptionTypeInfo. A struct is a collection of variables handled by their own map of OptionTypeInfos. Examples of structs include Compaction and Cache options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6425 Reviewed By: siying Differential Revision: D21668789 Pulled By: zhichao-cao fbshipit-source-id: 064b110de39dadf82361ed4663f7ac1a535b0b07	2020-05-21 10:58:39 -07:00
Peter Dillinger	c7aedf1b48	Clean up some code related to file checksums (#6861 ) Summary: * Add missing unit test for schema stability of FileChecksumGenCrc32c (previously was only comparing to itself) * A lot of clarifying comments * Add some assertions for preconditions * Rename WritableFileWriter::CalculateFileChecksum -> UpdateFileChecksum * Simplify FileChecksumGenCrc32c with shared functions * Implement EndianSwapValue to replace unused EndianTransform And incidentally since I had trouble with 'make check-format' GitHub action disagreeing with local run, * Output full diagnostic information when 'make check-format' fails in CI Pull Request resolved: https://github.com/facebook/rocksdb/pull/6861 Test Plan: new unit test passes before & after other changes Reviewed By: zhichao-cao Differential Revision: D21667115 Pulled By: pdillinger fbshipit-source-id: 6a99970f87605aa024fa540c78cd519ff322c3e6	2020-05-21 08:12:51 -07:00
Ziyue Yang	c384c08a4f	Add tests for compression failure in BlockBasedTableBuilder (#6709 ) Summary: Currently there is no check for whether BlockBasedTableBuilder will expose compression error status if compression fails during the table building. This commit adds fake faulting compressors and a unit test to test such cases. This check finds 5 bugs, and this commit also fixes them: 1. Not handling compression failure well in BlockBasedTableBuilder::BGWorkWriteRawBlock. 2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD. 3. Wrongly passing the same reference of block contents to BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression. 4. Wrongly setting block_rep->first_key_in_next_block to nullptr in BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data blocks. 5. Not maintaining variables for compression ratio estimation and first_block in BlockBasedTableBuilder::EnterUnbuffered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709 Reviewed By: ajkr Differential Revision: D21236254 fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46	2020-05-12 09:27:35 -07:00
Sagar Vemuri	2e9324718a	Disable a few timer tests (#6833 ) Summary: Disable `TimerTest.SingleScheduleRepeatedlyTest` and `TimerTest.MultipleScheduleRepeatedlyTest`. This is to help people to not hit any hangs (https://github.com/facebook/rocksdb/issues/6698) during their development process while I investigate further; I could not reproduce the issue on my dev machine yet. Note that timer is not being utilized anywhere yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6833 Test Plan: ``` svemuri@devbig187 ~/rocksdb (timer-disable-test) $ TEST_TMPDIR=/dev/shm ./timer_test [==========] Running 2 tests from 1 test case. [----------] Global test environment set-up. [----------] 2 tests from TimerTest [ RUN ] TimerTest.SingleScheduleOnceTest [ OK ] TimerTest.SingleScheduleOnceTest (1 ms) [ RUN ] TimerTest.MultipleScheduleOnceTest [ OK ] TimerTest.MultipleScheduleOnceTest (0 ms) [----------] 2 tests from TimerTest (1 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test case ran. (1 ms total) [ PASSED ] 2 tests. YOU HAVE 2 DISABLED TESTS ``` Reviewed By: pdillinger Differential Revision: D21502474 Pulled By: sagar0 fbshipit-source-id: ac67caee2011fd14ffb2476a8914a6286a4f9abe	2020-05-11 13:30:00 -07:00
sdong	d9cd33516a	Suppress UBSAN warning in CRC32 ARM (#6827 ) Summary: UBSAN shows following warning: util/crc32c_arm64.cc:111:11: runtime error: load of misaligned address 0x00001afcda86 for type 'const uint64_t', which requires 8 byte alignment 0x00001afcda86: note: pointer points here cc c1 2d 00 01 81 40 24 30 66 39 66 30 37 30 63 2d 32 36 63 34 2d 34 62 61 61 2d 38 35 33 31 2d ^ Suppress it just as what we do in x86 CRC. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6827 Test Plan: Run the same UBSAN and see it to pass now. Reviewed By: ltamasi Differential Revision: D21471838 fbshipit-source-id: 02943dd39a7030d2b03e5d894dcb23ed72b6c9c3	2020-05-08 14:14:00 -07:00
Andrew Kryczka	1c84660457	prototype status check enforcement (#6798 ) Summary: Tried making Status object enforce that it is checked in some way. In cases it is not checked, `PermitUncheckedError()` must be called explicitly. Added a way to run tests (`ASSERT_STATUS_CHECKED=1 make -j48 check`) on a whitelist. The effort appears significant to get each test to pass with this assertion, so I only fixed up enough to get one test (`options_test`) working and added it to the whitelist. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6798 Reviewed By: pdillinger Differential Revision: D21377404 Pulled By: ajkr fbshipit-source-id: 73236f9c8df38f01cf24ecac4a6d1661b72d077e	2020-05-08 12:40:43 -07:00
Derrick Pallas	5272305437	Fix FilterBench when RTTI=0 (#6732 ) Summary: The dynamic_cast in the filter benchmark causes release mode to fail due to no-rtti. Replace with static_cast_with_check. Signed-off-by: Derrick Pallas <derrick@pallas.us> Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732 Reviewed By: ltamasi Differential Revision: D21304260 Pulled By: pdillinger fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c	2020-04-29 13:09:23 -07:00
Peter Dillinger	bae6f58696	Basic MultiGet support for partitioned filters (#6757 ) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 \| egrep 'micros/op\|block.cache.filter.hit\|bloom.filter.(full\|use)\|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 \| egrep 'micros/op\|block.cache.filter.hit\|bloom.filter.(full\|use)\|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f	2020-04-28 14:49:34 -07:00
Peter Dillinger	249eff0f30	Stats for redundant insertions into block cache (#6681 ) Summary: Since read threads do not coordinate on loading data into block cache, two threads between Lookup and Insert can end up loading and inserting the same data. This is particularly concerning with cache_index_and_filter_blocks since those are hot and more likely to be race targets if ejected from (or not pre-populated in) the cache. Particularly with moves toward disaggregated / network storage, the cost of redundant retrieval might be high, and we should at least have some hard statistics from which we can estimate impact. Example with full filter thrashing "cliff": $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 ... $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 \| egrep '^rocksdb.block.cache.(.add\|.redundant)' \| grep -v compress \| sort rocksdb.block.cache.add COUNT : 14181 rocksdb.block.cache.add.failures COUNT : 0 rocksdb.block.cache.add.redundant COUNT : 476 rocksdb.block.cache.data.add COUNT : 12749 rocksdb.block.cache.data.add.redundant COUNT : 18 rocksdb.block.cache.filter.add COUNT : 1003 rocksdb.block.cache.filter.add.redundant COUNT : 217 rocksdb.block.cache.index.add COUNT : 429 rocksdb.block.cache.index.add.redundant COUNT : 241 $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 \| egrep '^rocksdb.block.cache.(.add\|.redundant)' \| grep -v compress \| sort rocksdb.block.cache.add COUNT : 1182223 rocksdb.block.cache.add.failures COUNT : 0 rocksdb.block.cache.add.redundant COUNT : 302728 rocksdb.block.cache.data.add COUNT : 31425 rocksdb.block.cache.data.add.redundant COUNT : 12 rocksdb.block.cache.filter.add COUNT : 795455 rocksdb.block.cache.filter.add.redundant COUNT : 130238 rocksdb.block.cache.index.add COUNT : 355343 rocksdb.block.cache.index.add.redundant COUNT : 172478 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681 Test Plan: Some manual testing (above) and unit test covering key metrics is included Reviewed By: ltamasi Differential Revision: D21134113 Pulled By: pdillinger fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87	2020-04-27 13:20:27 -07:00
Tomas Kolda	6ee66cf8f0	Prevents Table Cache to open same files more times (#6707 ) Summary: In highly concurrent requests table cache opens same file more times which lowers purpose of max_open_files. Fixes (https://github.com/facebook/rocksdb/issues/6699) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6707 Reviewed By: ltamasi Differential Revision: D21044965 fbshipit-source-id: f6e91d90b60dad86e518b5147021da42460ee1d2	2020-04-21 13:16:31 -07:00
Peter Dillinger	31da5e34c1	C++20 compatibility (#6697 ) Summary: Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended: * Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists * Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition * std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API * Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line * Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20) * Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor * Added GCC 9 C++11 & GCC9 C++20 in Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697 Test Plan: make check and CI Reviewed By: cheng-chang Differential Revision: D21020318 Pulled By: pdillinger fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91	2020-04-20 13:24:25 -07:00
Peter Dillinger	45d2b4efca	Fix tabs and lint-ignores (#6734 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6734 Reviewed By: cheng-chang Differential Revision: D21134556 Pulled By: pdillinger fbshipit-source-id: 3636cc1d1333137b70031f8277458781c21631fb	2020-04-20 11:39:31 -07:00
Zhichao Cao	38dfa406ff	Add NewFileChecksumGenCrc32cFactory to file checksum (#6688 ) Summary: Add NewFileChecksumGenCrc32cFactory to file checksum public interface such that applications can use the build in crc32 checksum factory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6688 Test Plan: pass make asan_check Reviewed By: riversand963 Differential Revision: D21006859 Pulled By: zhichao-cao fbshipit-source-id: ea8a45196a8b77c310728ab05f6cc0f49f3baef0	2020-04-13 19:13:41 -07:00
Sagar Vemuri	0355d14dd9	Add a simple timer support to schedule work at fixed times/intervals (#6543 ) Summary: Adding a simple timer support to schedule work at a fixed time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6543 Test Plan: TODO: clean up the unit tests, and make them better. Reviewed By: siying Differential Revision: D20465390 Pulled By: sagar0 fbshipit-source-id: cba143f70b6339863e1d0f8b8bf92e51c2b3d678	2020-04-07 11:55:27 -07:00
Peter Dillinger	079e77ff9e	Revamp cache_bench to resemble a real workload (#6629 ) Summary: I suspect LRUCache could use some optimization, and to support such an effort, a good benchmarking tool is needed. The existing cache_bench was heavily skewed toward insertion and lookup misses, and did not saturate memory with other work. This change should improve those things to better resemble a real workload. (All below using clang compiler, for some consistency, but not necessarily same version and settings.) The real workload is from production MySQL on RocksDB, filtering stacks containing "LRU", "ShardedCache" or "CacheShard." Lookup inclusive: 66% Insert inclusive: 17% Release inclusive: 15% An alternate simulated workload is MySQL running a LinkBench read test: Lookup inclusive: 54% Insert inclusive: 24% Release inclusive: 21% cache_bench default settings, prior to this change: Lookup inclusive: 35.8% Insert inclusive: 63.6% Release inclusive: 0% cache_bench after this change (intended as somewhat "tighter" workload than average production, more like LinkBench): Lookup inclusive: 52% Insert inclusive: 20% Release inclusive: 26% And top exclusive stacks (portion of stack samples as filtered above): Production MySQL: LRUHandleTable::FindPointer: 25.3% rocksdb::operator==: 15.1% <-- Slice == LRUCacheShard::LRU_Remove: 13.8% ShardedCache::Lookup: 8.9% __pthread_mutex_lock: 7.1% LRUCacheShard::LRU_Insert: 6.3% MurmurHash64A: 4.8% <-- Since upgraded to XXH3p ... Old cache_bench: LRUHandleTable::FindPointer: 23.6% __pthread_mutex_lock: 15.0% __pthread_mutex_unlock_usercnt: 11.7% __lll_lock_wait: 8.6% __lll_unlock_wake: 6.8% LRUCacheShard::LRU_Insert: 6.0% ShardedCache::Lookup: 4.4% LRUCacheShard::LRU_Remove: 2.8% ... rocksdb::operator==: 0.2% <-- Slice == ... New cache_bench: LRUHandleTable::FindPointer: 22.8% __pthread_mutex_unlock_usercnt: 14.3% rocksdb::operator==: 10.5% <-- Slice == LRUCacheShard::LRU_Insert: 9.0% __pthread_mutex_lock: 5.9% LRUCacheShard::LRU_Remove: 5.0% ... ShardedCache::Lookup: 2.9% ... So there's a bit more lock contention in the benchmark than in production, but otherwise looks similar enough to me. At least it's a big improvement over the existing code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6629 Test Plan: No production code changes, ran cache_bench with ASAN Reviewed By: ltamasi Differential Revision: D20824318 Pulled By: pdillinger fbshipit-source-id: 6f8dc5891ead0f87edbed3a615ecd5289d9abe12	2020-04-03 10:26:49 -07:00
Ziyue Yang	03a781a90c	Add pipelined & parallel compression optimization (#6262 ) Summary: This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262 Reviewed By: ajkr Differential Revision: D20651306 fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb	2020-04-01 16:40:18 -07:00
Zhichao Cao	e8d332d97e	Use FileChecksumGenFactory for SST file checksum (#6600 ) Summary: In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600 Test Plan: tested with make asan_check Reviewed By: riversand963 Differential Revision: D20717670 Pulled By: zhichao-cao fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6	2020-03-29 15:58:46 -07:00
Cheng Chang	ee50b8d499	Be able to decrease background thread's CPU priority when creating database backup (#6602 ) Summary: When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries. This PR makes it possible to decrease CPU priority of these threads when creating a new backup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602 Test Plan: make check Reviewed By: siying, zhichao-cao Differential Revision: D20683216 Pulled By: cheng-chang fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84	2020-03-28 19:07:25 -07:00
sdong	6fd0ed4993	CompactRange() to use bottom pool when goes to bottommost level (#6593 ) Summary: In automatic compaction, if a compaction is bottommost, it goes to bottom thread pool. We should do the same for manual compaction too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6593 Test Plan: Add a unit test. See all existing tests pass. Reviewed By: ajkr Differential Revision: D20637408 fbshipit-source-id: cb03031e8f895085f7acf6d2d65e69e84c9ddef3	2020-03-24 20:24:32 -07:00
sdong	6c50fe1ec9	Change HashMap::Insert()'s value to a const reference (#6567 ) Summary: When building RocksDB on VS2015, an error shows up with hash_map.h(39): error C2719: 'value': formal parameter with requested alignment of 8 won't be aligned Making the reference a reference can solve the problem, and there isn't a reason we can't do that, at least for the current use of the hash map. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6567 Test Plan: See CI tests pass. Reviewed By: pdillinger Differential Revision: D20548543 fbshipit-source-id: 255b55d74cf68a0b324e6f504c56608a97ea6276	2020-03-20 14:59:54 -07:00
Peter Dillinger	db02664f35	Remove XXH3(preview) streaming APIs (#6540 ) Summary: There was an alignment bug in our copy of the streaming APIs for XXH3 (which we dubbed "XXH3p" for "preview" release). Since those APIs are unused and some values for XXH3 have changed since XXH3p, I'm simply removing those APIs, expecting it's better to use finalized XXH3 function if/when we decide to use those APIs (e.g. for checksums). Fixes https://github.com/facebook/rocksdb/issues/6508 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6540 Test Plan: make check Differential Revision: D20479271 Pulled By: pdillinger fbshipit-source-id: 246cf1690d614d3b31042b563d249de32dec1e0d	2020-03-16 17:02:00 -07:00
Huisheng Liu	07a3f7f008	fix MSVC build failures (#6517 ) Summary: fix a few build warnings that are treated as failures with more strict MSVC warning settings Pull Request resolved: https://github.com/facebook/rocksdb/pull/6517 Differential Revision: D20401325 Pulled By: pdillinger fbshipit-source-id: b44979dfaafdc7b3b8cb44a565400a99b331dd30	2020-03-12 08:42:39 -07:00
sdong	331e6199df	Include more information in file lock failure (#6507 ) Summary: When users fail to open a DB with file lock failure, it is sometimes hard for users to debug. We now include the time the lock is acquired and the thread ID that acquired the lock, to help users debug problems like this. Default Env's thread ID is used. Since type of lockedFiles is changed, rename it to follow naming convention too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507 Test Plan: Add a unit test and improve an existing test to validate the case. Differential Revision: D20378333 fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a	2020-03-11 16:23:08 -07:00
Adam Retter	8fc20ac468	Add ppc64le builds to Travis (#6144 ) Summary: Let's see how this goes... Pull Request resolved: https://github.com/facebook/rocksdb/pull/6144 Differential Revision: D20387515 Pulled By: pdillinger fbshipit-source-id: ba2669348c267141dfddff910b4c2224a22cbb38	2020-03-11 12:33:45 -07:00
Yanqin Jin	d93812c9ae	Iterator with timestamp (#6255 ) Summary: Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests. Create an iterator with timestamp. ``` ... read_opts.timestamp = &ts; auto* iter = db->NewIterator(read_opts); // target is key without timestamp. for (iter->Seek(target); iter->Valid(); iter->Next()) {} for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {} delete iter; read_opts.timestamp = &ts1; // lower_bound and upper_bound are without timestamp. read_opts.iterate_lower_bound = &lower_bound; read_opts.iterate_upper_bound = &upper_bound; auto* iter1 = db->NewIterator(read_opts); // Do Seek or SeekToFirst() delete iter1; ``` Test plan (dev server) ``` $make check ``` Simple benchmarking (dev server) 1. The overhead introduced by this PR even when timestamp is disabled. key size: 16 bytes value size: 100 bytes Entries: 1000000 Data reside in main memory, and try to stress iterator. Repeated three times on master and this PR. - Seek without next ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 ``` master: 159047.0 ops/sec this PR: 158922.3 ops/sec (2% drop in throughput) - Seek and next 10 times ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10 ``` master: 109539.3 ops/sec this PR: 107519.7 ops/sec (2% drop in throughput) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255 Differential Revision: D19438227 Pulled By: riversand963 fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746	2020-03-06 16:24:27 -08:00
Cheng Chang	0a0151fb99	Remove memcpy from RandomAccessFileReader::Read in direct IO mode (#6455 ) Summary: In direct IO mode, RandomAccessFileReader::Read allocates an internal aligned buffer, and then copies the result into the scratch buffer. If the result is only temporarily used inside a function, there is no need to do the memcpy and just let the result Slice refer to the internally allocated buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6455 Test Plan: make check Differential Revision: D20106753 Pulled By: cheng-chang fbshipit-source-id: 44f505843837bba47a56e3fa2c4dd3bd76486b58	2020-03-06 14:05:12 -08:00
Fabrice Fontaine	8bbd76edbf	Check for sys/auxv.h (#6359 ) Summary: Check for sys/auxv.h and getauxval before using them as they are not always available (for example on uclibc) Signed-off-by: Fabrice Fontaine <fontaine.fabrice@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6359 Differential Revision: D20239797 fbshipit-source-id: 175a098094d81545628c2372e7c388e70a32fd48	2020-03-03 18:09:59 -08:00
Peter Dillinger	ab65278b1f	Misc filter_bench improvements (#6444 ) Summary: Useful in validating/testing internal fragmentation changes (https://github.com/facebook/rocksdb/issues/6427) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6444 Test Plan: manual (no changes to production code) Differential Revision: D20040076 Pulled By: pdillinger fbshipit-source-id: 32d26f363d2a9ab9f5bebd281dcebd9915ae340e	2020-02-21 13:31:57 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
acelyc111	3a3457575d	Fix compile error when LZ4 is up to r123 (#6412 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6412 Differential Revision: D19914063 Pulled By: ajkr fbshipit-source-id: 4e401e665d4b449d24c4cdec35a4585eeda95996	2020-02-14 15:55:23 -08:00
Cheng Chang	d3ba398bbd	Update unit tests for PinnableSlice (#6399 ) Summary: 1. remove AssertEmpty because calling methods on moved objects is discouraged. 2. add a test to assert that the internal buffer is moved instead of being copied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6399 Test Plan: make slice_test && ./slice_test USE_CLANG=1 make analyze Differential Revision: D19825372 Pulled By: cheng-chang fbshipit-source-id: 2e26f8ce5ec3edbfce067db045e80bd433e704f4	2020-02-10 18:13:27 -08:00
Cheng Chang	dafb568052	Add utility class Defer (#6382 ) Summary: Add a utility class `Defer` to defer the execution of a function until the Defer object goes out of scope. Used in VersionSet:: ProcessManifestWrites as an example. The inline comments for class `Defer` have more details. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6382 Test Plan: `make defer_test version_set_test && ./defer_test && ./version_set_test` Differential Revision: D19797538 Pulled By: cheng-chang fbshipit-source-id: b1a9b7306e4fd4f48ec2ab55783caa561a315f0f	2020-02-10 17:59:47 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
sdong	876c2dbff4	Allow readahead when reading option files. (#6372 ) Summary: Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372 Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace. Differential Revision: D19727739 fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583	2020-02-07 15:18:26 -08:00
Cheng Chang	b42fa1497f	Support move semantics for PinnableSlice (#6374 ) Summary: It's logically correct for PinnableSlice to support move semantics to transfer ownership of the pinned memory region. This PR adds both move constructor and move assignment to PinnableSlice. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6374 Test Plan: A set of unit tests for the move semantics are added in slice_test. So `make slice_test && ./slice_test`. Differential Revision: D19739254 Pulled By: cheng-chang fbshipit-source-id: f898bd811bb05b2d87384ec58b645e9915e8e0b1	2020-02-07 14:26:26 -08:00
Yanqin Jin	f2fbc5d668	Shorten certain test names to avoid infra failure (#6352 ) Summary: Unit test names, together with other components, are used to create log files during some internal testing. Overly long names cause infra failure due to file names being too long. Look for internal tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352 Differential Revision: D19649307 Pulled By: riversand963 fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7	2020-01-30 23:10:24 -08:00
Burton Li	c9a5e48762	fix build warnnings on MSVC (#6309 ) Summary: Fix build warnings on MSVC. siying Pull Request resolved: https://github.com/facebook/rocksdb/pull/6309 Differential Revision: D19455012 Pulled By: ltamasi fbshipit-source-id: 940739f2c92de60e47cc2bed8dd7f921459545a9	2020-01-30 16:07:26 -08:00
Peter Dillinger	8aa99fc71e	Warn on excessive keys for legacy Bloom filter with 32-bit hash (#6317 ) Summary: With many millions of keys, the old Bloom filter implementation for the block-based table (format_version <= 4) would have excessive FP rate due to the limitations of feeding the Bloom filter with a 32-bit hash. This change computes an estimated inflated FP rate due to this effect and warns in the log whenever an SST filter is constructed (almost certainly a "full" not "partitioned" filter) that exceeds 1.5x FP rate due to this effect. The detailed condition is only checked if 3 million keys or more have been added to a filter, as this should be a lower bound for common bits/key settings (< 20). Recommended remedies include smaller SST file size, using format_version >= 5 (for new Bloom filter), or using partitioned filters. This does not change behavior other than generating warnings for some constructed filters using the old implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6317 Test Plan: Example with warning, 15M keys @ 15 bits / key: (working_mem_size_mb is just to stop after building one filter if it's large) $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=15000000 2>&1 \| grep 'FP rate' [WARN] [/block_based/filter_policy.cc:292] Using legacy SST/BBT Bloom filter with excessive key count (15.0M @ 15bpk), causing estimated 1.8x higher filter FP rate. Consider using new Bloom with format_version>=5, smaller SST file size, or partitioned filters. Predicted FP rate %: 0.766702 Average FP rate %: 0.66846 Example without warning (150K keys): $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=150000 2>&1 \| grep 'FP rate' Predicted FP rate %: 0.422857 Average FP rate %: 0.379301 $ With more samples at 15 bits/key: 150K keys -> no warning; actual: 0.379% FP rate (baseline) 1M keys -> no warning; actual: 0.396% FP rate, 1.045x 9M keys -> no warning; actual: 0.563% FP rate, 1.485x 10M keys -> warning (1.5x); actual: 0.564% FP rate, 1.488x 15M keys -> warning (1.8x); actual: 0.668% FP rate, 1.76x 25M keys -> warning (2.4x); actual: 0.880% FP rate, 2.32x At 10 bits/key: 150K keys -> no warning; actual: 1.17% FP rate (baseline) 1M keys -> no warning; actual: 1.16% FP rate 10M keys -> no warning; actual: 1.32% FP rate, 1.13x 25M keys -> no warning; actual: 1.63% FP rate, 1.39x 35M keys -> warning (1.6x); actual: 1.81% FP rate, 1.55x At 5 bits/key: 150K keys -> no warning; actual: 9.32% FP rate (baseline) 25M keys -> no warning; actual: 9.62% FP rate, 1.03x 200M keys -> no warning; actual: 12.2% FP rate, 1.31x 250M keys -> warning (1.5x); actual: 12.8% FP rate, 1.37x 300M keys -> warning (1.6x); actual: 13.4% FP rate, 1.43x The reason for the modest inaccuracy at low bits/key is that the assumption of independence between a collision between 32-hash values feeding the filter and an FP in the filter is not quite true for implementations using "simple" logic to compute indices from the stock hash result. There's math on this in my dissertation, but I don't think it's worth the effort just for these extreme cases (> 100 million keys and low-ish bits/key). Differential Revision: D19471715 Pulled By: pdillinger fbshipit-source-id: f80c96893a09bf1152630ff0b964e5cdd7e35c68	2020-01-20 21:31:47 -08:00
Peter Dillinger	4b86fe1123	Log warning for high bits/key in legacy Bloom filter (#6312 ) Summary: Help users that would benefit most from new Bloom filter implementation by logging a warning that recommends the using format_version >= 5. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6312 Test Plan: $ (for BPK in 10 13 14 19 20 50; do ./filter_bench -quick -impl=0 -bits_per_key=$BPK -m_queries=1 2>&1; done) \| grep 'its/key' Bits/key actual: 10.0647 Bits/key actual: 13.0593 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (14) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 14.0581 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (19) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 19.0542 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 20.0584 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (50) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 50.0577 Differential Revision: D19457191 Pulled By: pdillinger fbshipit-source-id: 073d94cde5c70e03a160f953e1100c15ea83eda4	2020-01-17 19:37:35 -08:00
sdong	338c149b92	crash_test to cover bottommost compression and some other changes (#6215 ) Summary: Several improvements to crash_test/stress_test: (1) Stress_test to support an parameter of bottommost compression (2) Rename those FLAGS_* variables that are not gflags to avoid confusion (3) Crash_test to randomly generate compression type for bottommost compression with half the chance. (4) Stress_test to sanitize unsupported compression type to snappy, so that crash_test to cover all possible compression types and people don't need to worry about they don't support all comrpession types in their environment. (5) In crash_test, when generating db_stress command, sort arguments in alphabeta order, so that it is easier to find value for a specific argument. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6215 Test Plan: Run "make crash_test" for a while and see the botommost option shown in LOG files. Differential Revision: D19171255 fbshipit-source-id: d7001e246c4ff9ee5760776eea0be97738650735	2019-12-20 16:14:52 -08:00
Peter Dillinger	a92bd0a183	Optimize memory and CPU for building new Bloom filter (#6175 ) Summary: The filter bits builder collects all the hashes to add in memory before adding them (because the number of keys is not known until we've walked over all the keys). Existing code uses a std::vector for this, which can mean up to 2x than necessary space allocated (and not freed) and up to ~2x write amplification in memory. Using std::deque uses close to minimal space (for large filters, the only time it matters), no write amplification, frees memory while building, and no need for large contiguous memory area. The only cost is more calls to allocator, which does not appear to matter, at least in benchmark test. For now, this change only applies to the new (format_version=5) Bloom filter implementation, to ease before-and-after comparison downstream. Temporary memory use during build is about the only way the new Bloom filter could regress vs. the old (because of upgrade to 64-bit hash) and that should only matter for full filters. This change should largely mitigate that potential regression. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6175 Test Plan: Using filter_bench with -new_builder option and 6M keys per filter is like large full filter (improvement). 10k keys and no -new_builder is like partitioned filters (about the same). (Corresponding configurations run simultaneously on devserver.) std::vector impl (before) $ /usr/bin/time -v ./filter_bench -impl=2 -quick -new_builder -working_mem_size_mb=1000 - average_keys_per_filter=6000000 Build avg ns/key: 52.2027 Maximum resident set size (kbytes): 1105016 $ /usr/bin/time -v ./filter_bench -impl=2 -quick -working_mem_size_mb=1000 - average_keys_per_filter=10000 Build avg ns/key: 30.5694 Maximum resident set size (kbytes): 1208152 std::deque impl (after) $ /usr/bin/time -v ./filter_bench -impl=2 -quick -new_builder -working_mem_size_mb=1000 - average_keys_per_filter=6000000 Build avg ns/key: 39.0697 Maximum resident set size (kbytes): 1087196 $ /usr/bin/time -v ./filter_bench -impl=2 -quick -working_mem_size_mb=1000 - average_keys_per_filter=10000 Build avg ns/key: 30.9348 Maximum resident set size (kbytes): 1207980 Differential Revision: D19053431 Pulled By: pdillinger fbshipit-source-id: 2888e748723a19d9ea40403934f13cbb8483430c	2019-12-15 21:31:08 -08:00
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	2019-12-13 14:48:41 -08:00
Peter Dillinger	58d46d1915	Add useful idioms to Random API (OneInOpt, PercentTrue) (#6154 ) Summary: And clean up related code, especially in stress test. (More clean up of db_stress_test_base.cc coming after this.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154 Test Plan: make check, make blackbox_crash_test for a bit Differential Revision: D18938180 Pulled By: pdillinger fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e	2019-12-13 14:30:14 -08:00
Maysam Yabandeh	a796c06fef	Fix build breakage from lock_guard error (#6161 ) Summary: This change fixes a source issue that caused compile time error which breaks build for many fbcode services in that setup. The size() member function of channel is a const member, so member variables accessed within it are implicitly const as well. This caused error when clang fails to resolve to a constructor that takes std::mutex because the suitable constructor got rejected due to loss of constness for its argument. The fix is to add mutable modifier to the lock_ member of channel. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6161 Differential Revision: D18967685 Pulled By: maysamyabandeh fbshipit-source-id: 698b6a5153c3c92eeacb842c467aa28cc350d432	2019-12-12 13:50:27 -08:00
sdong	3c347821b7	Fix thread_local_test failure caused by recent io_uring change (#6136 ) Summary: thread_local_test now fails because it asserts no thread local instance is created when the test started. However, right now a thread local instance might be created when creating PosixEnv as a static variable. Fix the test by relaxing the assumption of starting from 0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6136 Test Plan: Find an environment where the test fails, and see it passes with the fix applied. Differential Revision: D18889224 fbshipit-source-id: 7946f3bfea81d236f7bb1554076696705b211b92	2019-12-09 12:03:30 -08:00
John Ericson	c16b087427	Work around weird unused errors with Mingw (#6075 ) Summary: From the reset of the code, it looks this this maybe can be unconditionally given the attribute? But I couldn't test with MSVC so I defensively put under CPP. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6075 Differential Revision: D18723749 fbshipit-source-id: 45fc8732c28dd29aab1644225d68f3c6f39bd69b	2019-11-26 21:42:29 -08:00
Peter Dillinger	ca3b6c28c9	Expose and elaborate FilterBuildingContext (#6088 ) Summary: This change enables custom implementations of FilterPolicy to wrap a variety of NewBloomFilterPolicy and select among them based on contextual information such as table level and compaction style. * Moves FilterBuildingContext to public API and elaborates it with more useful data. (It would be nice to put more general options-like data, but at the time this object is constructed, we are using internal APIs ImmutableCFOptions and MutableCFOptions and don't have easy access to ColumnFamilyOptions that I can tell.) * Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to GetBuilderWithContext, because it's now public. * Plumbs through the table's "level_at_creation" for filter building context. * Simplified some tests by adding GetBuilder() to MockBlockBasedTableTester. * Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including sample wrapper class LevelAndStyleCustomFilterPolicy. * Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits where it does not reset perf context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088 Test Plan: make check, valgrind on db_bloom_filter_test Differential Revision: D18697817 Pulled By: pdillinger fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4	2019-11-26 18:24:10 -08:00
Peter Dillinger	57f3032285	Allow fractional bits/key in BloomFilterPolicy (#6092 ) Summary: There's no technological impediment to allowing the Bloom filter bits/key to be non-integer (fractional/decimal) values, and it provides finer control over the memory vs. accuracy trade-off. This is especially handy in using the format_version=5 Bloom filter in place of the old one, because bits_per_key=9.55 provides the same accuracy as the old bits_per_key=10. This change not only requires refining the logic for choosing the best num_probes for a given bits/key setting, it revealed a flaw in that logic. As bits/key gets higher, the best num_probes for a cache-local Bloom filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a standard Bloom filter. For example, at 16 bits per key, the best num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%). This change fixes and refines that logic (for the format_version=5 Bloom filter only, just in case) based on empirical tests to find accuracy inflection points between each num_probes. Although bits_per_key is now specified as a double, the new Bloom filter converts/rounds this to "millibits / key" for predictable/precise internal computations. Just in case of unforeseen compatibility issues, we round to the nearest whole number bits / key for the legacy Bloom filter, so as not to unlock new behaviors for it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092 Test Plan: unit tests included Differential Revision: D18711313 Pulled By: pdillinger fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a	2019-11-26 15:59:34 -08:00
Levi Tamasi	75dfc7883d	Fix the constness issues around autovector::iterator_impl's dereference operators (#6057 ) Summary: As described in detail in issue https://github.com/facebook/rocksdb/issues/6048, iterators' dereference operators (`*`, `->`, and `[]`) should return `pointer`s/`reference`s (as opposed to `const_pointer`s/`const_reference`s) even if the iterator itself is `const` to be in sync with the standard's iterator concept. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6057 Test Plan: make check Differential Revision: D18623235 Pulled By: ltamasi fbshipit-source-id: 04e82d73bc0c67fb0ded018383af8dfc332050cc	2019-11-22 21:23:00 -08:00
Stephan T. Lavavej	3cd75736a7	Add operator[] to autovector::iterator_impl. (#6047 ) Summary: This is a required operator for random-access iterators, and an upcoming update for Visual Studio 2019 will change the C++ Standard Library's heap algorithms to use this operator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6047 Differential Revision: D18618531 Pulled By: ltamasi fbshipit-source-id: 08d10bc85bf2dbc3f7ef0fa3c777e99f1e927ef5	2019-11-20 11:28:41 -08:00
Peter Dillinger	0306e01233	Fixes for g++ 4.9.2 compatibility (#6053 ) Summary: Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043 Stackoverflow ref on {{}} vs. {}: https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list Note to reader: .clear() does not empty out an ostringstream, but .str("") suffices because we don't have to worry about clearing error flags. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053 Test Plan: make check, manual run of filter_bench Differential Revision: D18602259 Pulled By: pdillinger fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841	2019-11-19 15:43:37 -08:00
Peter Dillinger	f059c7d9b9	New Bloom filter implementation for full and partitioned filters (#6007 ) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only slightly slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f	2019-11-13 16:44:01 -08:00
Peter Dillinger	18f57f5ef8	Add new persistent 64-bit hash (#5984 ) Summary: For upcoming new SST filter implementations, we will use a new 64-bit hash function (XXH3 preview, slightly modified). This change updates hash.{h,cc} for that change, adds unit tests, and out-of-lines the implementations to keep hash.h as clean/small as possible. In developing the unit tests, I discovered that the XXH3 preview always returns zero for the empty string. Zero is problematic for some algorithms (including an upcoming SST filter implementation) if it occurs more often than at the "natural" rate, so it should not be returned from trivial values using trivial seeds. I modified our fork of XXH3 to return a modest hash of the seed for the empty string. With hash function details out-of-lines in hash.h, it makes sense to enable XXH_INLINE_ALL, so that direct calls to XXH64/XXH32/XXH3p are inlined. To fix array-bounds warnings on some inline calls, I injected some casts to uintptr_t in xxhash.cc. (Issue reported to Yann.) Revised: Reverted using XXH_INLINE_ALL for now. Some Facebook checks are unhappy about #include on xxhash.cc file. I would fix that by rename to xxhash_cc.h, but to best preserve history I want to do that in a separate commit (PR) from the uintptr casts. Also updated filter_bench for this change, improving the performance predictability of dry run hashing and adding support for 64-bit hash (for upcoming new SST filter implementations, minor dead code in the tool for now). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5984 Differential Revision: D18246567 Pulled By: pdillinger fbshipit-source-id: 6162fbf6381d63c8cc611dd7ec70e1ddc883fbb8	2019-10-31 16:36:35 -07:00
Peter Dillinger	685e895652	Prepare filter tests for more implementations (#5967 ) Summary: This change sets up for alternate implementations underlying BloomFilterPolicy: * Refactor BloomFilterPolicy and expose in internal .h file so that it's easy to iterate over / select implementations for testing, regardless of what the best public interface will look like. Most notably updated db_bloom_filter_test to use this. * Hide FullFilterBitsBuilder from unit tests (alternate derived classes planned); expose the part important for testing (CalculateSpace), as abstract class BuiltinFilterBitsBuilder. (Also cleaned up internally exposed interface to CalculateSpace.) * Rename BloomTest -> BlockBasedBloomTest for clarity (despite ongoing confusion between block-based table and block-based filter) * Assert that block-based filter construction interface is only used on BloomFilterPolicy appropriately constructed. (A couple of tests updated to add ", true".) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5967 Test Plan: make check Differential Revision: D18138704 Pulled By: pdillinger fbshipit-source-id: 55ef9273423b0696309e251f50b8c1b5e9ec7597	2019-10-31 14:12:33 -07:00
Peter Dillinger	26dc29633e	filter_bench not needed for ROCKSDB_LITE (#5978 ) Summary: filter_bench is a specialized micro-benchmarking tool that should not be needed with ROCKSDB_LITE. This should fix the LITE build. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5978 Test Plan: make LITE=1 check Differential Revision: D18177941 Pulled By: pdillinger fbshipit-source-id: b73a171404661e09e018bc99afcf8d4bf1e2949c	2019-10-28 14:12:36 -07:00
Peter Dillinger	3f891c40a0	More improvements to filter_bench (#5968 ) Summary: * Adds support for plain table filter. This is not critical right now, but does add a -impl flag that will be useful for new filter implementations initially targeted at block-based table (and maybe later ported to plain table) * Better mixing of inside vs. outside queries, for more realism * A -best_case option handy for implementation tuning inner loop * Option for whether to include hashing time in dry run / net timings No modifications to production code, just filter_bench. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5968 Differential Revision: D18139872 Pulled By: pdillinger fbshipit-source-id: 5b09eba963111b48f9e0525a706e9921070990e8	2019-10-25 13:27:07 -07:00
Peter Dillinger	b3dc2f3691	Update xxhash.cc to allow combined compilation (#5969 ) Summary: To fix unity_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/5969 Test Plan: make unity_test Differential Revision: D18140426 Pulled By: pdillinger fbshipit-source-id: d5516e6d665f57e3706b9f9b965b0c458e58ccef	2019-10-25 12:54:41 -07:00
Peter Dillinger	013babc685	Clean up some filter tests and comments (#5960 ) Summary: Some filtering tests were unfriendly to new implementations of FilterBitsBuilder because of dynamic_cast to FullFilterBitsBuilder. Most of those have now been cleaned up, worked around, or at least changed from crash on dynamic_cast failure to individual test failure. Also put some clarifying comments on filter-related APIs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5960 Test Plan: make check Differential Revision: D18121223 Pulled By: pdillinger fbshipit-source-id: e83827d9d5d96315d96f8e25a99cd70f497d802c	2019-10-24 18:48:16 -07:00
Peter Dillinger	ca7ccbe2ea	Misc hashing updates / upgrades (#5909 ) Summary: - Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09). - Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions. - Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range. - Use preview version of XXH3 instead of MurmurHash64A for NPHash64 -- Had to update cache_test to increase probability of passing for any given hash function. - Use fastrange64 instead of % with uses of NPHash64 -- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision. - Set default seed for NPHash64 because specifying a seed rarely makes sense for it. - Removed unnecessary include xxhash.h in a popular .h file - Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated. Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909 Differential Revision: D18125196 Pulled By: pdillinger fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d	2019-10-24 17:16:46 -07:00
Peter Dillinger	ec11eff3bc	FilterPolicy consolidation, part 2/2 (#5966 ) Summary: The parts that are used to implement FilterPolicy / NewBloomFilterPolicy and not used other than for the block-based table should be consolidated under table/block_based/filter_policy*. This change is step 2 of 2: mv util/bloom.cc table/block_based/filter_policy.cc This gets its own PR so that git has the best chance of following the rename for blame purposes. Note that low-level shared implementation details of Bloom filters remain in util/bloom_impl.h, and util/bloom_test.cc remains where it is for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5966 Test Plan: make check Differential Revision: D18124930 Pulled By: pdillinger fbshipit-source-id: 823bc09025b3395f092ef46a46aa5ba92a914d84	2019-10-24 15:44:51 -07:00
Peter Dillinger	dd19014a7a	FilterPolicy consolidation, part 1/2 (#5963 ) Summary: The parts that are used to implement FilterPolicy / NewBloomFilterPolicy and not used other than for the block-based table should be consolidated under table/block_based/filter_policy*. I don't foresee sharing these APIs with e.g. the Plain Table because they don't expose hashes for reuse in indexing. This change is step 1 of 2: (a) mv table/full_filter_bits_builder.h to table/block_based/filter_policy_internal.h which I expect to expand soon to internally reveal more implementation details for testing. (b) consolidate eventual contents of table/block_based/filter_policy.cc in util/bloom.cc, which has the most elaborate revision history (see step 2 ...) Step 2 soon to follow: mv util/bloom.cc table/block_based/filter_policy.cc This gets its own PR so that git has the best chance of following the rename for blame purposes. Note that low-level shared implementation details of Bloom filters are in util/bloom_impl.h. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5963 Test Plan: make check Differential Revision: D18121199 Pulled By: pdillinger fbshipit-source-id: 8f21732c3d8909777e3240e4ac3123d73140326a	2019-10-24 13:20:35 -07:00
Peter Dillinger	2837008525	Vary key size and alignment in filter_bench (#5933 ) Summary: The first version of filter_bench has selectable key size but that size does not vary throughout a test run. This artificially favors "branchy" hash functions like the existing BloomHash, MurmurHash1, probably because of optimal return for branch prediction. This change primarily varies those key sizes from -2 to +2 bytes vs. the average selected size. We also set the default key size at 24 to better reflect our best guess of typical key size. But steadily random key sizes may not be realistic either. So this change introduces a new filter_bench option: -vary_key_size_log2_interval=n where the same key size is used 2^n times and then changes to another size. I've set the default at 5 (32 times same size) as a compromise between deployments with rather consistent vs. rather variable key sizes. On my Skylake system, the performance boost to MurmurHash1 largely lies between n=10 and n=15. Also added -vary_key_alignment (bool, now default=true), though this doesn't currently seem to matter in hash functions under consideration. This change also does a "dry run" for each testing scenario, to improve the accuracy of those numbers, as there was more difference between scenarios than expected. Subtracting gross test run times from dry run times is now also embedded in the output, because these "net" times are generally the most useful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5933 Differential Revision: D18121683 Pulled By: pdillinger fbshipit-source-id: 3c7efee1c5661a5fe43de555e786754ddf80dc1e	2019-10-24 13:08:30 -07:00
Peter Dillinger	6a32e3b562	Remove unused BloomFilterPolicy::hash_func_ (#5961 ) Summary: This is an internal, file-local "feature" that is not used and potentially confusing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5961 Test Plan: make check Differential Revision: D18099018 Pulled By: pdillinger fbshipit-source-id: 7870627eeed09941d12538ec55d10d2e164fc716	2019-10-23 15:47:17 -07:00
Levi Tamasi	29ccf2075c	Store the filter bits reader alongside the filter block contents (#5936 ) Summary: Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that only the filter block contents are stored in the block cache (as opposed to the earlier design where the cache stored the filter block reader itself, leading to potentially dangling pointers and concurrency bugs). However, this change introduced a performance hit since with the new code, the metadata fields are re-parsed upon every access. This patch reunites the block contents with the filter bits reader to eliminate this overhead; since this is still a self-contained pure data object, it is safe to store it in the cache. (Note: this is similar to how the zstd digest is handled.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936 Test Plan: make asan_check filter_bench results for the old code: ``` $ ./filter_bench -quick WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.7153 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 33.4258 Single filter ns/op: 42.5974 Random filter ns/op: 217.861 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.4217 Single filter ns/op: 50.9855 Random filter ns/op: 219.167 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -quick -use_full_block_reader WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.5172 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 32.3556 Single filter ns/op: 83.2239 Random filter ns/op: 370.676 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.2265 Single filter ns/op: 93.5651 Random filter ns/op: 408.393 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) ``` With the new code: ``` $ ./filter_bench -quick WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 25.4285 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 31.0594 Single filter ns/op: 43.8974 Random filter ns/op: 226.075 ---------------------------- Outside queries... Dry run (25d) ns/op: 31.0295 Single filter ns/op: 50.3824 Random filter ns/op: 226.805 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -quick -use_full_block_reader WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.5308 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 33.2968 Single filter ns/op: 58.6163 Random filter ns/op: 291.434 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.1839 Single filter ns/op: 66.9039 Random filter ns/op: 292.828 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) ``` Differential Revision: D17991712 Pulled By: ltamasi fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29	2019-10-18 19:32:59 -07:00
Peter Dillinger	5f8f2fda0e	Refactor / clean up / optimize FullFilterBitsReader (#5941 ) Summary: FullFilterBitsReader, after creating in BloomFilterPolicy, was responsible for decoding metadata bits. This meant that FullFilterBitsReader::MayMatch had some metadata checks in order to implement "always true" or "always false" functionality in the case of inconsistent or trivial metadata. This made for ugly mixing-of-concerns code and probably had some runtime cost. It also didn't really support plugging in alternative filter implementations with extensions to the existing metadata schema. BloomFilterPolicy::GetFilterBitsReader is now (exclusively) responsible for decoding filter metadata bits and constructing appropriate instances deriving from FilterBitsReader. "Always false" and "always true" derived classes allow FullFilterBitsReader not to be concerned with handling of trivial or inconsistent metadata. This also makes for easy expansion to alternative filter implementations in new, alternative derived classes. This change makes calls to FilterBitsReader::MayMatch necessarily virtual because there's now more than one built-in implementation. Compared with the previous implementation's extra 'if' checks in MayMatch, there's no consistent performance difference, measured by (an older revision of) filter_bench (differences here seem to be within noise): Inside queries... - Dry run (407) ns/op: 35.9996 + Dry run (407) ns/op: 35.2034 - Single filter ns/op: 47.5483 + Single filter ns/op: 47.4034 - Batched, prepared ns/op: 43.1559 + Batched, prepared ns/op: 42.2923 ... - Random filter ns/op: 150.697 + Random filter ns/op: 149.403 ---------------------------- Outside queries... - Dry run (980) ns/op: 34.6114 + Dry run (980) ns/op: 34.0405 - Single filter ns/op: 56.8326 + Single filter ns/op: 55.8414 - Batched, prepared ns/op: 48.2346 + Batched, prepared ns/op: 47.5667 - Random filter ns/op: 155.377 + Random filter ns/op: 153.942 Average FP rate %: 1.1386 Also, the FullFilterBitsReader ctor was responsible for a surprising amount of CPU in production, due in part to inefficient determination of the CACHE_LINE_SIZE used to construct the filter being read. The overwhelming common case (same as my CACHE_LINE_SIZE) is now substantially optimized, as shown with filter_bench with -new_reader_every=1 (old option - see below) (repeatable result): Inside queries... - Dry run (453) ns/op: 118.799 + Dry run (453) ns/op: 105.869 - Single filter ns/op: 82.5831 + Single filter ns/op: 74.2509 ... - Random filter ns/op: 224.936 + Random filter ns/op: 194.833 ---------------------------- Outside queries... - Dry run (aa1) ns/op: 118.503 + Dry run (aa1) ns/op: 104.925 - Single filter ns/op: 90.3023 + Single filter ns/op: 83.425 ... - Random filter ns/op: 220.455 + Random filter ns/op: 175.7 Average FP rate %: 1.13886 However PR#5936 has/will reclaim most of this cost. After that PR, the optimization of this code path is likely negligible, but nonetheless it's clear we aren't making performance any worse. Also fixed inadequate check of consistency between filter data size and num_lines. (Unit test updated.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5941 Test Plan: previously added unit tests FullBloomTest.CorruptFilters and FullBloomTest.RawSchema Differential Revision: D18018353 Pulled By: pdillinger fbshipit-source-id: 8e04c2b4a7d93223f49a237fd52ef2483929ed9c	2019-10-18 14:50:52 -07:00
Peter Dillinger	93edd51c4a	bloom_test.cc: include <array> (#5920 ) Summary: Fix build failure on some platforms, reported in issue https://github.com/facebook/rocksdb/issues/5914 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5920 Test Plan: make bloom_test && ./bloom_test Differential Revision: D17918328 Pulled By: pdillinger fbshipit-source-id: b822004d4442de0171db2aeff433677783f7b94e	2019-10-14 15:38:31 -07:00
Vijay Nadimpalli	4c49e38f15	MultiGet batching in memtable (#5818 ) Summary: RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818 Test Plan: Existing tests Performance Test: Ran the below command which fills up the memtable and makes sure there are no flushes and then call multiget. Ran it on master and on the new change and see atleast 1% performance improvement across all the test runs I did. Sometimes the improvement was upto 5%. TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10 Differential Revision: D17578869 Pulled By: vjnadimpalli fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192	2019-10-10 09:39:39 -07:00
Peter Dillinger	90e285efde	Fix some implicit conversions in filter_bench (#5894 ) Summary: Fixed some spots where converting size_t or uint_fast32_t to uint32_t. Wrapped mt19937 in a new Random32 class to avoid future such traps. NB: I tried using Random32::Uniform (std::uniform_int_distribution) in filter_bench instead of fastrange, but that more than doubled the dry run time! So I added fastrange as Random32::Uniformish. ;) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5894 Test Plan: USE_CLANG=1 build, and manual re-run filter_bench Differential Revision: D17825131 Pulled By: pdillinger fbshipit-source-id: 68feee333b5f8193c084ded760e3d6679b405ecd	2019-10-08 19:22:07 -07:00
Peter Dillinger	46ca51d430	filter_bench - a prelim tool for SST filter benchmarking (#5825 ) Summary: Example: using the tool before and after PR https://github.com/facebook/rocksdb/issues/5784 shows that the refactoring, presumed performance-neutral, actually sped up SST filters by about 3% to 8% (repeatable result): Before: - Dry run ns/op: 22.4725 - Single filter ns/op: 51.1078 - Random filter ns/op: 120.133 After: + Dry run ns/op: 22.2301 + Single filter run ns/op: 47.4313 + Random filter ns/op: 115.9 Only tests filters for the block-based table (full filters and partitioned filters - same implementation; not block-based filters), which seems to be the recommended format/implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5825 Differential Revision: D17804987 Pulled By: pdillinger fbshipit-source-id: 0f18a9c254c57f7866030d03e7fa4ba503bac3c5	2019-10-07 20:10:53 -07:00
Peter Dillinger	9f54446525	Fix type in shift operation in bloom_test (#5882 ) Summary: Broken type for shift in PR#5834. Fixing code means fixing expected values in test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5882 Test Plan: thisisthetest Differential Revision: D17746136 Pulled By: pdillinger fbshipit-source-id: d3c456ed30b433d55fcab6fc7d836940fe3b46b8	2019-10-03 13:19:20 -07:00
Peter Dillinger	9e4913ce9d	Add FullBloomTest.CorruptFilters,RawSchema (#5834 ) Summary: There was significant untested logic in FullFilterBitsReader in the handling of serialized Bloom filter bits that cannot be generated by FullFilterBitsBuilder in the current compilation. These now test many of those corner-case behaviors, including bad metadata or filters created with different cache line size than the current compiled-in value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5834 Test Plan: thisisthetest Differential Revision: D17726372 Pulled By: pdillinger fbshipit-source-id: fb7b8003b5a8e6fb4666fe95206128f3d5835fc7	2019-10-02 15:33:48 -07:00
sdong	e8263dbdaa	Apply formatter to recent 200+ commits. (#5830 ) Summary: Further apply formatter to more recent commits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830 Test Plan: Run all existing tests. Differential Revision: D17488031 fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab	2019-09-20 12:04:26 -07:00
sdong	c06b54d0c6	Apply formatter on recent 45 commits. (#5827 ) Summary: Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827 Test Plan: Run all existing tests. Differential Revision: D17483727 fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f	2019-09-19 12:34:17 -07:00
Levi Tamasi	2cbb61eadb	Make clang-analyzer happy (#5821 ) Summary: clang-analyzer has uncovered a bunch of places where the code is relying on pointers being valid and one case (in VectorIterator) where a moved-from object is being used: In file included from db/range_tombstone_fragmenter.cc:17: ./util/vector_iterator.h:23:18: warning: Method called on moved-from object 'keys' of type 'std::vector' current_(keys.size()) { ^~~~~~~~~~~ 1 warning generated. utilities/persistent_cache/block_cache_tier_file.cc:39:14: warning: Called C++ object pointer is null Status s = env->NewRandomAccessFile(filepath, file, opt); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ utilities/persistent_cache/block_cache_tier_file.cc:47:19: warning: Called C++ object pointer is null Status status = env_->GetFileSize(Path(), size); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ utilities/persistent_cache/block_cache_tier_file.cc:290:14: warning: Called C++ object pointer is null Status s = env_->FileExists(Path()); ^~~~~~~~~~~~~~~~~~~~~~~~ utilities/persistent_cache/block_cache_tier_file.cc:363:35: warning: Called C++ object pointer is null CacheWriteBuffer* const buf = alloc_->Allocate(); ^~~~~~~~~~~~~~~~~~ utilities/persistent_cache/block_cache_tier_file.cc:399:41: warning: Called C++ object pointer is null const uint64_t file_off = buf_doff_ * alloc_->BufferSize(); ^~~~~~~~~~~~~~~~~~~~ utilities/persistent_cache/block_cache_tier_file.cc:463:33: warning: Called C++ object pointer is null size_t start_idx = lba.off_ / alloc_->BufferSize(); ^~~~~~~~~~~~~~~~~~~~ utilities/persistent_cache/block_cache_tier_file.cc:515:5: warning: Called C++ object pointer is null alloc_->Deallocate(bufs_[i]); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ 7 warnings generated. ar: creating librocksdb_debug.a utilities/memory/memory_test.cc:68:25: warning: Called C++ object pointer is null cache_set->insert(db->GetDBOptions().row_cache.get()); ^~~~~~~~~~~~~~~~~~ 1 warning generated. The patch fixes these by adding assertions and explicitly passing in zero when initializing VectorIterator::current_ (which preserves the existing behavior). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5821 Test Plan: Ran make check and make analyze to make sure the warnings have disappeared. Differential Revision: D17455949 Pulled By: ltamasi fbshipit-source-id: 363619618ea649a0674287f9f3b3393e390571ee	2019-09-18 15:25:48 -07:00
风	2389aa2da9	Remove unneeded unlock statement (#5809 ) Summary: The dtor will automatically do unlock Pull Request resolved: https://github.com/facebook/rocksdb/pull/5809 Differential Revision: D17453694 Pulled By: ltamasi fbshipit-source-id: 5348bff8e6a620a05ff639a5454e8d82ae98a22d	2019-09-18 14:26:37 -07:00
andrew	622683000c	Allow users to stop manual compactions (#3971 ) Summary: Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971 Test Plan: add tests at more spots. Differential Revision: D17369043 fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c	2019-09-16 21:01:47 -07:00
Peter Dillinger	68626249c3	Refactor/consolidate legacy Bloom implementation details (#5784 ) Summary: Refactoring to consolidate implementation details of legacy Bloom filters. This helps to organize and document some related, obscure code. Also added make/cpp var TEST_CACHE_LINE_SIZE so that it's easy to compile and run unit tests for non-native cache line size. (Fixed a related test failure in db_properties_test.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5784 Test Plan: make check, including Recently added Bloom schema unit tests (in ./plain_table_db_test && ./bloom_test), and including with TEST_CACHE_LINE_SIZE=128U and TEST_CACHE_LINE_SIZE=256U. Tested the schema tests with temporary fault injection into new implementations. Some performance testing with modified unit tests suggest a small to moderate improvement in speed. Differential Revision: D17381384 Pulled By: pdillinger fbshipit-source-id: ee42586da996798910fc45ac0b6289147f16d8df	2019-09-16 16:17:09 -07:00
Peter Dillinger	d3a6726f02	Revert changes from PR#5784 accidentally in PR#5780 (#5810 ) Summary: This will allow us to fix history by having the code changes for PR#5784 properly attributed to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5810 Differential Revision: D17400231 Pulled By: pdillinger fbshipit-source-id: 2da8b1cdf2533cfedb35b5526eadefb38c291f09	2019-09-16 11:38:53 -07:00
sdong	b931f84e56	Divide file_reader_writer.h and .cc (#5803 ) Summary: file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/ Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803 Test Plan: Build whole project using make and cmake. Differential Revision: D17374550 fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987	2019-09-16 10:33:51 -07:00
Peter Dillinger	915d72d849	Improve accuracy testing for DynamicBloom (#5805 ) Summary: DynamicBloom unit test now tests non-sequential as well as sequential keys in testing FP rates. Also now verifies larger structures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5805 Test Plan: thisisthetest Differential Revision: D17398109 Pulled By: pdillinger fbshipit-source-id: 374074206c76d242efa378afc27830448a0e892a	2019-09-16 09:37:42 -07:00
Peter Dillinger	aa2486b23c	Refactor some confusing logic in PlainTableReader Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5780 Test Plan: existing plain table unit test Differential Revision: D17368629 Pulled By: pdillinger fbshipit-source-id: f25409cdc2f39ebe8d5cbb599cf820270e6b5d26	2019-09-13 10:26:36 -07:00
HouBingjian	a378a4c2ac	arm64 crc prefetch optimise (#5773 ) Summary: prefetch data for following block，avoid cache miss when doing crc caculate I do performance test at kunpeng-920 server(arm-v8, 64core@2.6GHz) ./db_bench --benchmarks=crc32c --block_size=500000000 before optimise : 587313.500 micros/op 1 ops/sec; 811.9 MB/s (500000000 per op) after optimise : 289248.500 micros/op 3 ops/sec; 1648.5 MB/s (500000000 per op) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5773 Differential Revision: D17347339 fbshipit-source-id: bfcd74f0f0eb4b322b959be68019ddcaae1e3341	2019-09-12 16:59:44 -07:00
Shylock Hg	9eb3e1f77d	Use delete to disable automatic generated methods. (#5009 ) Summary: Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009 Differential Revision: D17288733 fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3	2019-09-11 18:09:00 -07:00
Peter Dillinger	7af6ced14b	Fix block allocation bug in new DynamicBloom (#5783 ) Summary: Bug found by valgrind. New DynamicBloom wasn't allocating in block sizes. New assertion added that probes starting in final word would be in bounds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5783 Test Plan: ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 valgrind --leak-check=full ./dynamic_bloom_test Differential Revision: D17270623 Pulled By: pdillinger fbshipit-source-id: 1e0407504b875133a771383cd488c70f91be2b87	2019-09-09 15:26:43 -07:00
Peter Dillinger	108c619acb	Add regression test for serialized Bloom filters (#5778 ) Summary: Check that we don't accidentally change the on-disk format of existing Bloom filter implementations, including for various CACHE_LINE_SIZE (by changing temporarily). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5778 Test Plan: thisisthetest Differential Revision: D17269630 Pulled By: pdillinger fbshipit-source-id: c77017662f010a77603b7d475892b1f0d5563d8b	2019-09-09 14:51:30 -07:00
Wilfried Goesgens	fbab9913e2	upgrade gtest 1.7.0 => 1.8.1 for json result writing Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332 Differential Revision: D17242232 fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64	2019-09-09 11:24:11 -07:00
HouBingjian	ac97e6930f	bloom test check fail on arm (#5745 ) Summary: FullFilterBitsBuilder::CalculateSpace use CACHE_LINE_SIZE which is 64@X86 but 128@ARM64 when it run bloom_test.FullVaryingLengths it failed on ARM64 server, the assert can be fixed by change 128->CACHE_LINE_SIZE2 as merged ASSERT_LE(FilterSize(), (size_t)((length 10 / 8) + CACHE_LINE_SIZE * 2 + 5)) << length; run bloom_test before fix: /root/rocksdb-master/util/bloom_test.cc:281: Failure Expected: (FilterSize()) <= ((size_t)((length * 10 / 8) + 128 + 5)), actual: 389 vs 383 200 [ FAILED ] FullBloomTest.FullVaryingLengths (32 ms) [----------] 4 tests from FullBloomTest (32 ms total) [----------] Global test environment tear-down [==========] 7 tests from 2 test cases ran. (116 ms total) [ PASSED ] 6 tests. [ FAILED ] 1 test, listed below: [ FAILED ] FullBloomTest.FullVaryingLengths after fix: Filters: 37 good, 0 mediocre [ OK ] FullBloomTest.FullVaryingLengths (90 ms) [----------] 4 tests from FullBloomTest (90 ms total) [----------] Global test environment tear-down [==========] 7 tests from 2 test cases ran. (174 ms total) [ PASSED ] 7 tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5745 Differential Revision: D17076047 fbshipit-source-id: e7beb5d55d4855fceb2b84bc8119a6b0759de635	2019-09-05 17:03:24 -07:00
Peter Dillinger	b55b2f45d0	Faster new DynamicBloom implementation (for memtable) (#5762 ) Summary: Since DynamicBloom is now only used in-memory, we're free to change it without schema compatibility issues. The new implementation is drawn from (with manifest permission) `303542a767/bloom_simulation_tests/foo.cc (L613)` This has several speed advantages over the prior implementation: * Uses fastrange instead of % * Minimum logic to determine first (and all) probed memory addresses * (Major) Two probes per 64-bit memory fetch/write. * Very fast and effective (murmur-like) hash expansion/re-mixing. (At least on recent CPUs, integer multiplication is very cheap.) While a Bloom filter with 512-bit cache locality has about a 1.15x FP rate penalty (e.g. 0.84% to 0.97%), further restricting to two probes per 64 bits incurs an additional 1.12x FP rate penalty (e.g. 0.97% to 1.09%). Nevertheless, the unit tests show no "mediocre" FP rate samples, unlike the old implementation with more erratic FP rates. Especially for the memtable, we expect speed to outweigh somewhat higher FP rates. For example, a negative table query would have to be 1000x slower than a BF query to justify doubling BF query time to shave 10% off FP rate (working assumption around 1% FP rate). While that seems likely for SSTs, my data suggests a speed factor of roughly 50x for the memtable (vs. BF; ~1.5% lower write throughput when enabling memtable Bloom filter, after this change). Thus, it's probably not worth even 5% more time in the Bloom filter to shave off 1/10th of the Bloom FP rate, or 0.1% in absolute terms, and it's probably at least 20% slower to recoup that much FP rate from this new implementation. Because of this, we do not see a need for a 'locality' option that affects the MemTable Bloom filter and have decoupled the MemTable Bloom filter from Options::bloom_locality. Note that just 3% more memory to the Bloom filter (10.3 bits per key vs. just 10) is able to make up for the ~12% FP rate drop in the new implementation: [] # Nearly "ideal" FP-wise but reasonably fast cache-local implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out 10000000 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out time: 3.29372 sampled_fp_rate: 0.00985956 ... [] # Close match to this new implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out 10000000 6 10.3 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.10072 sampled_fp_rate: 0.00985655 ... [] # Old locality=1 implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out 10000000 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out time: 3.95472 sampled_fp_rate: 0.00988943 ... Also note the dramatic speed improvement vs. alternatives. -- Performance unit test: DynamicBloomTest.concurrent_with_perf is updated to report more precise timing data. (Measure running time of each thread, not just longest running thread, etc.) Results averaged over various sizes enabled with --enable_perf and 20 runs each; old dynamic bloom refers to locality=1, the faster of the old: old dynamic bloom, avg add latency = 65.6468 new dynamic bloom, avg add latency = 44.3809 old dynamic bloom, avg query latency = 50.6485 new dynamic bloom, avg query latency = 43.2186 old avg parallel add latency = 41.678 new avg parallel add latency = 24.5238 old avg parallel hit latency = 14.6322 new avg parallel hit latency = 12.3939 old avg parallel miss latency = 16.7289 new avg parallel miss latency = 12.2134 Tested on a dedicated 64-bit production machine at Facebook. Significant improvement all around. Despite now using std::atomic<uint64_t>, quick before-and-after test on a 32-bit machine (Intel Atom N270, released 2008) shows no regression in performance, in some cases modest improvement. -- Performance integration test (synthetic): with DEBUG_LEVEL=0, used TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom,readmissing,readrandom,stats --num=2000000 and optionally with -memtable_whole_key_filtering -memtable_bloom_size_ratio=0.01 300 runs each configuration. Write throughput change by enabling memtable bloom: Old locality=0: -3.06% Old locality=1: -2.37% New: -1.50% conclusion -> seems to substantially close the gap Readmissing throughput change by enabling memtable bloom: Old locality=0: +34.47% Old locality=1: +34.80% New: +33.25% conclusion -> maybe a small new penalty from FP rate Readrandom throughput change by enabling memtable bloom: Old locality=0: +31.54% Old locality=1: +31.13% New: +30.60% conclusion -> maybe also from FP rate (after memtable flush) -- Another conclusion we can draw from this new implementation is that the existing 32-bit hash function is not inherently crippling the Bloom filter speed or accuracy, below about 5 million keys. For speed, the implementation is essentially the same whether starting with 32-bits or 64-bits of hash; it just determines whether the first multiplication after fastrange is a pseudorandom expansion or needed re-mix. Note that this multiplication can occur while memory is fetching. For accuracy, in a standard configuration, you need about 5 million keys before you have about a 1.1x FP penalty due to using a 32-bit hash vs. 64-bit: [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.52069 sampled_fp_rate: 0.0118267 ... [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out time: 2.43871 sampled_fp_rate: 0.0109059 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5762 Differential Revision: D17214194 Pulled By: pdillinger fbshipit-source-id: ad9da031772e985fd6b62a0e1db8e81892520595	2019-09-05 14:59:25 -07:00

... 2 3 4 5 6 ...

2100 Commits