rocksdb

Commit Graph

Author	SHA1	Message	Date
Yu Zhang	f411c8bc97	Update folly Github hash (#13017 ) Summary: The internal codebase is updated for the coro directory's graduation from experimental. Updating our build script for a newer version with this change too. Using this hash: `03041f014b` Pull Request resolved: https://github.com/facebook/rocksdb/pull/13017 Reviewed By: nickbrekhus Differential Revision: D62763932 Pulled By: jowlyzhang fbshipit-source-id: 1b211707fbc7d974d6d6ceaf577e174424bb44ed	2024-09-17 17:47:10 -07:00
Peter Dillinger	4577b672d5	More valgrind fixes (#12990 ) Summary: * https://github.com/facebook/rocksdb/issues/12936 was insufficient to fix the std::optional false positives. Making a fix validated in CI this time (see https://github.com/facebook/rocksdb/issues/12991) * valgrind grinds to a halt on startup on my dev machine apparently because it expects internet access. Disable its attempts to access the internet when git is using a proxy. * Move PORTABLE=1 from CI job to the Makefile. Without it, valgrind complains about illegal instructions (too new) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12990 Test Plan: manual, watch nightly valgrind job Reviewed By: ltamasi Differential Revision: D62203242 Pulled By: pdillinger fbshipit-source-id: a611b08da7dbd173b0709ed7feb0578729553a17	2024-09-06 10:11:34 -07:00
Radek Hubner	b6c3495a71	Update snappy dependency for Java releases. (#12207 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12207 Reviewed By: hx235 Differential Revision: D59299915 Pulled By: cbi42 fbshipit-source-id: 3f5fa88b0c5e8366a08734f99db1d3de942cd60b	2024-07-05 09:30:28 -07:00
Andrew Kryczka	13549817af	Update pinned folly version (#12801 ) Summary: `843fd57679` fixed the URL for libsodium. Updated folly version to latest, which includes that commit. I am not sure the URL will be stable, but it still seems better than substituting the URL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12801 Reviewed By: cbi42 Differential Revision: D58921033 Pulled By: ajkr fbshipit-source-id: 442ea3ff83ced2679ea9bfd04945e9449ce2ff96	2024-06-24 10:46:29 -07:00
Andrew Kryczka	40944cbbdb	Fix folly build (#12795 ) Summary: - Updated pinned folly version to the latest - gcc/g++ 10 is required since https://github.com/facebook/folly/commit/2c1c617e9e so we had to modify the tests using gcc/g++ 7 - libsodium 1.0.17 is no longer downloadable from GitHub so I found it elsewhere. I will submit a PR for that upstream to folly - USE_FOLLY_LITE changes - added boost header dependency instead of commenting out the `#include`s since that approach stopped working - added "folly/lang/Exception.cpp" to the compilation Pull Request resolved: https://github.com/facebook/rocksdb/pull/12795 Reviewed By: hx235 Differential Revision: D58916693 Pulled By: ajkr fbshipit-source-id: b5f9bca2d929825846ac898b785972b071db62b1	2024-06-22 15:15:02 -07:00
Yu Zhang	c73cf7a878	Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760 ) Summary: This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier. A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following: 1) Setting the ratio to be equal to or smaller than 0 disables this collector 2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met. 3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction. For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D58734976 Pulled By: jowlyzhang fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81	2024-06-18 10:51:29 -07:00
Peter Dillinger	d6979bda40	Verify public headers do not reference internal ones (#12774 ) Summary: This is not currently caught by our public CI so adding a form of this check to `make check-headers`, which is part of the build-linux-unity-and-headers GHA job. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12774 Test Plan: manually added a violation, which was caught. Also caught an existing trivial violation (fixed). CI will verify it plays nice with GHA. Reviewed By: jowlyzhang Differential Revision: D58616601 Pulled By: pdillinger fbshipit-source-id: e656ce82709660c088a3d3a5e41dd07655cb40e0	2024-06-14 16:32:28 -07:00
anand76	d8fb849b7e	Basic RocksDB follower implementation (#12540 ) Summary: A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways - 1. It has its own directory with links to the leader's database 2. Periodically refreshes itself 3. (Future) Snapshot support 4. (Future) Garbage collection of obsolete links 5. (Long term) Memtable replication There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive. The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files. TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540 Reviewed By: jowlyzhang Differential Revision: D56315718 Pulled By: anand1976 fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b	2024-04-19 19:13:31 -07:00
Yu Zhang	74d419be4d	Add support in SstFileReader to get a raw table iterator (#12385 ) Summary: This PR adds support to programmatically iterate a raw table file with an iterator returned by `SstFileReader::NewTableIterator`. For third party tools to use to observe SST files created by RocksDB. The original feature request was from this merge request: https://github.com/facebook/rocksdb/pull/12370 Since keys returned by raw table iterators are internal keys, this PR also adds a struct `ParsedEntryInfo` and util method `ParseEntry` to support user to parse internal key. `GetInternalKeyForSeek`, and `GetInternalKeyForSeekForPrev` to support users to create internal keys for seek operations with this raw table iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12385 Test Plan: Added unit tests Reviewed By: cbi42 Differential Revision: D55662855 Pulled By: jowlyzhang fbshipit-source-id: 0716a173ee95924fbd4e1f9b6cccf06525c40049	2024-04-02 21:23:06 -07:00
Jay Huh	3412195367	Introduce MultiCfIterator (#12153 ) Summary: This PR introduces a new implementation of `Iterator` via a new public API called `NewMultiCfIterator()`. The new API takes a vector of column family handles to build a cross-column-family iterator, which internally maintains multiple `DBIter`s as child iterators from a consistent database state. When a key exists in multiple column families, the iterator selects the value (and wide columns) from the first column family containing the key, following the order provided in the `column_families` parameter. Similar to the merging iterator, a min heap is used to iterate across the child iterators. Backward iteration and direction change functionalities will be implemented in future PRs. The comparator used to compare keys across different column families will be derived from the iterator of the first column family specified in `column_families`. This comparator will be checked against the comparators from all other column families that the iterator will traverse. If there's a mismatch with any of the comparators, the initialization of the iterator will fail. Please note that this PR is not enough for users to start using `MultiCfIterator`. The `MultiCfIterator` and related APIs are still marked as "DO NOT USE - UNDER CONSTRUCTION". This PR is just the first of many PRs that will follow soon. This PR includes the following: - Introduction and partial implementation of the `MultiCfIterator`, which implements the generic `Iterator` interface. The implementation includes the construction of the iterator, `SeekToFirst()`, `Next()`, `Valid()`, `key()`, `value()`, and `columns()`. - Unit tests to verify iteration across multiple column families in two distinct scenarios: (1) keys are unique across all column families, and (2) the same keys exist in multiple column families. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12153 Reviewed By: pdillinger Differential Revision: D52308697 Pulled By: jaykorean fbshipit-source-id: b03e69f13b40af5a8f0598d0f43a0bec01ef8294	2024-03-05 10:22:43 -08:00
Adam Retter	5458eda5f0	Pass build parallelism flag to Docker builds (#12392 ) Summary: Passed the `-j` flag through to builds happening inside Docker containers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12392 Reviewed By: cbi42 Differential Revision: D54311937 Pulled By: ajkr fbshipit-source-id: 5cf1bfe4b9059cc2d078fb5331812f32cf9e89ab	2024-02-28 12:51:00 -08:00
Adam Retter	055b21ab11	Update ZLib to 1.3.1 (#12358 ) Summary: pdillinger This fixes the RocksJava build, is also needed in the 8.10.fb and 8.11.fb branches please? Pull Request resolved: https://github.com/facebook/rocksdb/pull/12358 Reviewed By: jaykorean Differential Revision: D53859743 Pulled By: pdillinger fbshipit-source-id: b8417fccfee931591805f9aecdfae7c086fee708	2024-02-16 10:26:32 -08:00
Peter Dillinger	54cb9c77d9	Prefer static_cast in place of most reinterpret_cast (#12308 ) Summary: The following are risks associated with pointer-to-pointer reinterpret_cast: * Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do. * Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally. I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement: * Going through `void` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived: public Base1, public Base2`. If we have `Derived` -> `void` -> `Base2` -> `Derived` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic. * Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived` -> `Base2` -> `Derived` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance. With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain. A couple of related interventions included here: Previously Cache::Handle was not actually derived from in the implementations and just used as a `void` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle. Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse). Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work. I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308 Test Plan: existing tests, CI Reviewed By: ltamasi Differential Revision: D53204947 Pulled By: pdillinger fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2	2024-02-07 10:44:11 -08:00
git-hulk	7f2c59e316	Fix gcc12 build failure caused by INT_MIN in NumberToHumanString (#12215 ) Summary: This closes https://github.com/facebook/rocksdb/issues/11619 and adds the test case for this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12215 Reviewed By: hx235 Differential Revision: D52629313 Pulled By: ajkr fbshipit-source-id: 86b51728d98cf6d9a642cd5993c55190aa7fe12b	2024-01-10 10:17:31 -08:00
Ludovic Henry	5502f06729	Add support for linux-riscv64 (#12139 ) Summary: Following https://github.com/evolvedbinary/docker-rocksjava/pull/2, we can now build rocksdb on riscv64. I've verified this works as expected with `make rocksdbjavastaticdockerriscv64`. Also fixes https://github.com/facebook/rocksdb/issues/10500 https://github.com/facebook/rocksdb/issues/11994 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12139 Reviewed By: jaykorean Differential Revision: D52128098 Pulled By: akankshamahajan15 fbshipit-source-id: 706d36a3f8a9e990b76f426bc450937a0cd1a537	2023-12-14 11:27:17 -08:00
Peter Dillinger	c74531b1d2	Fix a nuisance compiler warning from clang (#12144 ) Summary: Example: ``` cache/clock_cache.cc:56:7: error: fallthrough annotation in unreachable code [-Werror,-Wimplicit-fallthrough] FALLTHROUGH_INTENDED; ^ ./port/lang.h:10:30: note: expanded from macro 'FALLTHROUGH_INTENDED' ^ ``` In clang < 14, this is annoyingly generated from -Wimplicit-fallthrough, but was changed to -Wunreachable-code-fallthrough (implied by -Wunreachable-code) in clang 14. See https://reviews.llvm.org/D107933 for how this nuisance pattern generated false positives similar to ours in the Linux kernel. Just to underscore the ridiculousness of this warning, here an error is reported on the annotation, not the call to do_something(), depending on the constexpr value (https://godbolt.org/z/EvxqdPTdr): ``` #include <atomic> void do_something(); void test(int v) { switch (v) { case 1: if constexpr (std::atomic<long>::is_always_lock_free) { return; } else { do_something(); [[fallthrough]]; } case 2: return; } } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12144 Test Plan: Added the warning to our Makefile for USE_CLANG, which reproduced the warning-as-error as shown above, but is now fixed. Reviewed By: jaykorean Differential Revision: D52139615 Pulled By: pdillinger fbshipit-source-id: ba967ae700c0916d1a478bc465cf917633e337d9	2023-12-13 15:58:46 -08:00
Guozhang Wu	c06309c832	Not to print unnecessary commands in Makefile (#11978 ) Summary: When I run `make check`, there is a command that should not be printed to screen, which is shown below. ```text ... ... Generating parallel test scripts for util_merge_operators_test Generating parallel test scripts for write_batch_with_index_test make[2]: Leaving directory '/home/z/rocksdb' make[1]: Leaving directory '/home/z/rocksdb' GEN check make[1]: Entering directory '/home/z/rocksdb' $DEBUG_LEVEL is 1, $LIB_MODE is shared Makefile:185: Warning: Compiling in debug mode. Don't use the resulting binary in production printf '%s\n' '' \ 'To monitor subtest <duration,pass/fail,name>,' \ ' run "make watch-log" in a separate window' ''; \ { \ printf './%s\n' db_bloom_filter_test deletefile_test env_test c_test; \ find t -name 'run-' -print; \ } \ \| perl -pe 's,(^.MySQLStyleTransactionTest.$\|^.SnapshotConcurrentAccessTest.$\|^.SeqAdvanceConcurrentTest.$\|^t/run-table_test-HarnessTest.Randomized$\|^t/run-db_test-.(?:FileCreationRandomFailure\|EncodeDecompressedBlockSizeTest)$\|^.RecoverFromCorruptedWALWithoutFlush$),100 $1,' \| sort -k1,1gr \| sed 's/^[.0-9] //' \ \| grep -E '.' \ \| grep -E -v '"^$"' \ \| build_tools/gnu_parallel -j100% --plain --joblog=LOG --eta --gnu \ --tmpdir=/dev/shm/rocksdb.6lop '{} >& t/log-{/} \|\| bash -c "cat t/log-{/}; exit $?"' ; \ parallel_retcode=$? ; \ awk '{ if ($7 != 0 \|\| $8 != 0) { if ($7 == "Exitval") { h = $0; } else { if (!f) print h; print; f = 1 } } } END { if(f) exit 1; }' < LOG ; \ awk_retcode=$?; \ if [ $parallel_retcode -ne 0 ] \|\| [ $awk_retcode -ne 0 ] ; then exit 1 ; fi To monitor subtest <duration,pass/fail,name>, run "make watch-log" in a separate window Computers / CPU cores / Max jobs to run 1:local / 16 / 16 ``` The `printf` command will make the output confusing. It would be better not to print it. Before Change ![image](https://github.com/facebook/rocksdb/assets/30565051/92cf681a-40b7-462e-ae5b-23eeacbb8f82) After Change ![image](https://github.com/facebook/rocksdb/assets/30565051/4a70b04b-e4ef-4bed-9ce0-d942ed9d132e) Test Plan Not applicable. This is a trivial change, only to add a `@` before a Makefile command, and it will not impact any workflows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11978 Reviewed By: jaykorean Differential Revision: D51076606 Pulled By: cbi42 fbshipit-source-id: dc079ab8f60a5a5b9d04a83888884657b2e442ff	2023-11-07 11:44:20 -08:00
Adam Retter	e0c45c15a7	Fix the ZStd checksum (#12005 ) Summary: Somehow we had the wrong checksum when validating the ZStd 1.5.5 download for RocksJava in the previous Pull Request - https://github.com/facebook/rocksdb/pull/9304. This PR fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12005 Reviewed By: jaykorean Differential Revision: D50840338 Pulled By: cbi42 fbshipit-source-id: 8a92779d3bef013d812eecb89aaaf33fc73991ec	2023-10-31 12:23:34 -07:00
Adam Retter	d7567d5eee	Update libs for RocksJava Static build (#9304 ) Summary: Updates ZStd and Snappy to the latest versions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9304 Reviewed By: ajkr Differential Revision: D33176708 Pulled By: cbi42 fbshipit-source-id: eb50db50557c433e19fcc7c2874329d1d6cba93f	2023-10-20 10:38:27 -07:00
Alan Paxton	2296c624fa	Perform java static checks in CI (#11221 ) Summary: Integrate pmd on the Java API to catch and report common Java coding problems; fix or suppress a basic set of PMD checks. Link pmd into java build / CI Add a pmd dependency to maven Add a jpmd target to Makefile which runs pmd Add a workflow to Circle CI which runs pmd Configure an initial default pmd for CI Repair lots of very simple PMD reports generated when we apply pmd-rules.xml Repair or exception for PMD rules in the CloseResource category, which finds unclosed AutoCloseable resources. We special-case the configuration of CloseResource and use the // NOPMD comment in source the avoid reports where we are the API creating an AutoCloseable, and therefore returning an unclosed resource is correct behaviour. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11221 Reviewed By: akankshamahajan15 Differential Revision: D50369930 Pulled By: jowlyzhang fbshipit-source-id: a41c36b44b3bab7644df3e9cc16afbdf33b84f6b	2023-10-17 10:04:35 -07:00
anand76	269478ee46	Support compressed and local flash secondary cache stacking (#11812 ) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b	2023-09-21 20:30:53 -07:00
Jay Huh	ea9a5b2914	Wide Column support in ldb (#11754 ) Summary: wide_columns can now be pretty-printed in the following commands - `./ldb dump_wal` - `./ldb dump` - `./ldb idump` - `./ldb dump_live_files` - `./ldb scan` - `./sst_dump --command=scan` There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11754 Test Plan: New Tests added - `WideColumnsHelperTest::DumpWideColumns` - `WideColumnsHelperTest::DumpSliceAsWideColumns` Changes added to existing tests - `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in https://github.com/facebook/rocksdb/issues/11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool. - `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST dump_wal ``` ./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header ``` ``` Sequence,Count,ByteSize,Physical Offset,Key(s) : value 1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A ``` idump ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump ``` ``` 'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:2, type:22 => attr_one:two attr_three:four 'third' seq:3, type:1 => baz Internal keys in range: 3 ``` SST Dump from dump_live_files ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files ``` ``` ... ============================== SST Files ============================== /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1 ------------------------------ Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst Sst file format: block-based 'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:0, type:22 => attr_one:two attr_three:four 'third' seq:0, type:1 => baz ... ``` dump ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump ``` ``` first ==> :hello attr_name1:foo attr_name2:bar second ==> attr_one:two attr_three:four third ==> baz Keys in range: 3 ``` scan ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan ``` ``` first : :hello attr_name1:foo attr_name2:bar second : attr_one:two attr_three:four third : baz ``` sst_dump ``` ./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan ``` ``` options.env is 0x7ff54b296000 Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst Sst file format: block-based from [] to [] 'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:0, type:22 => attr_one:two attr_three:four 'third' seq:0, type:1 => baz ``` Reviewed By: ltamasi Differential Revision: D48837999 Pulled By: jaykorean fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f	2023-08-30 12:45:52 -07:00
nikoPLP	17b33c8b2f	fix CXX not initialized early enough in Makefile on openbsd + platform version 10.14 on macos (#11675 ) Summary: fixes https://github.com/facebook/rocksdb/issues/11220 fixes https://github.com/facebook/rocksdb/issues/11594 CXX is not initialized early enough in Makefile. On OpenBSD its value is `g++` at first, and this results in several `command not found`, notably during the tests for HAVE_POWER8 and HAS_ALTIVEC which results in the build problem mentionned in https://github.com/facebook/rocksdb/issues/11594 reordering the Makefile fixes the issue, by placing the creation of make_config.mk and its import before any use of `$(CXX)` Also, fixes the platofrm version for macos. it must be 10.14 now that rocksdb is using the C++17 standard Pull Request resolved: https://github.com/facebook/rocksdb/pull/11675 Reviewed By: cbi42 Differential Revision: D48101615 Pulled By: ajkr fbshipit-source-id: 1f1b4d4604480b31675140b92c6fe97dc55b8c75	2023-08-11 10:59:49 -07:00
Yu Zhang	11ebddb1d4	Add utils to use for handling user defined timestamp size record in WAL (#11451 ) Summary: Add a util method `HandleWriteBatchTimestampSizeDifference` to handle a `WriteBatch` read from WAL log when user-defined timestamp size record is written and read. Two check modes are added: `kVerifyConsistency` that just verifies the recorded timestamp size are consistent with the running ones. This mode is to be used by `db_impl_secondary` for opening a DB as secondary instance. It will also be used by `db_impl_open` before the user comparator switch support is added to make a column switch between enabling/disable UDT feature. The other mode `kReconcileInconsistency` will be used by `db_impl_open` later when user comparator can be changed. Another change is to extract a method `CollectColumnFamilyIdsFromWriteBatch` in db_secondary_impl.h into its standalone util file so it can be shared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11451 Test Plan: ``` make check ./udt_util_test ``` Reviewed By: ltamasi Differential Revision: D45894386 Pulled By: jowlyzhang fbshipit-source-id: b96790777f154cddab6d45d9ba2e5d20ebc6fe9d	2023-05-22 14:28:58 -07:00
mayue.fight	8d8eb0e77e	Support Clip DB to KeyRange (#11379 ) Summary: This PR is part of the request https://github.com/facebook/rocksdb/issues/11317. (Another part is https://github.com/facebook/rocksdb/pull/11378) ClipDB() will clip the entries in the CF according to the range [begin_key, end_key). All the entries outside this range will be completely deleted (including tombstones). This feature is mainly used to ensure that there is no overlapping Key when calling CreateColumnFamilyWithImports() to import multiple CFs. When Calling ClipDB [begin, end), there are the following steps 1. Quickly and directly delete files without overlap DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr) 2. Delete the Key outside the range Delete[smallest_key, begin) + Delete[end, largest_key] 3. Delete the tombstone through Manul Compact CompactRange(option, nullptr, nullptr) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379 Reviewed By: ajkr Differential Revision: D45840358 Pulled By: cbi42 fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4	2023-05-18 13:25:01 -07:00
Peter Dillinger	459969e993	Simplify detection of x86 CPU features (#11419 ) Summary: Background - runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, where performance is dramatically affected by the availability of certain CPU instructions and code using intrinsics for those instructions. And Java builds with native library try to be broadly compatible but performant. What has changed is that CRC32c is no longer the most efficient cheecksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically impacted by the availability of certain CPU instructions. For example, on my Skylake system using db_bench (similar on an older Skylake system without AVX512): PORTABLE=1 empty USE_SSE : xxh3->8 GB/s crc32c->0.8 GB/s (no SSE4.2 nor AVX2 instructions) PORTABLE=1 USE_SSE=1 : xxh3->19 GB/s crc32c->16 GB/s (with SSE4.2 and AVX2) PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s crc32c->16 GB/s (also some AVX512) Testing a ~10 year old system, with SSE4.2 but without AVX2, crc32c is a similar speed to the new systems but xxh3 is only about half that speed, also 8GB/s like the non-AVX2 compile above. Given that xxh3 has specific optimization for AVX2, I think we can infer that that crc32c is only fastest for that ~2008-2013 period when SSE4.2 was included but not AVX2. And given that xxh3 is only about 2x slower on these systems (not like >10x slower for unoptimized crc32c), I don't think we need to invest too much in optimally adapting to these old cases. x86 hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is fine IMHO. This change does two related things: * Remove runtime CPU detection for optimizing CRC32c on x86. Maintaining this code is non-zero work, and compiling special code that doesn't work on the configured target instruction set for code generation is always dubious. (On the one hand we have to ensure the CRC32c code uses SSE4.2 but on the other hand we have to ensure nothing else does.) * Detect CPU features in source code, not in build scripts. Although there are some hypothetical advantages to detectiong in build scripts (compiler generality), RocksDB supports at least three build systems: make, cmake, and buck. It's not practical to support feature detection on all three, and we have suffered from missed optimization opportunities by relying on missing or incomplete detection in cmake and buck. We also depend on some components like xxhash that do source code detection anyway. In more detail: * `HAVE_SSE42`, `HAVE_AVX2`, and `HAVE_PCLMUL` replaced by standard macros `__SSE4_2__`, `__AVX2__`, and `__PCLMUL__`. * MSVC does not provide high fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from `__AVX__` or `__AVX2__` in a compatibility header. In rare cases of false negative or false positive feature detection, a build engineer should be able to set defines to work around the issue. * `__POPCNT__` is another standard define, but we happen to only need it on MSVC, where it is set by that compatibility header, or can be set by the build engineer. * `PORTABLE` can be set to a CPU type, e.g. "haswell", to compile for that CPU type. * `USE_SSE` is deprecated, now equivalent to PORTABLE=haswell, which roughly approximates its old behavior. Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11419 Test Plan: existing tests, CI Manual performance tests after the change match the before above (none expected with make build). We also see AVX2 optimized Bloom filter code enabled when expected, by injecting a compiler error. (Performance difference is not big on my current CPU.) Reviewed By: ajkr Differential Revision: D45489041 Pulled By: pdillinger fbshipit-source-id: 60ceb0dd2aa3b365c99ed08a8b2a087a9abb6a70	2023-05-09 22:25:45 -07:00
Jeff Palm	6b67b561bc	util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (#11371 ) Summary: util/ribbon_test.cc: avoid ambiguous reversed operator error in c++20 (and enable checking for the error) Code would produce errors like this, when compiled with -Wambiguous-reversed-operator under c++20. ``` util/ribbon_test.cc:695:20: error: ISO C++20 considers use of overloaded operator '!=' (with operand types 'KeyGen' (aka '(anonymous namespace)::StandardKeyGen') and 'KeyGen') to be ambiguou s despite there being a unique best viable function with non-reversed arguments [-Werror,-Wambiguous-reversed-operator] while (cur != batch_end) { ~~~ ^ ~~~~~~~~~ util/ribbon_test.cc:111:8: note: candidate function with non-reversed arguments bool operator!=(const StandardKeyGen& other) { ^ util/ribbon_test.cc:107:8: note: ambiguous candidate function with reversed arguments bool operator==(const StandardKeyGen& other) { ^ ``` This will become a hard error in future standards. Confirmed that no errors were generated when building using clang and c++20: ``` USE_CLANG=1 USE_COROUTINES=1 make ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11371 Reviewed By: meyering Differential Revision: D44921027 Pulled By: cbi42 fbshipit-source-id: ef25b78260920a4d75a718310688d3a2487ffa87	2023-04-12 13:24:34 -07:00
leipeng	b2c4bc5f73	Makefile: fix a typo: PLATFORM_CFLAGS to PLATFORM_CCFLAGS (#11348 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11348 Reviewed By: hx235 Differential Revision: D44774863 Pulled By: ajkr fbshipit-source-id: ba4bd959650228a71fca6bf62840ae9d7373d6f0	2023-04-07 16:54:05 -07:00
Peter Dillinger	b5827c806c	Revert to LIB_MODE=static for optimized builds (#11195 ) Summary: Continuous performance testing indicates there's a small performance hit with shared library (-fPIC) builds, so while retaining the motivation for https://github.com/facebook/rocksdb/issues/11168, we set the default for DEBUG_LEVEL=0 Makefile builds back to LIB_MODE=static. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11195 Test Plan: CI, with some updated checks and removal of some now obsolete LIB_MODE overrides Reviewed By: cbi42 Differential Revision: D43090576 Pulled By: pdillinger fbshipit-source-id: 755fe5d07005f85caf24e16f90228ffd46a6e250	2023-02-07 11:26:55 -08:00
Peter Dillinger	54d72085b5	Fix regression_test.sh for LIB_MODE=shared default (#11194 ) Summary: Need to scp the .so files. Switched to tar+ssh to support symlinks, faster handling of multiple files, and compression. Also fixing some holes in 'make clean' as I've noticed files like 'librocksdb.so.7.7.0', 'librocksdb_test_debug.so', 'librocksdb_tools_debug.so' hanging around after `make clean` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11194 Test Plan: Manually triggered regression test runs with change, manual `make clean` https://fburl.com/sandcastle/gnxy5lvc https://fburl.com/sandcastle/4pxodwh7 Reviewed By: cbi42 Differential Revision: D43069065 Pulled By: pdillinger fbshipit-source-id: 48552b5980956784a1fdb40638d9e8ad6db51900	2023-02-06 19:44:25 -08:00
Peter Dillinger	cf756ed916	Use LIB_MODE=shared build by default with make (#11168 ) Summary: With https://github.com/facebook/rocksdb/issues/11150 this becomes a practical change that I think is overall good for developer efficiency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11168 Test Plan: More efficient build of all unit tests and tools: ``` $ git clean -fdx $ du -sh . 522M . $ /usr/bin/time make -j32 LIB_MODE=static ... 14270.63user 1043.33system 11:19.85elapsed 2252%CPU (0avgtext+0avgdata 1929944maxresident)k ... $ du -sh . 62G . $ ``` Vs. ``` $ git clean -fdx $ du -sh . 522M . $ /usr/bin/time make -j32 LIB_MODE=shared ... 9479.87user 478.26system 7:20.82elapsed 2258%CPU (0avgtext+0avgdata 1929272maxresident)k ... $ du -sh . 5.4G . $ ``` So 1/3 less build time and >90% less space usage. Individual unit test edit-compile-run is not too different. Modifying an average unit test source file: ``` $ touch db/version_builder_test.cc $ /usr/bin/time make -j32 LIB_MODE=static version_builder_test ... 34.74user 3.37system 0:38.29elapsed 99%CPU (0avgtext+0avgdata 945520maxresident)k ``` Vs. ``` $ touch db/version_builder_test.cc $ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test ... 116.26user 43.91system 0:28.65elapsed 559%CPU (0avgtext+0avgdata 675160maxresident)k ``` A little faster with shared. However, modifying an average DB implementation file has an extra linking step with shared lib: ``` $ touch db/db_impl/db_impl_files.cc $ /usr/bin/time make -j32 LIB_MODE=static version_builder_test ... 33.17user 5.13system 0:39.70elapsed 96%CPU (0avgtext+0avgdata 945544maxresident)k ``` Vs. ``` $ touch db/db_impl/db_impl_files.cc $ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test ... 40.80user 4.66system 0:45.54elapsed 99%CPU (0avgtext+0avgdata 1056340maxresident)k ``` A little slower with shared. On the whole, should be faster and lighter weight because of the many unit test files case Reviewed By: cbi42 Differential Revision: D42894004 Pulled By: pdillinger fbshipit-source-id: 9e827e52ace79b86f849b6a24466e318b4b605a7	2023-02-03 15:28:52 -08:00
sdong	4720ba4391	Remove RocksDB LITE (#11147 ) Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE \| egrep '[.](cc\|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2	2023-01-27 13:14:19 -08:00
Wenlong Zhang	1cfe3528a2	support loongarch64 for rocksdb (#10036 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10036 Reviewed By: hx235 Differential Revision: D42424074 Pulled By: ajkr fbshipit-source-id: 004adb75005a26bd01c5d568d1ec6ac442cd59dd	2023-01-13 08:42:44 -08:00
mrambacher	559aaa3577	Add ability to have unit tests for ROCKSDB_PLUGINS (#11052 ) Summary: This is based on speedb PR [143](https://github.com/speedb-io/speedb/pull/143). This PR adds the ability to add a xxx_TESTS variable to the make or cmake files for a plugin. When set, those files will be added to the unit tests built and executed by the corresponding make system. Note that the rule for building plugin tests via make could be expanded to almost every other unit test in RocksDB. This expansion would allow for a much smaller/simpler Makefile and make it easier to add new test files to RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11052 Reviewed By: cbi42 Differential Revision: D42212269 Pulled By: ajkr fbshipit-source-id: d02668f7f4466900d63c90bb4f7962d23fcc7114	2022-12-30 16:55:58 -08:00
xiaochenfan	0993c9225f	Fix broken dependency: update zlib from 1.2.12 to 1.2.13 (#10833 ) Summary: zlib(https://zlib.net/) has released v1.2.13. 1.2.12 is no longer available for downloading and Makefile for rocksdb will be broken due to can't find the source .tar.gz. https://nvd.nist.gov/vuln/detail/CVE-2022-37434 This pr update the version number and the shasum of new .tar.gz file. (1.2.13) Fixes https://github.com/facebook/rocksdb/issues/10876 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10833 Reviewed By: hx235 Differential Revision: D40575954 Pulled By: ajkr fbshipit-source-id: 3e560e453ddf58d045214fc4e64f83bef91f22e5	2022-11-14 11:49:06 -08:00
Peter Dillinger	e466173d5c	Print stack traces on frozen tests in CI (#10828 ) Summary: Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off. For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828 Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI Reviewed By: hx235 Differential Revision: D40447634 Pulled By: pdillinger fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1	2022-10-18 00:35:35 -07:00
Anatol Pomozov	4a83b16ce3	Use grep instead of obsolete egrep (#10701 ) Summary: It fixes "egrep: warning: egrep is obsolescent; using grep -E" warning at the systems with newer gnu grep. https://www.phoronix.com/news/GNU-Grep-3.8-Stop-egrep-fgrep Pull Request resolved: https://github.com/facebook/rocksdb/pull/10701 Reviewed By: ajkr Differential Revision: D39737908 Pulled By: cbi42 fbshipit-source-id: 13f3ccfc1a37d18541d156a6b4c2ba24f6c66589	2022-09-22 16:58:21 -07:00
anand76	fb9a025892	Fix platform 10 build with folly (#10708 ) Summary: Change the library order in PLATFORM_LDFLAGS to enable fbcode platform 10 build with folly. This PR also has a few fixes for platform 10 compiler errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10708 Test Plan: ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_COROUTINES=1 make -j64 check ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_FOLLY=1 make -j64 check Reviewed By: ajkr Differential Revision: D39666590 Pulled By: anand1976 fbshipit-source-id: 256a1127ef561399cd6299a6a392ca29bd68ca44	2022-09-21 14:43:44 -07:00
Hui Xiao	f79b3d19a7	Inject spurious wakeup and sleep before acquiring db mutex to expose race condition (#10291 ) Summary: Context/Summary: Previous experience with bugs and flaky tests taught us there exist features in RocksDB vulnerable to race condition caused by acquiring db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeup and sleep to cause acquiring db mutex at various timing in order to expose such race condition Testing: - `COERCE_CONTEXT_SWITCH=1 make -j56 check / make -j56 db_stress` should reveal - flaky tests caused by db mutex related race condition - Reverted https://github.com/facebook/rocksdb/pull/9528 - A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1` followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10` - `COERCE_CONTEXT_SWITCH=1` can cause expected test failure (i.e, expose target TSAN data race error) within 10 run while the other couldn't. - This proves our injection can expose flaky tests caused by db mutex related race condition faster. - known or new race-condition-type of internal bug by continuously running this PR - Performance - High ops-threads time: COERCE_CONTEXT_SWITCH=1 regressed by 4 times slower (2:01.16 vs 0:22.10 elapsed ). This PR will be run as a separate CI job so this regression won't affect any existing job. ``` TEST_TMPDIR=$db /usr/bin/time ./db_stress \ --ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \ --write_buffer_size=524288 —target_file_size_base=524288 —ingest_external_file_one_in=100 —compact_files_one_in=1000 —compact_range_one_in=1000 ``` - Start-up time: COERCE_CONTEXT_SWITCH=1 didn't regress by 25% (0:01.51 vs 0:01.29 elapsed) ``` TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress TEST_TMPDIR=$db /usr/bin/time ./db_stress \ --ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10291 Reviewed By: ajkr Differential Revision: D39231182 Pulled By: hx235 fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6	2022-09-12 13:55:23 -07:00
anand76	be09943fb5	Build and link libfolly with RocksDB (#10103 ) Summary: The current integration with folly requires cherry-picking folly source files to include in RocksDB for external CI builds. Its not scaleable as we depend on more features in folly, such as coroutines. This PR adds a dependency from RocksDB to the folly library when ```USE_FOLLY``` or ```USE_COROUTINES``` are set. We build folly using the build scripts in ```third-party/folly```, relying on it to download and build its dependencies. A new ```Makefile``` target, ```build_folly```, is provided to make building folly easier. A new option, ```USE_FOLLY_LITE``` is added to retain the old model of compiling selected folly sources with RocksDB. This might be useful for short-term development. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10103 Reviewed By: pdillinger Differential Revision: D38426787 Pulled By: anand1976 fbshipit-source-id: 33bc84abd9fdc7e2567749f02aa1b2494eb62b2f	2022-09-11 21:40:11 -07:00
Jay Zhuang	d9e71fb2c5	Fix periodic_task unable to re-register the same task type (#10379 ) Summary: Timer has a limitation that it cannot re-register a task with the same name, because the cancel only mark the task as invalid and wait for the Timer thread to clean it up later, before the task is cleaned up, the same task name cannot be added. Which makes the task option update likely to fail, which basically cancel and re-register the same task name. Change the periodic task name to a random unique id and store it in periodic_task_scheduler. Also refactor the `periodic_work` to `periodic_task` to make each job function as a `task`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10379 Test Plan: unittests Reviewed By: ajkr Differential Revision: D38000615 Pulled By: jay-zhuang fbshipit-source-id: e4135f9422e3b53aaec8eda54f4e18ce633a279e	2022-08-25 18:52:37 -07:00
Jay Zhuang	a3acf2ef87	Add seqno to time mapping (#10338 ) Summary: Which will be used for tiered storage to preclude hot data from compacting to the cold tier (the last level). Internally, adding seqno to time mapping. A periodic_task is scheduled to record the current_seqno -> current_time in certain cadence. When memtable flush, the mapping informaiton is stored in sstable property. During compaction, the mapping information are merged and get the approximate time of sequence number, which is used to determine if a key is recently inserted or not and preclude it from the last level if it's recently inserted (within the `preclude_last_level_data_seconds`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338 Test Plan: CI Reviewed By: siying Differential Revision: D37810187 Pulled By: jay-zhuang fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f	2022-07-14 21:49:34 -07:00
Jay Zhuang	6ce0b2ca34	Tiered Compaction: per key placement support (#9964 ) Summary: Support per_key_placement for last level compaction, which will be used for tiered compaction. * compaction iterator reports which level a key should output to; * compaction get the output level information and check if it's safe to output the data to penultimate level; * all compaction output files will be installed. * extra internal compaction stats added for penultimate level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964 Test Plan: * Unittest * db_bench, no significate difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4 * microbench manual compaction no significate difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd * run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m) Reviewed By: ajkr Differential Revision: D36249494 Pulled By: jay-zhuang fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3	2022-07-13 20:54:49 -07:00
Manuel Ung	071fe39c05	Allow user to pass git command to makefile (#10318 ) Summary: This allows users to pass their git command with extra options if necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10318 Reviewed By: ajkr Differential Revision: D37661175 Pulled By: lth fbshipit-source-id: 2a7cf27626c74f167471e6ec57e3870630a582b0	2022-07-06 14:28:00 -07:00
Levi Tamasi	c73d2a9d18	Add API for writing wide-column entities (#10242 ) Summary: The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds a new API called `PutEntity` that can be used to write a wide-column entity to the database. The new API is added to both `DB` and `WriteBatch`. Note that currently there is no way to retrieve these entities; more precisely, all read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they encounter a wide-column entity that is required to answer a query. Read-side support (as well as other missing functionality like `Merge`, compaction filter, and timestamp support) will be added in later PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D37369748 Pulled By: ltamasi fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d	2022-06-25 15:30:47 -07:00
Gang Liao	c965c9ef65	Read blob from blob cache if exists when GetBlob() (#10178 ) Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`. `MultiGetBlob()` will be included in the next PR. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178 Reviewed By: ltamasi Differential Revision: D37250507 Pulled By: gangliao fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b	2022-06-17 15:22:59 -07:00
Peter Dillinger	1aac814578	Use optimized folly DistributedMutex in LRUCache when available (#10179 ) Summary: folly DistributedMutex is faster than standard mutexes though imposes some static obligations on usage. See https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h for details. Here we use this alternative for our Cache implementations (especially LRUCache) for better locking performance, when RocksDB is compiled with folly. Also added information about which distributed mutex implementation is being used to cache_bench output and to DB LOG. Intended follow-up: * Use DMutex in more places, perhaps improving API to support non-scoped locking * Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently) Credit: Thanks Siying for reminding me about this line of work that was previously left unfinished. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179 Test Plan: for correctness, existing tests. CircleCI config updated. Also Meta-internal buck build updated. For performance, ran simultaneous before & after cache_bench. Out of three comparison runs, the middle improvement to ops/sec was +21%: Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode compiler) ``` Complete in 20.201 s; Rough parallel ops/sec = 1584062 Thread ops/sec = 107176 Operation latency (ns): Count: 32000000 Average: 9257.9421 StdDev: 122412.04 Min: 134 Median: 3623.0493 Max: 56918500 Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63 ``` New: (add USE_FOLLY=1) ``` Complete in 16.674 s; Rough parallel ops/sec = 1919135 (+21%) Thread ops/sec = 135487 Operation latency (ns): Count: 32000000 Average: 7304.9294 StdDev: 108530.28 Min: 132 Median: 3777.6012 Max: 91030902 Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83 ``` Reviewed By: anand1976 Differential Revision: D37182983 Pulled By: pdillinger fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1	2022-06-17 13:08:45 -07:00
Peter Dillinger	126c223714	Remove deprecated block-based filter (#10184 ) Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80	2022-06-16 15:51:33 -07:00
Yanqin Jin	1777e5f7e9	Snapshots with user-specified timestamps (#9879 ) Summary: In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable. It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29. This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with user-specified timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps. In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. You can see that the reader actually has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number is written. This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will ensure any two snapshots with timestamps should satisfy the following: ``` snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts ``` If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with associated timestamp. Code example ```cpp // Create a timestamped snapshot when committing transaction. txn->SetCommitTimestamp(100); txn->SetSnapshotOnNextOperation(); txn->Commit(); // A wrapper API for convenience Status Transaction::CommitAndTryCreateSnapshot( std::shared_ptr<TransactionNotifier> notifier, TxnTimestamp ts, std::shared_ptr<const Snapshot>* ret); // Create a timestamped snapshot if caller guarantees no concurrent writes std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100); ``` The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp. ```cpp // Return the timestamped snapshot correponding to given timestamp. If ts is // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present. // Othersise, we return the snapshot whose timestamp is equal to `ts`. If no // such snapshot exists, then we return null. std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const; // Return the latest timestamped snapshot if present. std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const; ``` We also provide two additional APIs for stats collection and reporting purposes. ```cpp Status TransactionDB::GetAllTimestampedSnapshots( std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`. Status TransactionDB::GetTimestampedSnapshots( TxnTimestamp ts_lb, TxnTimestamp ts_ub, std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; ``` To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold. ```cpp void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts); ``` Before shutdown, RocksDB will release all timestamped snapshots. Comparison with user-defined timestamp and how they can be combined: User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps. Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection. In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent). The timestamped snapshot supports the semantics of reading at an exact point in time. Timestamped snapshots can also be used with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879 Test Plan: ``` make check TEST_TMPDIR=/dev/shm make crash_test_with_txn ``` Reviewed By: siying Differential Revision: D35783919 Pulled By: riversand963 fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4	2022-06-10 16:07:03 -07:00
Levi Tamasi	e9c74bc474	Add wide column serialization primitives (#9915 ) Summary: The patch adds some low-level logic that can be used to serialize/deserialize a sorted vector of wide columns to/from a simple binary searchable string representation. Currently, there is no user-facing API; this will be implemented in subsequent stages. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9915 Test Plan: `make check` Reviewed By: siying Differential Revision: D35978076 Pulled By: ltamasi fbshipit-source-id: 33f5f6628ec3bcd8c8beab363b1978ac047a8788	2022-06-03 20:54:48 -07:00

1 2 3 4 5 ...

920 Commits