rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-30 13:41:46 +00:00

Author	SHA1	Message	Date
Yi Wu	982cec22af	Fix TARGETS file tests list Summary: 1. The buckifier script assume each test "foo" comes with a .cc file of the same name (i.e. foo.cc). Update cassandra tests to follow this pattern so that the buckifier script can recognize them. 2. add blob_db_test Closes https://github.com/facebook/rocksdb/pull/2506 Differential Revision: D5331517 Pulled By: yiwu-arbug fbshipit-source-id: 86f3eba471fc621186ab44cbd073b6162cde8e57	2017-06-27 14:12:02 -07:00
Ewout Prangsma	51778612c9	Encryption at rest support Summary: This PR adds support for encrypting data stored by RocksDB when written to disk. It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files. The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done. Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!). The Counter operation mode uses an initial counter & random initialization vector (IV). Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize). The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there. To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests. Typically you would run it like this: ``` ENCRYPTED_ENV=1 make check_some ``` There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings. Closes https://github.com/facebook/rocksdb/pull/2424 Differential Revision: D5322178 Pulled By: sdwilsh fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f	2017-06-26 16:56:24 -07:00
Chen Shen	cbd825deea	Create a MergeOperator for Cassandra Row Value Summary: This PR implements the MergeOperator for Cassandra Row Values. Closes https://github.com/facebook/rocksdb/pull/2289 Differential Revision: D5055464 Pulled By: scv119 fbshipit-source-id: 45f276ef8cbc4704279202f6a20c64889bc1adef	2017-06-16 14:27:00 -07:00
Siying Dong	95b0e89b5d	Improve write buffer manager (and allow the size to be tracked in block cache) Summary: Improve write buffer manager in several ways: 1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower. 2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit. 3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable. Closes https://github.com/facebook/rocksdb/pull/2350 Differential Revision: D5110648 Pulled By: siying fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017	2017-06-02 14:26:56 -07:00
Maysam Yabandeh	5a9b4d7435	Retire memenv https://github.com/facebook/rocksdb/pull/2082 Summary: This is a manual commit of this PR: Retire InMemoryEnv in favor of MockEnv #2082 With MockEnv doing the same yet being more mature, InMemoryEnv is redundant. Reviewed By: IslamAbdelRahman Differential Revision: D5162323 fbshipit-source-id: 59fd0082a891dc99cc531e4da9d68bf891eae3f5	2017-06-01 15:41:20 -07:00
Tamir Duberstein	0dc3040d54	db: avoid `#include`ing malloc and jemalloc simultaneously Summary: This fixes a compilation failure on Linux when the system libc is not glibc. jemalloc's configure script incorrectly assumes that glibc is always used on Linux systems, producing glibc-style signatures; when the system libc is e.g. musl, the following error is observed: ``` [ 0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0, from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77: /x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void)' has a different exception specifier size_t malloc_usable_size(void ); ^~~~~~~~~~~~~~~~~~ In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0: /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()' # define je_malloc_usable_size malloc_usable_size ^ /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size' JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size( ^~~~~~~~~~~~~~~~~~~~~ CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed ``` This works around the issue by rearranging the sources such that jemalloc's headers are never in the same scope as the system's malloc header. The jemalloc issue has been reported as well, see: https://github.com/jemalloc/jemalloc/issues/778. cc tschottdorf Closes https://github.com/facebook/rocksdb/pull/2188 Differential Revision: D5163048 Pulled By: siying fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d	2017-05-31 22:43:02 -07:00
Yi Wu	ad19eb8686	Fixing blob db sequence number handling Summary: Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch. Stacking on #2375. Closes https://github.com/facebook/rocksdb/pull/2385 Differential Revision: D5148358 Pulled By: yiwu-arbug fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33	2017-05-31 10:56:45 -07:00
Yi Wu	345878a7fb	update blob_db_test Summary: Re-enable blob_db_test with some update: * Commented out delay at the end of GC tests. Will update the logic later with sync point to properly trigger GC. * Added some helper functions. Also update make files to include blob_dump tool. Closes https://github.com/facebook/rocksdb/pull/2375 Differential Revision: D5133793 Pulled By: yiwu-arbug fbshipit-source-id: 95470b26d0c1f9592ba4b7637e027fdd263f425c	2017-05-30 22:26:13 -07:00
Yi Wu	578fb0b1dc	Simple blob file dumper Summary: A simple blob file dumper. Closes https://github.com/facebook/rocksdb/pull/2242 Differential Revision: D5097553 Pulled By: yiwu-arbug fbshipit-source-id: c6e00d949fcd3658f9f68da9352f06339fac418d	2017-05-23 10:42:59 -07:00
Adam Retter	88c818e437	Replace deprecated RocksDB#addFile with RocksDB#ingestExternalFile Summary: Previously the Java implementation of `RocksDB#addFile` was both incomplete and not inline with the C++ API. Rather than fix it, as I see that `rocksdb::DB::AddFile` is now deprecated in favour of `rocksdb::DB::IngestExternalFile`, I have removed the old broken implementation and implemented `RocksDB#ingestExternalFile`. Closes https://github.com/facebook/rocksdb/issues/2261 Closes https://github.com/facebook/rocksdb/pull/2291 Differential Revision: D5061264 Pulled By: sagar0 fbshipit-source-id: 85df0899fa1b1fc3535175cac4f52353511d4104	2017-05-22 10:27:23 -07:00
Yi Wu	86d5492530	Fix build error with blob DB. Summary: snprintf is in <stdio.h> and not in namespace std. Closes https://github.com/facebook/rocksdb/pull/2287 Reviewed By: anirbanr-fb Differential Revision: D5054752 Pulled By: yiwu-arbug fbshipit-source-id: 356807ec38f3c7d95951cdb41f31a3d3ae0714d4	2017-05-15 14:05:46 -07:00
Andrew Kryczka	3fa9a39c68	Add GetAllKeyVersions API Summary: - Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex - Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type. - Migrated the "ldb idump" subcommand to use this function - The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(. Closes https://github.com/facebook/rocksdb/pull/2232 Differential Revision: D4976007 Pulled By: ajkr fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e	2017-05-12 15:54:06 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
Andrew Kryczka	f6a27d0bce	Extract statistics tests into separate file Summary: I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there. Closes https://github.com/facebook/rocksdb/pull/2211 Differential Revision: D4951558 Pulled By: ajkr fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4	2017-04-26 14:47:23 -07:00
Andrew Kryczka	e5e545a021	Reunite checkpoint and backup core logic Summary: These code paths forked when checkpoint was introduced by copy/pasting the core backup logic. Over time they diverged and bug fixes were sometimes applied to one but not the other (like fix to include all relevant WALs for 2PC), or it required extra effort to fix both (like fix to forge CURRENT file). This diff reunites the code paths by extracting the core logic into a function, CreateCustomCheckpoint(), that is customizable via callbacks to implement both checkpoint and backup. Related changes: - flush_before_backup is now forcibly enabled when 2PC is enabled - Extracted CheckpointImpl class definition into a header file. This is so the function, CreateCustomCheckpoint(), can be called by internal rocksdb code but not exposed to users. - Implemented more functions in DummyDB/DummyLogFile (in backupable_db_test.cc) that are used by CreateCustomCheckpoint(). Closes https://github.com/facebook/rocksdb/pull/1932 Differential Revision: D4622986 Pulled By: ajkr fbshipit-source-id: 157723884236ee3999a682673b64f7457a7a0d87	2017-04-24 15:06:46 -07:00
Siying Dong	ff97287016	Refactor compaction picker code Summary: 1. Move universal compaction picker to separate files compaction_picker_universal.cc and compaction_picker_universal.h. 2. Rename some functions to make the code easier to understand. 3. Move leveled compaction picking code to a dedicated class, so that we we don't need to pass some common variable around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() to smaller functions. Closes https://github.com/facebook/rocksdb/pull/2100 Differential Revision: D4845948 Pulled By: siying fbshipit-source-id: efa0ab4	2017-04-06 20:09:34 -07:00
Sagar Vemuri	343b59d6ee	Move various string utility functions into string_util Summary: This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper. Check the sub-commits for easier reviewing. Closes https://github.com/facebook/rocksdb/pull/2094 Differential Revision: D4837730 Pulled By: sagar0 fbshipit-source-id: 344278a	2017-04-06 14:54:12 -07:00
Yi Wu	df6f5a3772	Move memtable related files into memtable directory Summary: Move memtable related files into memtable directory. Closes https://github.com/facebook/rocksdb/pull/2087 Differential Revision: D4829242 Pulled By: yiwu-arbug fbshipit-source-id: ca70ab6	2017-04-06 14:09:13 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Siying Dong	ce64b8b719	Divide db/db_impl.cc Summary: db_impl.cc is too large to manage. Divide db_impl.cc into db/db_impl.cc, db/db_impl_compaction_flush.cc, db/db_impl_files.cc, db/db_impl_open.cc and db/db_impl_write.cc. Closes https://github.com/facebook/rocksdb/pull/2095 Differential Revision: D4838188 Pulled By: siying fbshipit-source-id: c5f3059	2017-04-05 17:24:19 -07:00
Andrew Kryczka	e2c6c06366	add TimedEnv Summary: I've needed Env timing measurements a few times now, so finally built something for it. Closes https://github.com/facebook/rocksdb/pull/2073 Differential Revision: D4811231 Pulled By: ajkr fbshipit-source-id: 218a249	2017-04-04 11:24:12 -07:00
Siying Dong	6ef8c620d3	Move auto_roll_logger and filename out of db/ Summary: It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency. Closes https://github.com/facebook/rocksdb/pull/2080 Differential Revision: D4821141 Pulled By: siying fbshipit-source-id: ca7d768	2017-04-03 18:39:14 -07:00
Adam Retter	0ee7f04039	Added missing options to RocksJava Summary: This adds almost all missing options to RocksJava Closes https://github.com/facebook/rocksdb/pull/2039 Differential Revision: D4779991 Pulled By: siying fbshipit-source-id: 4a1bf28	2017-03-30 12:09:21 -07:00
Maysam Yabandeh	a2f7a514d1	Refactoring Summary: This is the first split of https://github.com/facebook/rocksdb/pull/1891 and will be needed for the upcoming partitioned filter patch. Closes https://github.com/facebook/rocksdb/pull/1949 Differential Revision: D4652152 Pulled By: maysamyabandeh fbshipit-source-id: 9801778	2017-03-03 18:24:12 -08:00
Siying Dong	ba4c77bd6b	Divide external_sst_file_test Summary: Separate the platform dependent tests from external_sst_file_test. Only those tests need to run on platforms like OSX Closes https://github.com/facebook/rocksdb/pull/1923 Differential Revision: D4622461 Pulled By: siying fbshipit-source-id: d2d6f04	2017-02-28 14:24:11 -08:00
Siying Dong	8ad0fcdf99	Separate small subset tests in DBTest Summary: Separate a smal subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux. Closes https://github.com/facebook/rocksdb/pull/1924 Differential Revision: D4616702 Pulled By: siying fbshipit-source-id: 13e6549	2017-02-27 12:24:11 -08:00
Siying Dong	1ba2804b7f	Remove XFunc tests Summary: Xfunc is hardly used. Remove it to keep the code simple. Closes https://github.com/facebook/rocksdb/pull/1905 Differential Revision: D4603220 Pulled By: siying fbshipit-source-id: 731f96d	2017-02-23 12:09:11 -08:00
Islam AbdelRahman	574b543f80	Rename merger.h -> merging_iterator.h Summary: merger.h was always a confusing name for me, simply give the file a better name Closes https://github.com/facebook/rocksdb/pull/1836 Differential Revision: D4505357 Pulled By: IslamAbdelRahman fbshipit-source-id: 07b28d8	2017-02-02 16:54:19 -08:00
Andrew Kryczka	17c1180603	Generalize Env registration framework Summary: The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config. Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file. This diff generalizes the Env registration logic to work with arbitrary types. - Generalized registration and instantiation code by templating them - The entire implementation is in a header file as that's Google style guide's recommendation for template definitions - Pattern match with std::regex_match rather than checking prefix, which was the previous behavior - Rename functions/files to be non-Env-specific Closes https://github.com/facebook/rocksdb/pull/1776 Differential Revision: D4421933 Pulled By: ajkr fbshipit-source-id: 34647d1	2017-01-25 16:09:14 -08:00
Yi Wu	c270735861	Iterator should be in corrupted status if merge operator return false Summary: Iterator should be in corrupted status if merge operator return false. Also add test to make sure if max_successive_merges is hit during write, data will not be lost. Closes https://github.com/facebook/rocksdb/pull/1665 Differential Revision: D4322695 Pulled By: yiwu-arbug fbshipit-source-id: b327b05	2016-12-16 11:09:16 -08:00
Igor Canadi	3f407b065c	Kill flashcache code in RocksDB Summary: Now that we have userspace persisted cache, we don't need flashcache anymore. Closes https://github.com/facebook/rocksdb/pull/1588 Differential Revision: D4245114 Pulled By: igorcanadi fbshipit-source-id: e2c1c72	2016-12-01 10:09:22 -08:00
Andrew Kryczka	e333528991	DeleteRange write path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1578 Differential Revision: D4241171 Pulled By: ajkr fbshipit-source-id: ce5fd83	2016-11-29 11:09:22 -08:00
Islam AbdelRahman	86eb2b9ad9	Fix src.mk	2016-11-16 18:05:19 -08:00
Islam AbdelRahman	869ae5d786	Support IngestExternalFile (remove AddFile restrictions) Summary: Changes in the diff API changes: - Introduce IngestExternalFile to replace AddFile (I think this make the API more clear) - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file) - Deprecate AddFile() API Logic changes: - If our file overlap with the memtable we will flush the memtable - We will find the first level in the LSM tree that our file key range overlap with the keys in it - We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it - We will assign a global sequence number to our new file - Remove AddFile restrictions by using global sequence numbers Other changes: - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob Test Plan: unit tests (still need to add more) addfile_stress (https://reviews.facebook.net/D65037) Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong Reviewed By: sdong Subscribers: jkedgar, hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65061	2016-10-20 17:05:32 -07:00
Andrew Kryczka	6fbe96baf8	Compaction Support for Range Deletion Summary: This diff introduces RangeDelAggregator, which takes ownership of iterators provided to it via AddTombstones(). The tombstones are organized in a two-level map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data copy by holding Slices returned by the iterator, which remain valid thanks to pinning. For compaction, we create a hierarchical range tombstone iterator with structure matching the iterator over compaction input data. An aggregator based on that iterator is used by CompactionIterator to determine which keys are covered by range tombstones. In case of merge operand, the same aggregator is used by MergeHelper. Upon finishing each file in the compaction, relevant range tombstones are added to the output file's range tombstone metablock and file boundaries are updated accordingly. To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete() considers tombstones in the key's snapshot stripe. When this function is used outside of compaction, it also checks newer stripes, which can contain covering tombstones. Currently the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges within a stripe such that binary search can be used. RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range to a new table's range tombstone meta-block. Since range tombstones may fall in the gap between files, we may need to extend some files' key-ranges. The strategy is (1) first file extends as far left as possible and other files do not extend left, (2) all files extend right until either the start of the next file or the end of the last range tombstone in the gap, whichever comes first. One other notable change is adding release/move semantics to ScopedArenaIterator such that it can be used to transfer ownership of an arena-allocated iterator, similar to how unique_ptr is used for malloc'd data. Depends on D61473 Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927 Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62205	2016-10-18 12:04:56 -07:00
Injun Song	7260662b39	Add Java API for SstFileWriter Add Java API for SstFileWriter. Closes https://github.com/facebook/rocksdb/issues/1248	2016-09-29 17:04:41 -04:00
Yi Wu	9ed928e7a9	Split DBOptions into ImmutableDBOptions and MutableDBOptions Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs. Test Plan: make all check Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64065	2016-09-23 16:34:04 -07:00
Islam AbdelRahman	52ee07b021	Move AddFile() tests to external_sst_file_test.cc Summary: Simply move the tests Test Plan: make check -j64 Reviewers: andrewkr, lightmark, yiwu, yhchiang, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62529	2016-09-07 15:41:54 -07:00
Injun Song	ce1be2ce37	Fix build error on Windows (AppVeyor) (#1315 ) Add 'cf_options' to source list and db_imple.cc fix casting	2016-09-06 08:41:43 -07:00
Yi Wu	a88677d2cf	Remove ImmutableCFOptions from public API Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (`5665e5e285`) but later its reference in /include/rocksdb/table.h is removed. Test Plan: make all check Reviewers: IslamAbdelRahman, sdong, yhchiang Reviewed By: yhchiang Subscribers: yhchiang, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63177	2016-09-02 14:16:31 -07:00
Islam AbdelRahman	e9b2af87f8	Expose ThreadPool under include/rocksdb/threadpool.h Summary: This diff split ThreadPool to -ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h) -ThreadPoolImpl (actual implementation in util/threadpool_imp.h) This allow us to expose ThreadPool to the user so we can use it as an option later Test Plan: existing unit tests Reviewers: andrewkr, yiwu, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62085	2016-08-26 10:41:35 -07:00
Adam Retter	ffdf6eee19	Add Status to RocksDBException so that meaningful function result Status from the C++ API isn't lost (#1273 )	2016-08-22 11:01:42 -07:00
Yi Wu	4cc37f59e5	Introduce ClockCache Summary: Clock-based cache implemenetation aim to have better concurreny than default LRU cache. See inline comments for implementation details. Test Plan: Update cache_test to run on both LRUCache and ClockCache. Adding some new tests to catch some of the bugs that I fixed while implementing the cache. Reviewers: kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61647	2016-08-19 12:28:19 -07:00
sdong	8b79422b52	[Proof-Of-Concept] RocksDB Blob Storage with a blob log file. Summary: This is a proof of concept of a RocksDB blob log file. The actual value of the Put() is appended to a blob log using normal data block format, and the handle of the block is written as the value of the key in RocksDB. The prototype only supports Put() and Get(). It doesn't support DB restart, garbage collection, Write() call, iterator, snapshots, etc. Test Plan: Add unit tests. Reviewers: arahut Reviewed By: arahut Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61485	2016-08-10 17:05:17 -07:00
omegaga	44f5cc57a5	Add time series database (resubmitted) Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows). Test Plan: Add `date_tiered_test`. Reviewers: dhruba, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61653	2016-08-05 15:56:22 -07:00
sdong	7c4615cf1f	A utility function to help users migrate DB after options change Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options. Test Plan: Add unit tests for it. Reviewers: andrewkr, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60783	2016-08-05 15:39:55 -07:00
omegaga	d51dc96a79	Experiments on column-aware encodings Summary: Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format. There is still on-going work on this diff. More refactoring is necessary. Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added. Reviewers: sdong Reviewed By: sdong Subscribers: arahut, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60027	2016-08-01 14:50:19 -07:00
krad	c116b47804	Persistent Read Cache (part 6) Block Cache Tier Implementation Summary: The patch is a continuation of part 5. It glues the abstraction for file layout and metadata, and flush out the implementation of the API. It adds unit tests for the implementation. Test Plan: Run unit tests Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57549	2016-08-01 14:15:14 -07:00
Islam AbdelRahman	f8061a237e	Fix Statistics TickersNameMap miss match with Tickers enum Summary: TickersNameMap is not consistent with Tickers enum. this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics) Test Plan: added new unit test Reviewers: sdong, kradhakrishnan Reviewed By: kradhakrishnan Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61083	2016-07-25 16:05:50 -07:00
Yi Wu	32604e6601	Fix flush not being commit while writing manifest Summary: Fix flush not being commit while writing manifest, which is a recent bug introduced by D60075. The issue: # Options.max_background_flushes > 1 # Background thread A pick up a flush job, flush, then commit to manifest. (Note that mutex is released before writing manifest.) # Background thread B pick up another flush job, flush. When it gets to `MemTableList::InstallMemtableFlushResults`, it notices another thread is commiting, so it quit. # After the first commit, thread A doesn't double check if there are more flush result need to commit, leaving the second flush uncommitted. Test Plan: run the test. Also verify the new test hit deadlock without the fix. Reviewers: sdong, igor, lightmark Reviewed By: lightmark Subscribers: andrewkr, omegaga, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60969	2016-07-21 10:10:41 -07:00
krad	d9cfaa2b16	Persistent Read Cache (6) Persistent cache tier implentation - File layout Summary: Persistent cache tier is the tier abstraction that can work for any block device based device mounted on a file system. The design/implementation can handle any generic block device. Any generic block support is achieved by generalizing the access patten as {io-size, q-depth, direct-io/buffered}. We have specifically tested and adapted the IO path for NVM and SSD. Persistent cache tier consists of there parts : 1) File layout Provides the implementation for handling IO path for reading and writing data (key/value pair). 2) Meta-data Provides the implementation for handling the index for persistent read cache. 3) Implementation It binds (1) and (2) and flushed out the PersistentCacheTier interface This patch provides implementation for (1)(2). Follow up patch will provide (3) and tests. Test Plan: Compile and run check Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57117	2016-07-19 12:01:46 -07:00
Yi Wu	4b95253587	Refactor cache.cc Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc. Test Plan: make check -j64 Reviewers: lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D59655	2016-07-15 10:41:36 -07:00
Yi Wu	6ea41f8527	Fix deadlock when trying update options when write stalls Summary: When write stalls because of auto compaction is disabled, or stop write trigger is reached, user may change these two options to unblock writes. Unfortunately we had issue where the write thread will block the attempt to persist the options, thus creating a deadlock. This diff fix the issue and add two test cases to detect such deadlock. Test Plan: Run unit tests. Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60627	2016-07-12 15:30:38 -07:00
omegaga	e6f68faf99	Update Makefile to fix dependency Summary: In D33849 we updated Makefile to generate .d files for all .cc sources. Since we have more types of source files now, this needs to be updated so that this mechanism can work for new files. Test Plan: change a dependent .h file, re-make and see if .o file is recompiled. Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60591	2016-07-11 13:55:46 -07:00
Islam AbdelRahman	fa813f7478	Update DB::AddFile() to ingest the file to the lowest possible level Summary: DB::AddFile() right now always add the ingested file to L0 update the logic to add the file to the lowest possible level Test Plan: unit tests Reviewers: jkedgar, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59637	2016-06-21 17:57:59 -07:00
sdong	6faddd7c55	Merge db/slice.cc into util/slice.cc Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them. Test Plan: Run existing tests Reviewers: andrewkr, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59409	2016-06-10 16:37:36 -07:00
sdong	b2973eaaeb	Remove options builder Summary: AFIK, options builder is not used by anyone. Remove it. Test Plan: Run all existing tests. Reviewers: IslamAbdelRahman, andrewkr, igor Reviewed By: igor Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59319	2016-06-09 16:56:17 -07:00
krad	d755c62f92	Persistent Read Cache (5) Volatile cache tier implementation Summary: This provides provides an implementation of PersistentCacheTier that is specialized for RAM. This tier does not persist data though. Why do we need this tier ? This is ideal as tier 0. This tier can host data that is too hot. Why can't we use Cache variants ? Yes you can use them instead. This tier can potentially outperform BlockCache in RAW mode by virtue of compression and compressed cache in block cache doesn't seem very popular. Potentially this tier can be modified to under stand the disadvantage of the tier below and retain data that the tier below is bad at handling (for example index and bloom data that is huge in size) Test Plan: Run unit tests added Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57069	2016-06-07 11:10:44 -07:00
krad	3070ed9021	Persistent Read Cache (4) Interface definitions Summary: This diff provides the basic interface definitions of persistent read cache system PersistentCacheOptions captures the persistent read cache options used to configure and control the system PersistentCacheTier provides the basic building block for constructing tiered cache PersistentTieredCache provides a logical abstraction of tiers of cache layered over one another Test Plan: Compile Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57051	2016-06-06 13:17:09 -07:00
Andrew Kryczka	6e6622abb9	Create env_basic_test [pluggable Env part 2] Summary: Extracted basic Env-related tests from mock_env_test and memenv_test into a parameterized test for Envs: env_basic_test. Depends on D58449. (The dependency is here only so I can keep this series of diffs in a chain -- there is no dependency on that diff's code.) Test Plan: ran tests Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58635	2016-06-03 15:13:03 -07:00
Andrew Kryczka	af0c9ac01d	Env registry for URI-based Env selection [pluggable Env part 1] Summary: This enables configurable Envs without recompiling. For example, my next diff will make env_test test an Env created by NewEnvFromUri(). Then, users can determine which Env is tested simply by providing the URI for NewEnvFromUri() (e.g., through a CLI argument or environment variable). The registration process allows us to register any Env that is linked with the RocksDB library, so we can register our internal Envs as well. The registration code is inspired by our internal InitRegistry. Test Plan: new unit test Reviewers: IslamAbdelRahman, lightmark, ldemailly, sdong Reviewed By: sdong Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D58449	2016-06-03 08:15:16 -07:00
Yueh-Hsuan Chiang	88acd932f6	Allows db_bench to take an options file Summary: This patch allows db_bench to initialize it's RocksDB Options via a options file, specified by the --options_file flag. Note that if --options_file flag is set, then it has higher priority than the command-line argument. Test Plan: db_bench_tool_test Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58533	2016-06-02 16:24:14 -07:00
Aaron Gao	5d660258e7	add simulator Cache as class SimCache/SimLRUCache(with test) Summary: add class SimCache(base class with instrumentation api) and SimLRUCache(derived class with detailed implementation) which is used as an instrumented block cache that can predict hit rate for different cache size Test Plan: Add a test case in `db_block_cache_test.cc` called `SimCacheTest` to test basic logic of SimCache. Also add option `-simcache_size` in db_bench. if set with a value other than -1, then the benchmark will use this value as the size of the simulator cache and finally output the simulation result. ``` [gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 1000000 RocksDB: version 4.8 Date: Tue May 17 16:56:16 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 0 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 6.809 micros/op 146874 ops/sec; 16.2 MB/s DB path: [/tmp/rocksdbtest-112628/dbbench] readrandom : 6.343 micros/op 157665 ops/sec; 17.4 MB/s (1000000 of 1000000 found) SIMULATOR CACHE STATISTICS: SimCache LOOKUPs: 986559 SimCache HITs: 264760 SimCache HITRATE: 26.84% [gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 10000000 RocksDB: version 4.8 Date: Tue May 17 16:57:10 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 0 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 5.066 micros/op 197394 ops/sec; 21.8 MB/s DB path: [/tmp/rocksdbtest-112628/dbbench] readrandom : 6.457 micros/op 154870 ops/sec; 17.1 MB/s (1000000 of 1000000 found) SIMULATOR CACHE STATISTICS: SimCache LOOKUPs: 1059764 SimCache HITs: 374501 SimCache HITRATE: 35.34% [gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 100000000 RocksDB: version 4.8 Date: Tue May 17 16:57:32 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 0 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 5.632 micros/op 177572 ops/sec; 19.6 MB/s DB path: [/tmp/rocksdbtest-112628/dbbench] readrandom : 6.892 micros/op 145094 ops/sec; 16.1 MB/s (1000000 of 1000000 found) SIMULATOR CACHE STATISTICS: SimCache LOOKUPs: 1150767 SimCache HITs: 1034535 SimCache HITRATE: 89.90% ``` Reviewers: IslamAbdelRahman, andrewkr, sdong Reviewed By: sdong Subscribers: MarkCallaghan, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57999	2016-05-23 23:35:23 -07:00
sdong	1d725ca51d	Deprecate BlockBasedTableOptions.hash_index_allow_collision=false. Summary: Deprecate this one option and delete code and tests that are now superfluous. Test Plan: all tests pass Reviewers: igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: msalib, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55317	2016-05-20 17:52:27 -07:00
Islam AbdelRahman	1f2dca0eaa	Add MaxOperator to utilities/merge_operators/ Summary: Introduce MaxOperator a simple merge operator that return the max of all operands. This merge operand help me in benchmarking Test Plan: Add new unitttests Reviewers: sdong, andrewkr, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57873	2016-05-19 15:51:29 -07:00
omegaga	3c69f77c67	Move IO failure test to separate file Summary: This is a part of effort to reduce the size of db_test.cc. We move the following tests to a separate file `db_io_failure_test.cc`: * DropWrites * DropWritesFlush * NoSpaceCompactRange * NonWritableFileSystem * ManifestWriteError * PutFailsParanoid Test Plan: Run `make check` to see if the tests are working properly. Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58341	2016-05-18 17:09:20 -07:00
krad	a08c8c851a	Added PersistentCache abstraction Summary: Added a new abstraction to cache page to RocksDB designed for the read cache use. RocksDB current block cache is more of an object cache. For the persistent read cache project, what we need is a page cache equivalent. This changes adds a cache abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache uncompressed pages or raw pages (content as in filesystem). The user can choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode. Blame Rev: Test Plan: Run unit tests Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55707	2016-05-15 22:17:18 -07:00
Dmitri Smirnov	aab91b8d8f	Use generic threadpool for Windows environment (#1120 ) Conditionally retrofit thread_posix for use with std::thread and reuse the same logic. Posix users continue using Posix interfaces. Enable XPRESS compression in test runs. Fix master introduced signed/unsigned mismatch.	2016-05-12 18:34:04 -07:00
Reid Horuff	a657ee9a9c	[rocksdb] Recovery path sequence miscount fix Summary: Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert. [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)] [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)] [4: COMMIT(a)] [7: COMMIT(b)] The first two batches do not consume any sequence numbers so are both prefixed with seq=1. For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL. We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries. This is a valid WAL. Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains. We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b)) So, now upon recovery we must track the actual consumption of sequence numbers. In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct? Test Plan: provided test. Reviewers: sdong Subscribers: andrewkr, leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D57645	2016-05-10 14:06:07 -07:00
Andrew Kryczka	3f16a836a4	Introduce chroot Env Summary: For testing backups, we needed an Env that is fully isolated from other Envs on the same machine. Our in-memory Envs (MockEnv and InMemoryEnv) were insufficient because they don't implement most directory operations. This diff introduces a new Env, "ChrootEnv", that translates paths such that the chroot directory appears to be the root directory. This way, multiple Envs can be isolated in the filesystem by using different chroot directories. Since we use the filesystem, all directory operations are trivially supported. Test Plan: I parameterized the existing EnvPosixTest so it runs tests on ChrootEnv except the ioctl-related cases. Reviewers: sdong, lightmark, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57543	2016-05-06 17:42:50 -07:00
Yi Wu	792762c42c	Split db_test.cc Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase. Test Plan: make check Reviewers: sdong, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong Differential Revision: https://reviews.facebook.net/D56715	2016-04-18 09:42:50 -07:00
Siying Dong	774922c680	Merge pull request #1026 from SherlockNoMad/Hist Histogram Concurrency Improvement and Time-Windowing Support	2016-03-15 11:27:54 -07:00
SherlockNoMad	54f6b9e162	Histogram Concurrency Improvement and Time-Windowing Support	2016-03-11 16:54:25 -08:00
agiardullo	790252805d	Add multithreaded transaction test Summary: Refactored db_bench transaction stress tests so that they can be called from unit tests as well. Test Plan: run new unit test as well as db_bench Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55203	2016-03-11 15:16:52 -08:00
Yi Wu	2568985ab3	IOStatsContext::ToString() add option to exclude zero counters Summary: similar to D52809 add option to exclude zero counters. Test Plan: [yiwu@dev4504.prn1 ~/rocksdb] ./iostats_context_test [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from IOStatsContextTest [ RUN ] IOStatsContextTest.ToString [ OK ] IOStatsContextTest.ToString (0 ms) [----------] 1 test from IOStatsContextTest (0 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (0 ms total) [ PASSED ] 1 test. Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54591	2016-02-23 10:26:24 -08:00
Jonathan Wiepert	7bd284c374	Separeate main from bench functionality to allow cusomizations Summary: Isolate db_bench functionality from main so custom benchmark code can be written and managed Test Plan: Tested commands ./build_tools/regression_build_test.sh ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --reads=500 --writes=500 ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --merge_keys=100 --numdistinct=100 --num_column_families=3 --num_hot_column_families=1 ./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5 Test Results - https://phabricator.fb.com/P56130387 Additional tests for: ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 --key_size=8 --merge_operator=put ./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 --merge_operator=uint64add Results: https://phabricator.fb.com/P56130607 Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53991	2016-02-16 06:17:31 -08:00
Jonathan Wiepert	14a322033f	Remove references to files deleted in commit `abb4052278` Summary: Remove obolete references to files in src.mk Fix incorrect path for reference in source.mk Test Plan: Ran build to ensure changes do not break anything. Reviewers: leveldb, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53733	2016-02-03 16:47:45 -08:00
Islam AbdelRahman	d6c838f1e1	Add SstFileManager (component tracking all SST file in DBs and control the deletion rate) Summary: Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler SstFileTracker can be used later to abort writes when we exceed a specific size Test Plan: unit tests Reviewers: rven, anthony, yhchiang, sdong Reviewed By: sdong Subscribers: igor, lovro, march, dhruba Differential Revision: https://reviews.facebook.net/D50469	2016-01-28 18:35:01 -08:00
Andrew Kryczka	167bd8856d	[directory includes cleanup] Finish removing util->db dependencies	2016-01-26 10:49:24 -08:00
Andrew Kryczka	46f9cd46af	[directory includes cleanup] Move cross-function test points Summary: I split the db-specific test points out into a separate file under db/ directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it from compiling previously; see https://reviews.facebook.net/D36825. Test Plan: compilation works now, below command works, will also run "make xfunc". $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32 Reviewers: sdong, yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53343	2016-01-26 10:49:05 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
Igor Canadi	64fa43843b	Merge pull request #862 from ceph/wip-env implement EnvMirror	2015-12-10 18:45:07 -08:00
Sage Weil	2074ddd625	env: add EnvMirror This is an Env implementation that mirrors all storage-related methods on two different backend Env's and verifies that they return the same results (return status and read results). This is useful for implementing a new Env and verifying its correctness. Signed-off-by: Sage Weil <sage@redhat.com>	2015-12-10 21:32:45 -05:00
Javier González	b2863017b1	Move posix threads into a library Summary: This patch moves all posix thread logic to a separate library. The motivation is to allow another environments to easily reuse posix threads. HDFS wraps already posix threads; this split would simplify this code. Test Plan: No new functionality is added to posix Env or the threading library, thus the current tests should suffice.	2015-12-07 12:03:38 +01:00
Nathan Bronson	78812ec6bf	InlineSkipList - part 1/3 Summary: This diff is 1/3 in a sequence that introduces a skip list optimized for a key that is a freshly-allocated const char*. The diff is broken into pieces to make it easier to review. This piece only introduces the new type by copying the existing SkipList, with mechanical naming changes and reformatting. Test Plan: new unit test Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D51279	2015-11-24 14:30:22 -08:00
sdong	c4ebb66d61	Not to build forward_iterator_bench now Summary: forward_iterator_bench is not stable enough for build. Remove it for now. Test Plan: Build it with both of CLANG and non-CLANG and make sure it builds. Reviewers: rven, kradhakrishnan, anthony, yhchiang, igor, IslamAbdelRahman Reviewed By: igor, IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D50991	2015-11-18 15:42:06 -08:00
Venkatesh Radhakrishnan	7824444bfc	Reuse file iterators in tailing iterator when memtable is flushed Summary: Under a tailing workload, there were increased block cache misses when a memtable was flushed because we were rebuilding iterators in that case since the version set changed. This was exacerbated in the case of iterate_upper_bound, since file iterators which were over the iterate_upper_bound would have been deleted and are now brought back as part of the Rebuild, only to be deleted again. We now renew the iterators and only build iterators for files which are added and delete file iterators for files which are deleted. Refer to https://reviews.facebook.net/D50463 for previous version Test Plan: DBTestTailingIterator.TailingIteratorTrimSeekToNext Reviewers: anthony, IslamAbdelRahman, igor, tnovak, yhchiang, sdong Reviewed By: sdong Subscribers: yhchiang, march, dhruba, leveldb, lovro Differential Revision: https://reviews.facebook.net/D50679	2015-11-13 15:50:59 -08:00
Yueh-Hsuan Chiang	e11f676e34	Add OptionsUtil::LoadOptionsFromFile() API Summary: This patch adds OptionsUtil::LoadOptionsFromFile() and OptionsUtil::LoadLatestOptionsFromDB(), which allow developers to construct DBOptions and ColumnFamilyOptions from a RocksDB options file. Note that most pointer-typed options such as merge_operator will not be constructed. With this API, developers no longer need to remember all the options in order to reopen an existing rocksdb instance like the following: DBOptions db_options; std::vector<std::string> cf_names; std::vector<ColumnFamilyOptions> cf_opts; // Load primitive-typed options from an existing DB OptionsUtil::LoadLatestOptionsFromDB( dbname, &db_options, &cf_names, &cf_opts); // Initialize necessary pointer-typed options cf_opts[0].merge_operator.reset(new MyMergeOperator()); ... // Construct the vector of ColumnFamilyDescriptor std::vector<ColumnFamilyDescriptor> cf_descs; for (size_t i = 0; i < cf_opts.size(); ++i) { cf_descs.emplace_back(cf_names[i], cf_opts[i]); } // Open the DB DB* db = nullptr; std::vector<ColumnFamilyHandle*> cf_handles; auto s = DB::Open(db_options, dbname, cf_descs, &handles, &db); Test Plan: Augment existing tests in column_family_test options_test db_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49095	2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang	e114f0abb8	Enable RocksDB to persist Options file. Summary: This patch allows rocksdb to persist options into a file on DB::Open, SetOptions, and Create / Drop ColumnFamily. Options files are created under the same directory as the rocksdb instance. In addition, this patch also adds a fail_if_missing_options_file in DBOptions that makes any function call return non-ok status when it is not able to persist options properly. // If true, then DB::Open / CreateColumnFamily / DropColumnFamily // / SetOptions will fail if options file is not detected or properly // persisted. // // DEFAULT: false bool fail_if_missing_options_file; Options file names are formatted as OPTIONS-<number>, and RocksDB will always keep the latest two options files. Test Plan: Add options_file_test. options_test column_family_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48285	2015-11-10 22:58:01 -08:00
Nathan Bronson	b81b430987	Switch to thread-local random for skiplist Summary: Using a TLS random instance for skiplist makes it smaller (useful for hash_skiplist_rep) and prepares skiplist for concurrent adds. This diff also modifies the branching factor math to avoid an unnecessary division. This diff has the effect of changing the sequence of skip list node height choices made by tests, so it has the potential to cause unit test failures for tests that implicitly rely on the exact structure of the skip list. Tests that try to exactly trigger a compaction are likely suspects for this problem (these tests have always been brittle to changes in the skiplist details). I've minimizes this risk by reseeding the main thread's Random at the beginning of each test, increasing the universal compaction size_ratio limit from 101% to 105% for some tests, and verifying that the tests pass many times. Test Plan: for i in `seq 0 9`; do make check; done Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50439	2015-11-09 19:25:22 -08:00
Yueh-Hsuan Chiang	183cadfc87	Add OptionsSanityCheckLevel Summary: This patch introduces OptionsSanityCheckLevel internally to enable sanity check rocksdb options. Utilities API will be added in the follow-up diffs. Test Plan: Added more tests in options_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49515	2015-11-04 18:53:30 -08:00
Yueh-Hsuan Chiang	7d7ee2b654	Add Memory Insight support to utilities Summary: This patch introduces utilities/memory, which currently includes GetApproximateMemoryUsageByType that reports different types of rocksdb memory usage given a list of input DBs. The API also take care of the case where Cache could be shared across multiple column families / multiple db instances. Currently, it reports memory usage of memtable, table-readers and cache. Test Plan: utilities/memory/memory_test.cc Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49257	2015-11-03 17:52:17 -08:00
Javier González	6e6dd5f6f9	Split posix storage backend into Env and library Summary: This patch splits the posix storage backend into Env and the actual *File implementations. The motivation is to allow other Envs to use posix as a library. This enables a storage backend different from posix to split its secondary storage between a normal file system partition managed by posix, and it own media. Test Plan: No new functionality is added to posix Env or the library, thus the current tests should suffice.	2015-10-22 17:31:31 +02:00
Alexey Maykov	e1a09a7703	Implementation for GetPropertiesOfTablesInRange Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB. Test Plan: ran the GetPropertiesOfTablesInRange Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48651	2015-10-17 13:34:43 -07:00
Venkatesh Radhakrishnan	a98fbacfa0	Moving memtable related files from util to a new directory memtable Summary: We are cleaning up dependencies. This diff takes a first step at moving memtable files to their own directory called memtable. In future diffs, we will move other memtable files from db to memtable. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48915	2015-10-16 14:10:33 -07:00
Venkatesh Radhakrishnan	63e507c59c	Move ldb and sst_dump from utils to tools. Summary: As part of cleaning up dependencies for tech debt week, we are moving ldb and sst_dump tools from util to tools, since they are tools. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48747	2015-10-14 17:08:28 -07:00
Venkatesh Radhakrishnan	e587dbe03a	Move manual_compaction_test.cc from util to db Summary: manual_compaction_test.cc incorrectly in util. Moved to db. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48687	2015-10-14 11:06:27 -07:00
Venkatesh Radhakrishnan	f1fdf5205b	Clean up dependency: Move db_test_util.* to db directory Summary: As part of tech debt week, we are cleaning up dependencies. This diff moves db_test_util.[h,cc] from util to db directory. Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48543	2015-10-12 13:05:42 -07:00
Islam AbdelRahman	9babaeed16	Update dump_tool and undump_tool to accept Options Summary: Refactor dump_tool and undump_tool so that it's possible to use them with customized options for example setting a specific comparator similar to what Dragon is doing with the LdbTool https://phabricator.fb.com/diffusion/FBCODE/browse/master/dragon/tools/Ldb.cpp Test Plan: compiles used it to dump / undump a dragon shard Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba, adsharma Differential Revision: https://reviews.facebook.net/D47853	2015-10-05 19:49:48 -07:00
Evan Shaw	7a23e4d8ca	New amalgamation target This commit adds two new targets to the Makefile: rocksdb.cc and rocksdb.h These files, when combined with the c.h header, are a self-contained RocksDB source distribution called an amalgamation. (The name comes from SQLite's, which is similar in concept.) The main benefit of an amalgamation is that it's very easy to drop into a new project. It also compiles faster compared to compiling individual source files and potentially gives the compiler more opportunity to make optimizations since it can see all functions at once. rocksdb.cc and rocksdb.h are generated by a new script, amalgamate.py. A detailed description of how amalgamate.py works is in a comment at the top of the file. There are also some small changes to existing files to enable the amalgamation: * Use quotes for includes in unity build * Fix an old header inclusion in util/xfunc.cc * Move some includes outside ifdef in util/env_hdfs.cc * Separate out tool sources in Makefile so they won't be included in unity.cc * Unity build now produces a static library Closes #733	2015-10-01 08:29:31 +13:00

1 2 3 4

197 commits