rocksdb

Commit Graph

Author	SHA1	Message	Date
Yi Wu	5f5fddabc7	port folly::JemallocNodumpAllocator (#4534 ) Summary: Introduce `JemallocNodumpAllocator`, which allow exclusion of block cache usage from core dump. It utilize custom hook of jemalloc arena, and when jemalloc arena request memory from system, the allocator use the hook to set `MADV_DONTDUMP ` to the memory. The implementation is basically the same as `folly::JemallocNodumpAllocator`, except for some minor difference: 1. It only support jemalloc >= 5.0 2. When the allocator destruct, it explicitly destruct the corresponding arena via `arena.<i>.destroy` via `mallctl`. Depending on #4502. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4534 Differential Revision: D10435474 Pulled By: yiwu-arbug fbshipit-source-id: e80edea755d3853182485d2be710376384ce0bb4	2018-10-26 17:29:18 -07:00
Abhishek Madan	8c78348c77	Use only "local" range tombstones during Get (#4449 ) Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key from higher levels, and compute the highest covering sequence number at the current level. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path. In the benchmark results, the following command was used to initialize the database: ``` ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8 ``` ...and the following command was used to measure read throughput: ``` ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32 ``` The filluniquerandom command was only run once, and the resulting database was used to measure read performance before and after the PR. Both binaries were compiled with `DEBUG_LEVEL=0`. Readrandom results before PR: ``` readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found) ``` Readrandom results after PR: ``` readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found) ``` So it's actually slower right now, but this PR paves the way for future optimizations (see #4493). ---- Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449 Differential Revision: D10370575 Pulled By: abhimadan fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d	2018-10-24 12:31:12 -07:00
Jigar Bhati	a4d9aa6b18	Plumb WriteBufferManager through JNI (#4492 ) Summary: Allow rocks java to explicitly create WriteBufferManager by plumbing it to the native code through JNI. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4492 Differential Revision: D10428506 Pulled By: sagar0 fbshipit-source-id: cd9dd8c2ef745a0303416b44e2080547bdcca1fd	2018-10-17 11:49:57 -07:00
Ben Clay	c9048021ad	RocksJava: memory_util support (#4446 ) Summary: JNI passthrough for utilities/memory/memory_util.cc sagar0 adamretter Pull Request resolved: https://github.com/facebook/rocksdb/pull/4446 Differential Revision: D10174578 Pulled By: sagar0 fbshipit-source-id: d1d196d771dff22afb7ef7500f308233675696f8	2018-10-08 11:05:27 -07:00
Yi Wu	d6f2ecf49c	Utility to run task periodically in a thread (#4423 ) Summary: Introduce `RepeatableThread` utility to run task periodically in a separate thread. It is basically the same as the the same class in fbcode, and in addition provide a helper method to let tests mock time and trigger execution one at a time. We can use this class to replace `TimerQueue` in #4382 and `BlobDB`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4423 Differential Revision: D10020932 Pulled By: yiwu-arbug fbshipit-source-id: 3616bef108c39a33c92eedb1256de424b7c04087	2018-09-27 15:28:00 -07:00
Abhishek Madan	1626f6ab6b	Add RangeDelAggregator microbenchmarks (#4363 ) Summary: To measure the results of upcoming DeleteRange v2 work, this commit adds simple benchmarks for RangeDelAggregator. It measures the average time for AddTombstones and ShouldDelete calls. Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results: Before #4014: ``` ======================= Results: ======================= AddTombstones: 1356.28 us ShouldDelete: 0.401732 us ``` Latest master: ``` ======================= Results: ======================= AddTombstones: 740.82 us ShouldDelete: 0.383271 us ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363 Differential Revision: D9881676 Pulled By: abhimadan fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442	2018-09-17 14:58:31 -07:00
Sagar Vemuri	f46dd5cbeb	Remove trace_analyzer_tool from LIB_SOURCES (#4331 ) Summary: trace_analyzer_tool should only be in ANALYZER_LIB_SOURCES and not in LIB_SOURCES. This fixes java_test travis build failures seen in jtest. Blame: `a6d3de4e7a` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4331 Differential Revision: D9560377 Pulled By: sagar0 fbshipit-source-id: 6b9636201a920b56ee0f61e367fee5d3dca692b0	2018-08-29 21:28:40 -07:00
Yi Wu	a6d3de4e7a	BlobDB: Implement DisableFileDeletions (#4314 ) Summary: `DB::DiableFileDeletions` and `DB::EnableFileDeletions` are used for applications to stop RocksDB background jobs to delete files while they are doing replication. Implement these methods for BlobDB. `DeleteObsolteFiles` now needs to check `disable_file_deletions_` before starting, and will hold `delete_file_mutex_` the whole time while it is running. `DisableFileDeletions` needs to wait on `delete_file_mutex_` for running `DeleteObsolteFiles` job and set `disable_file_deletions_` flag. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4314 Differential Revision: D9501373 Pulled By: yiwu-arbug fbshipit-source-id: 81064c1228f1724eff46da22b50ff765b16292cd	2018-08-27 10:58:29 -07:00
Zhichao Cao	9e2d5ab6bf	Adjusted the Makefile of trace_analyzer to isolate the Gflags from other (#4290 ) Summary: Previously, the trace_analyzer_tool will be complied with other libobjects, which let the GFLAGS of trace_analyzer appear in other tools (e.g., db_bench, rocksdb_dump, and etc.). When using '--help', the help information of trace_analyzer will appear in other tool help information, which will cause confusion issues. Currently, trace_analyzer_tool is built and used only by trace_analyzer and trace_analyzer_test to avoid the issues. Tested with make asan_check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4290 Differential Revision: D9413163 Pulled By: zhichao-cao fbshipit-source-id: ed5d20c4575a53ca15ff62a2ffe601d5cf278cc4	2018-08-21 10:47:24 -07:00
Christian Esken	c7cf981a85	Add CompactRangeOptions for Java (#4220 ) Summary: Closes https://github.com/facebook/rocksdb/issues/4195 CompactRangeOptions are available the CPP API, but not in the Java API. This PR adds CompactRangeOptions to the Java API and adds an overloaded compactRange() method. See https://github.com/facebook/rocksdb/issues/4195 for the original discussion. This change supports all fields of CompactRangeOptions, including the required enum converters in the JNI portal. Significant changes: - Make CompactRangeOptions available in the compactRange() for Java. - Deprecate other compactRange() methods that have individual option params, like in the CPP code. - Migrate rocksdb_compactrange_helper() to CompactRangeOptions. - Add Java unit tests for CompactRangeOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4220 Differential Revision: D9380007 Pulled By: sagar0 fbshipit-source-id: 6af6c334f221427f1997b33fb24c3986b092fed6	2018-08-17 10:57:25 -07:00
Fenggang Wu	19ec44fd39	Improve point-lookup performance using a data block hash index (#4174 ) Summary: Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.` The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174 Differential Revision: D8965914 Pulled By: fgwu fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198	2018-08-15 14:30:03 -07:00
Zhichao Cao	999d955e4f	RocksDB Trace Analyzer (#4091 ) Summary: A framework of trace analyzing for RocksDB After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload. Input: 1. trace file 2. Whole keys space file Statistics: 1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family. 2. Key hotness (access count) of each one 3. Key space separation based on given prefix 4. Key size distribution 5. Value size distribution if appliable 6. Top K accessed keys 7. QPS statistics including the average QPS and peak QPS 8. Top K accessed prefix 9. The query correlation analyzing, output the number of X after Y and the corresponding average time intervals Output: 1. key access heat map (either in the accessed key space or whole key space) 2. trace sequence file (interpret the raw trace file to line base text file for future use) 3. Time serial (The key space ID and its access time) 4. Key access count distritbution 5. Key size distribution 6. Value size distribution (in each intervals) 7. whole key space separation by the prefix 8. Accessed key space separation by the prefix 9. QPS of each operation and each column family 10. Top K QPS and their accessed prefix range Test: 1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge 2. Generated the trace and analyze the trace Implemented but not tested (due to the limitation of trace_replay): 1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing 2. Analyzing the number of Key found by Get Future Work: 1. Support execution time analyzing of each requests 2. Support cache hit situation and block read situation of Get Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091 Differential Revision: D9256157 Pulled By: zhichao-cao fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6	2018-08-13 11:44:02 -07:00
Yi Wu	140f256da2	BlobDB: Cleanup TTLExtractor interface (#4229 ) Summary: Cleanup TTLExtractor interface. The original purpose of it is to allow our users keep using existing `Write()` interface but allow it to accept TTL via `TTLExtractor`. However the interface is confusing. Will replace it with something like `WriteWithTTL(batch, ttl)` in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4229 Differential Revision: D9174390 Pulled By: yiwu-arbug fbshipit-source-id: 68201703d784408b851336ab4dd9b84188245b2d	2018-08-06 11:58:05 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Fenggang Wu	8805ec2f49	DataBlockHashIndex: Standalone Implementation with Unit Test (#4139 ) Summary: The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested. `DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called. `DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139 Reviewed By: sagar0 Differential Revision: D8866764 Pulled By: fgwu fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10	2018-07-24 11:43:37 -07:00
Chang Su	374c37da5b	move static msgs out of Status class (#4144 ) Summary: The member msgs of class Status contains all types of status messages. When users dump a Status object, msgs will confuse users. So move it out of class Status by making it as file-local static variable. Closes #3831 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/4144 Differential Revision: D8941419 Pulled By: sagar0 fbshipit-source-id: 56b0510258465ff26db15aa6b04e01532e053e3d	2018-07-23 15:44:16 -07:00
Siying Dong	ddc07b40fc	Remove managed iterator Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124 Differential Revision: D8829910 Pulled By: siying fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560	2018-07-17 14:43:18 -07:00
Anand Ananthabhotla	52d4c9b7f6	Allow DB resume after background errors (#3997 ) Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd	2018-06-28 12:34:40 -07:00
Dmitri Smirnov	f4b72d7056	Provide a way to override windows memory allocator with jemalloc for ZSTD Summary: Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably. For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance. The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression. Closes https://github.com/facebook/rocksdb/pull/3838 Differential Revision: D8229794 Pulled By: miasantreble fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d	2018-06-04 12:12:48 -07:00
Manuel Ung	01e3c30def	Extend existing unit tests to run with WriteUnprepared as well Summary: As titled. I have not extended the Compatibility tests because the new WAL markers are still unimplemented. Closes https://github.com/facebook/rocksdb/pull/3941 Differential Revision: D8238394 Pulled By: lth fbshipit-source-id: 980e3d44837bbf2cfa64047f9738f559dfac4b1d	2018-06-01 14:58:41 -07:00
Manuel Ung	aaac6cd16f	Add write unprepared classes by inheriting from write prepared Summary: Closes https://github.com/facebook/rocksdb/pull/3907 Differential Revision: D8218325 Pulled By: lth fbshipit-source-id: ff32d8dab4a159cd2762876cba4b15e3dc51ff3b	2018-05-31 10:47:42 -07:00
Andrew Kryczka	e410501eeb	Add missing test files to src.mk Summary: We only generate the header dependency (".cc.d") files for files mentioned in "src.mk". When we don't generate them, changes to header dependencies do not cause `make` to recompile the dependent ".o". Then it takes a while for developers (or maybe just me) to realize `make clean` is necessary. Closes https://github.com/facebook/rocksdb/pull/3876 Differential Revision: D8065389 Pulled By: ajkr fbshipit-source-id: 0f62eee7bcab15b0215791564e6ab3775d46996b	2018-05-21 09:43:29 -07:00
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	2018-05-17 02:56:56 -07:00
Siying Dong	d59549298f	Skip deleted WALs during recovery Summary: This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction) This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2. Closes https://github.com/facebook/rocksdb/pull/3765 Differential Revision: D7747618 Pulled By: siying fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729	2018-05-03 15:43:09 -07:00
Adam Retter	ca87aef82d	Added support for SstFileManager to RocksJava Summary: Closes https://github.com/facebook/rocksdb/pull/3666 Differential Revision: D7457634 Pulled By: sagar0 fbshipit-source-id: 47741e2ee66e9255c580f4e38cfb86b284c27c2f	2018-04-06 21:26:32 -07:00
Yanqin Jin	1f5def1653	Fix race condition causing double deletion of ssts Summary: Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files, but this PR targets sst. Also add a unit test to verify that this PR fixes the issue. The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows ``` timeline user_thread background_compaction thread t1 \| FindObsoleteFiles(full_scan=false) t2 \| FindObsoleteFiles(full_scan=true) t3 \| PurgeObsoleteFiles t4 \| PurgeObsoleteFiles V ``` When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`. To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files. ajkr could you take a look and comment? Thanks! Closes https://github.com/facebook/rocksdb/pull/3638 Differential Revision: D7384372 Pulled By: riversand963 fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3	2018-03-28 10:29:59 -07:00
Dmitri Smirnov	53d66df0c4	Refactor sync_point to make implementation either customizable or replaceable Summary: Closes https://github.com/facebook/rocksdb/pull/3637 Differential Revision: D7354373 Pulled By: ajkr fbshipit-source-id: 6816c7bbc192ed0fb944942b11c7074bf24eddf1	2018-03-23 12:56:52 -07:00
Adam Retter	c5302a8a58	Java wrapper for Native Comparators Summary: This is an abstraction for working with custom Comparators implemented in native C++ code from Java. Native code must directly extend `rocksdb::Comparator`. When the native code comparator is compiled into the RocksDB codebase, you can then create a Java Class, and JNI stub to wrap it. Useful if the C++/JNI barrier overhead is too much for your applications comparator performance. An example is provided in `java/rocksjni/native_comparator_wrapper_test.cc` and `java/src/main/java/org/rocksdb/NativeComparatorWrapperTest.java`. Closes https://github.com/facebook/rocksdb/pull/3334 Differential Revision: D7172605 Pulled By: miasantreble fbshipit-source-id: e24b7eb267a3bcb6afa214e0379a1d5e8a2ceabe	2018-03-08 11:27:42 -08:00
Yi Wu	b864bc9b5b	Blob DB: Improve FIFO eviction Summary: Improving blob db FIFO eviction with the following changes, * Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size. * FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion. * FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files. * If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO. * Compaction filter also filter those blob indexes where corresponding blob file is gone. * Add an event listener to listen compaction/flush event and update sst file size. * Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db. * More blob db statistics around FIFO. * Fix some locking issue when accessing a blob file. Closes https://github.com/facebook/rocksdb/pull/3556 Differential Revision: D7139328 Pulled By: yiwu-arbug fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa	2018-03-06 11:57:42 -08:00
Pooya Shareghi	0a2354ca8f	Added bytes XOR merge operator Summary: Closes https://github.com/facebook/rocksdb/pull/575 I fixed the merge conflicts etc. Closes https://github.com/facebook/rocksdb/pull/3065 Differential Revision: D7128233 Pulled By: sagar0 fbshipit-source-id: 2c23a48c9f0432c290b0cd16a12fb691bb37820c	2018-03-06 10:27:36 -08:00
Adam Retter	2ac988c67e	Add TransactionDB and OptimisticTransactionDB to the Java API Summary: Closes https://github.com/facebook/rocksdb/issues/697 Closes https://github.com/facebook/rocksdb/issues/1151 Closes https://github.com/facebook/rocksdb/pull/1298 Differential Revision: D7131402 Pulled By: sagar0 fbshipit-source-id: bcd34ce95ed88cc641786089ff4232df7b2f089f	2018-03-02 10:34:13 -08:00
Siying Dong	2f1a3a4d74	Refactor ReadBlockContents() Summary: Divide ReadBlockContents() to multiple sub-functions. Maintaining the input and intermediate data in a new class BlockFetcher. I hope in general it makes the code easier to maintain. Another motivation to do it is to clearly divide the logic before file reading and after file reading. The refactor will help us evaluate how can we make I/O async in the future. Closes https://github.com/facebook/rocksdb/pull/3244 Differential Revision: D6520983 Pulled By: siying fbshipit-source-id: 338d90bc0338472d46be7a7682028dc9114b12e9	2017-12-11 15:27:32 -08:00
Yi Wu	dd49f89466	Fix TARGETS lint warnings. Summary: Fix buckifier script and regenerate TARGETS file with no lint warnings. Closes https://github.com/facebook/rocksdb/pull/3170 Differential Revision: D6328993 Pulled By: yiwu-arbug fbshipit-source-id: 17d0e4ed92f676f35fed76659386611cc72b00b2	2017-11-15 14:28:34 -08:00
Yi Wu	42564ada53	Blob DB: not using PinnableSlice move assignment Summary: The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky. Closes https://github.com/facebook/rocksdb/pull/3164 Differential Revision: D6319201 Pulled By: yiwu-arbug fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48	2017-11-13 18:12:20 -08:00
Maysam Yabandeh	60d83df23d	WritePrepared Txn: Move DB class to its own file Summary: Move WritePreparedTxnDB from pessimistic_transaction_db.h to its own header, write_prepared_txn_db.h Closes https://github.com/facebook/rocksdb/pull/3114 Differential Revision: D6220987 Pulled By: maysamyabandeh fbshipit-source-id: 18893fb4fdc6b809fe117dabb544080f9b4a301b	2017-11-02 11:14:30 -07:00
Yi Wu	31d3e41810	PinnableSlice move assignment Summary: Allow `std::move(pinnable_slice)`. Closes https://github.com/facebook/rocksdb/pull/2997 Differential Revision: D6036782 Pulled By: yiwu-arbug fbshipit-source-id: 583fb0419a97e437ff530f4305822341cd3381fa	2017-10-12 18:28:24 -07:00
Adam Retter	560e984995	Added CompactionFilterFactory support to RocksJava Summary: This PR also includes some cleanup, bugfixes and refactoring of the Java API. However these are really pre-cursors on the road to CompactionFilterFactory support. Closes https://github.com/facebook/rocksdb/pull/1241 Differential Revision: D6012778 Pulled By: sagar0 fbshipit-source-id: 0774465940ee99001a78906e4fed4ef57068ad5c	2017-10-12 11:12:16 -07:00
Yi Wu	d1b74b0c82	WritePrepared Txn: Compaction/Flush Summary: Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator: * check if key has been committed. If not, output uncommitted keys AS-IS. * use SnapshotChecker to check if key is visible to a snapshot when in need. * do not output key with seq = 0 if the key is not committed. Closes https://github.com/facebook/rocksdb/pull/2926 Differential Revision: D5902907 Pulled By: yiwu-arbug fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320	2017-10-06 10:41:53 -07:00
Sagar Vemuri	c8f3606731	Expose LoadLatestOptions, LoadOptionsFromFile and GetLatestOptionsFileName APIs in RocksJava Summary: JNI wrappers for LoadLatestOptions, LoadOptionsFromFile and GetLatestOptionsFileName APIs. Closes https://github.com/facebook/rocksdb/pull/2898 Differential Revision: D5857934 Pulled By: sagar0 fbshipit-source-id: 68b79e83eab8de9416e3f1fef73e11cf7947e90a	2017-09-21 17:29:13 -07:00
Kamalalochana Subbaiah	e612e31740	Updated CRC32 Power Optimization Changes Summary: Support for PowerPC Architecture Detecting AltiVec Support Closes https://github.com/facebook/rocksdb/pull/2716 Differential Revision: D5606836 Pulled By: siying fbshipit-source-id: 720262453b1546e5fdbbc668eff56848164113f3	2017-08-31 14:16:30 -07:00
Pengchao Wang	825a22c00c	garbage collect tombstones in merge operator Summary: Remove cassandra tombstone when reaching the max compaction level (full merge). if all columns collected key will be removed in next compaction via compaction filter Closes https://github.com/facebook/rocksdb/pull/2791 Reviewed By: sagar0 Differential Revision: D5722465 Pulled By: wpc fbshipit-source-id: 61e9898a5686551653a16383255aeaab3197e65e	2017-08-31 10:11:54 -07:00
Maysam Yabandeh	26ac24f199	Add more unit test to write_prepared txns Summary: Closes https://github.com/facebook/rocksdb/pull/2798 Differential Revision: D5724173 Pulled By: maysamyabandeh fbshipit-source-id: fb6b782d933fb4be315b1a231a6a67a66fdc9c96	2017-08-31 09:41:27 -07:00
Maysam Yabandeh	bdc056f8aa	Refactor PessimisticTransaction Summary: This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl. Closes https://github.com/facebook/rocksdb/pull/2691 Differential Revision: D5569464 Pulled By: maysamyabandeh fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325	2017-08-07 16:12:29 -07:00
Maysam Yabandeh	c9804e007a	Refactor TransactionDBImpl Summary: This opens space for the new implementations of TransactionDBImpl such as WritePreparedTxnDBImpl that has a different policy of how to write to DB. Closes https://github.com/facebook/rocksdb/pull/2689 Differential Revision: D5568918 Pulled By: maysamyabandeh fbshipit-source-id: f7eac866e175daf3793ae79da108f65cc7dc7b25	2017-08-05 17:26:15 -07:00
Maysam Yabandeh	c3d5c4d38a	Refactor TransactionImpl Summary: This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation. Closes https://github.com/facebook/rocksdb/pull/2676 Differential Revision: D5549998 Pulled By: maysamyabandeh fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec	2017-08-03 08:57:22 -07:00
Yi Wu	1900771bd2	Dump Blob DB options to info log Summary: * Dump blob db options to info log * Remove BlobDBOptionsImpl to disallow dynamic cast BlobDBOptions into BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645 * Change some of the default options * Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon. Closes https://github.com/facebook/rocksdb/pull/2671 Differential Revision: D5529912 Pulled By: yiwu-arbug fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7	2017-08-01 13:01:47 -07:00
Yi Wu	6083bc79f8	Blob DB TTL extractor Summary: Introducing blob_db::TTLExtractor to replace extract_ttl_fn. The TTL extractor can be use to extract TTL from keys insert with Put or WriteBatch. Change over existing extract_ttl_fn are: * If value is changed, it will be return via std::string* (rather than Slice). With Slice the new value has to be part of the existing value. With std::string* the limitation is removed. * It can optionally return TTL or expiration. Other changes in this PR: * replace `std::chrono::system_clock` with `Env::NowMicros` so that I can mock time in tests. * add several TTL tests. * other minor naming change. Closes https://github.com/facebook/rocksdb/pull/2659 Differential Revision: D5512627 Pulled By: yiwu-arbug fbshipit-source-id: 0dfcb00d74d060b8534c6130c808e4d5d0a54440	2017-07-27 23:26:04 -07:00
Siying Dong	c281b44829	Revert "CRC32 Power Optimization Changes" Summary: This reverts commit `2289d38115`. Closes https://github.com/facebook/rocksdb/pull/2652 Differential Revision: D5506163 Pulled By: siying fbshipit-source-id: 105e31dd9d99090453a6b9f32c165206cd3affa3	2017-07-26 19:31:36 -07:00
Kamalalochana Subbaiah	2289d38115	CRC32 Power Optimization Changes Summary: Support for PowerPC Architecture Detecting AltiVec Support Closes https://github.com/facebook/rocksdb/pull/2353 Differential Revision: D5210948 Pulled By: siying fbshipit-source-id: 859a8c063d37697addd89ba2b8a14e5efd5d24bf	2017-07-26 09:42:29 -07:00
Pengchao Wang	534c255c7a	Cassandra compaction filter for purge expired columns and rows Summary: Major changes in this PR: * Implement CassandraCompactionFilter to remove expired columns and rows (if all column expired) * Move cassandra related code from utilities/merge_operators/cassandra to utilities/cassandra/* * Switch to use shared_ptr<> from uniqu_ptr for Column membership management in RowValue. Since columns do have multiple owners in Merge and GC process, use shared_ptr helps make RowValue immutable. * Rename cassandra_merge_test to cassandra_functional_test and add two TTL compaction related tests there. Closes https://github.com/facebook/rocksdb/pull/2588 Differential Revision: D5430010 Pulled By: wpc fbshipit-source-id: 9566c21e06de17491d486a68c70f52d501f27687	2017-07-21 14:57:44 -07:00
Adam Retter	000bf0af38	Improve the design and native object management of Stats in RocksJava Summary: Closes https://github.com/facebook/rocksdb/pull/2551 Differential Revision: D5399288 Pulled By: sagar0 fbshipit-source-id: dd3df2ed6cc5ae612db0998ea746cc29fccf568e	2017-07-11 23:31:00 -07:00
Aaron Gao	a43c053ad8	remove duplicated utilities/merge_operators/cassandra/test_utils.cc in src.mk Summary: target `utilities/merge_operators/cassandra/test_utils.d' given more than once in the same rule remove the duplicate one Closes https://github.com/facebook/rocksdb/pull/2542 Differential Revision: D5379570 Pulled By: lightmark fbshipit-source-id: 353f38d085e9f627c1b3c53acef7a2c4bc71014d	2017-07-06 17:28:57 -07:00
Yi Wu	982cec22af	Fix TARGETS file tests list Summary: 1. The buckifier script assume each test "foo" comes with a .cc file of the same name (i.e. foo.cc). Update cassandra tests to follow this pattern so that the buckifier script can recognize them. 2. add blob_db_test Closes https://github.com/facebook/rocksdb/pull/2506 Differential Revision: D5331517 Pulled By: yiwu-arbug fbshipit-source-id: 86f3eba471fc621186ab44cbd073b6162cde8e57	2017-06-27 14:12:02 -07:00
Ewout Prangsma	51778612c9	Encryption at rest support Summary: This PR adds support for encrypting data stored by RocksDB when written to disk. It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files. The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done. Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!). The Counter operation mode uses an initial counter & random initialization vector (IV). Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize). The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there. To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests. Typically you would run it like this: ``` ENCRYPTED_ENV=1 make check_some ``` There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings. Closes https://github.com/facebook/rocksdb/pull/2424 Differential Revision: D5322178 Pulled By: sdwilsh fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f	2017-06-26 16:56:24 -07:00
Chen Shen	cbd825deea	Create a MergeOperator for Cassandra Row Value Summary: This PR implements the MergeOperator for Cassandra Row Values. Closes https://github.com/facebook/rocksdb/pull/2289 Differential Revision: D5055464 Pulled By: scv119 fbshipit-source-id: 45f276ef8cbc4704279202f6a20c64889bc1adef	2017-06-16 14:27:00 -07:00
Siying Dong	95b0e89b5d	Improve write buffer manager (and allow the size to be tracked in block cache) Summary: Improve write buffer manager in several ways: 1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower. 2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit. 3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable. Closes https://github.com/facebook/rocksdb/pull/2350 Differential Revision: D5110648 Pulled By: siying fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017	2017-06-02 14:26:56 -07:00
Maysam Yabandeh	5a9b4d7435	Retire memenv https://github.com/facebook/rocksdb/pull/2082 Summary: This is a manual commit of this PR: Retire InMemoryEnv in favor of MockEnv #2082 With MockEnv doing the same yet being more mature, InMemoryEnv is redundant. Reviewed By: IslamAbdelRahman Differential Revision: D5162323 fbshipit-source-id: 59fd0082a891dc99cc531e4da9d68bf891eae3f5	2017-06-01 15:41:20 -07:00
Tamir Duberstein	0dc3040d54	db: avoid `#include`ing malloc and jemalloc simultaneously Summary: This fixes a compilation failure on Linux when the system libc is not glibc. jemalloc's configure script incorrectly assumes that glibc is always used on Linux systems, producing glibc-style signatures; when the system libc is e.g. musl, the following error is observed: ``` [ 0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0, from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77: /x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void)' has a different exception specifier size_t malloc_usable_size(void ); ^~~~~~~~~~~~~~~~~~ In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0: /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()' # define je_malloc_usable_size malloc_usable_size ^ /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size' JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size( ^~~~~~~~~~~~~~~~~~~~~ CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed ``` This works around the issue by rearranging the sources such that jemalloc's headers are never in the same scope as the system's malloc header. The jemalloc issue has been reported as well, see: https://github.com/jemalloc/jemalloc/issues/778. cc tschottdorf Closes https://github.com/facebook/rocksdb/pull/2188 Differential Revision: D5163048 Pulled By: siying fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d	2017-05-31 22:43:02 -07:00
Yi Wu	ad19eb8686	Fixing blob db sequence number handling Summary: Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch. Stacking on #2375. Closes https://github.com/facebook/rocksdb/pull/2385 Differential Revision: D5148358 Pulled By: yiwu-arbug fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33	2017-05-31 10:56:45 -07:00
Yi Wu	345878a7fb	update blob_db_test Summary: Re-enable blob_db_test with some update: * Commented out delay at the end of GC tests. Will update the logic later with sync point to properly trigger GC. * Added some helper functions. Also update make files to include blob_dump tool. Closes https://github.com/facebook/rocksdb/pull/2375 Differential Revision: D5133793 Pulled By: yiwu-arbug fbshipit-source-id: 95470b26d0c1f9592ba4b7637e027fdd263f425c	2017-05-30 22:26:13 -07:00
Yi Wu	578fb0b1dc	Simple blob file dumper Summary: A simple blob file dumper. Closes https://github.com/facebook/rocksdb/pull/2242 Differential Revision: D5097553 Pulled By: yiwu-arbug fbshipit-source-id: c6e00d949fcd3658f9f68da9352f06339fac418d	2017-05-23 10:42:59 -07:00
Adam Retter	88c818e437	Replace deprecated RocksDB#addFile with RocksDB#ingestExternalFile Summary: Previously the Java implementation of `RocksDB#addFile` was both incomplete and not inline with the C++ API. Rather than fix it, as I see that `rocksdb::DB::AddFile` is now deprecated in favour of `rocksdb::DB::IngestExternalFile`, I have removed the old broken implementation and implemented `RocksDB#ingestExternalFile`. Closes https://github.com/facebook/rocksdb/issues/2261 Closes https://github.com/facebook/rocksdb/pull/2291 Differential Revision: D5061264 Pulled By: sagar0 fbshipit-source-id: 85df0899fa1b1fc3535175cac4f52353511d4104	2017-05-22 10:27:23 -07:00
Yi Wu	86d5492530	Fix build error with blob DB. Summary: snprintf is in <stdio.h> and not in namespace std. Closes https://github.com/facebook/rocksdb/pull/2287 Reviewed By: anirbanr-fb Differential Revision: D5054752 Pulled By: yiwu-arbug fbshipit-source-id: 356807ec38f3c7d95951cdb41f31a3d3ae0714d4	2017-05-15 14:05:46 -07:00
Andrew Kryczka	3fa9a39c68	Add GetAllKeyVersions API Summary: - Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex - Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type. - Migrated the "ldb idump" subcommand to use this function - The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(. Closes https://github.com/facebook/rocksdb/pull/2232 Differential Revision: D4976007 Pulled By: ajkr fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e	2017-05-12 15:54:06 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
Andrew Kryczka	f6a27d0bce	Extract statistics tests into separate file Summary: I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there. Closes https://github.com/facebook/rocksdb/pull/2211 Differential Revision: D4951558 Pulled By: ajkr fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4	2017-04-26 14:47:23 -07:00
Andrew Kryczka	e5e545a021	Reunite checkpoint and backup core logic Summary: These code paths forked when checkpoint was introduced by copy/pasting the core backup logic. Over time they diverged and bug fixes were sometimes applied to one but not the other (like fix to include all relevant WALs for 2PC), or it required extra effort to fix both (like fix to forge CURRENT file). This diff reunites the code paths by extracting the core logic into a function, CreateCustomCheckpoint(), that is customizable via callbacks to implement both checkpoint and backup. Related changes: - flush_before_backup is now forcibly enabled when 2PC is enabled - Extracted CheckpointImpl class definition into a header file. This is so the function, CreateCustomCheckpoint(), can be called by internal rocksdb code but not exposed to users. - Implemented more functions in DummyDB/DummyLogFile (in backupable_db_test.cc) that are used by CreateCustomCheckpoint(). Closes https://github.com/facebook/rocksdb/pull/1932 Differential Revision: D4622986 Pulled By: ajkr fbshipit-source-id: 157723884236ee3999a682673b64f7457a7a0d87	2017-04-24 15:06:46 -07:00
Siying Dong	ff97287016	Refactor compaction picker code Summary: 1. Move universal compaction picker to separate files compaction_picker_universal.cc and compaction_picker_universal.h. 2. Rename some functions to make the code easier to understand. 3. Move leveled compaction picking code to a dedicated class, so that we we don't need to pass some common variable around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() to smaller functions. Closes https://github.com/facebook/rocksdb/pull/2100 Differential Revision: D4845948 Pulled By: siying fbshipit-source-id: efa0ab4	2017-04-06 20:09:34 -07:00
Sagar Vemuri	343b59d6ee	Move various string utility functions into string_util Summary: This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper. Check the sub-commits for easier reviewing. Closes https://github.com/facebook/rocksdb/pull/2094 Differential Revision: D4837730 Pulled By: sagar0 fbshipit-source-id: 344278a	2017-04-06 14:54:12 -07:00
Yi Wu	df6f5a3772	Move memtable related files into memtable directory Summary: Move memtable related files into memtable directory. Closes https://github.com/facebook/rocksdb/pull/2087 Differential Revision: D4829242 Pulled By: yiwu-arbug fbshipit-source-id: ca70ab6	2017-04-06 14:09:13 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Siying Dong	ce64b8b719	Divide db/db_impl.cc Summary: db_impl.cc is too large to manage. Divide db_impl.cc into db/db_impl.cc, db/db_impl_compaction_flush.cc, db/db_impl_files.cc, db/db_impl_open.cc and db/db_impl_write.cc. Closes https://github.com/facebook/rocksdb/pull/2095 Differential Revision: D4838188 Pulled By: siying fbshipit-source-id: c5f3059	2017-04-05 17:24:19 -07:00
Andrew Kryczka	e2c6c06366	add TimedEnv Summary: I've needed Env timing measurements a few times now, so finally built something for it. Closes https://github.com/facebook/rocksdb/pull/2073 Differential Revision: D4811231 Pulled By: ajkr fbshipit-source-id: 218a249	2017-04-04 11:24:12 -07:00
Siying Dong	6ef8c620d3	Move auto_roll_logger and filename out of db/ Summary: It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency. Closes https://github.com/facebook/rocksdb/pull/2080 Differential Revision: D4821141 Pulled By: siying fbshipit-source-id: ca7d768	2017-04-03 18:39:14 -07:00
Adam Retter	0ee7f04039	Added missing options to RocksJava Summary: This adds almost all missing options to RocksJava Closes https://github.com/facebook/rocksdb/pull/2039 Differential Revision: D4779991 Pulled By: siying fbshipit-source-id: 4a1bf28	2017-03-30 12:09:21 -07:00
Maysam Yabandeh	a2f7a514d1	Refactoring Summary: This is the first split of https://github.com/facebook/rocksdb/pull/1891 and will be needed for the upcoming partitioned filter patch. Closes https://github.com/facebook/rocksdb/pull/1949 Differential Revision: D4652152 Pulled By: maysamyabandeh fbshipit-source-id: 9801778	2017-03-03 18:24:12 -08:00
Siying Dong	ba4c77bd6b	Divide external_sst_file_test Summary: Separate the platform dependent tests from external_sst_file_test. Only those tests need to run on platforms like OSX Closes https://github.com/facebook/rocksdb/pull/1923 Differential Revision: D4622461 Pulled By: siying fbshipit-source-id: d2d6f04	2017-02-28 14:24:11 -08:00
Siying Dong	8ad0fcdf99	Separate small subset tests in DBTest Summary: Separate a smal subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux. Closes https://github.com/facebook/rocksdb/pull/1924 Differential Revision: D4616702 Pulled By: siying fbshipit-source-id: 13e6549	2017-02-27 12:24:11 -08:00
Siying Dong	1ba2804b7f	Remove XFunc tests Summary: Xfunc is hardly used. Remove it to keep the code simple. Closes https://github.com/facebook/rocksdb/pull/1905 Differential Revision: D4603220 Pulled By: siying fbshipit-source-id: 731f96d	2017-02-23 12:09:11 -08:00
Islam AbdelRahman	574b543f80	Rename merger.h -> merging_iterator.h Summary: merger.h was always a confusing name for me, simply give the file a better name Closes https://github.com/facebook/rocksdb/pull/1836 Differential Revision: D4505357 Pulled By: IslamAbdelRahman fbshipit-source-id: 07b28d8	2017-02-02 16:54:19 -08:00
Andrew Kryczka	17c1180603	Generalize Env registration framework Summary: The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config. Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file. This diff generalizes the Env registration logic to work with arbitrary types. - Generalized registration and instantiation code by templating them - The entire implementation is in a header file as that's Google style guide's recommendation for template definitions - Pattern match with std::regex_match rather than checking prefix, which was the previous behavior - Rename functions/files to be non-Env-specific Closes https://github.com/facebook/rocksdb/pull/1776 Differential Revision: D4421933 Pulled By: ajkr fbshipit-source-id: 34647d1	2017-01-25 16:09:14 -08:00
Yi Wu	c270735861	Iterator should be in corrupted status if merge operator return false Summary: Iterator should be in corrupted status if merge operator return false. Also add test to make sure if max_successive_merges is hit during write, data will not be lost. Closes https://github.com/facebook/rocksdb/pull/1665 Differential Revision: D4322695 Pulled By: yiwu-arbug fbshipit-source-id: b327b05	2016-12-16 11:09:16 -08:00
Igor Canadi	3f407b065c	Kill flashcache code in RocksDB Summary: Now that we have userspace persisted cache, we don't need flashcache anymore. Closes https://github.com/facebook/rocksdb/pull/1588 Differential Revision: D4245114 Pulled By: igorcanadi fbshipit-source-id: e2c1c72	2016-12-01 10:09:22 -08:00
Andrew Kryczka	e333528991	DeleteRange write path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1578 Differential Revision: D4241171 Pulled By: ajkr fbshipit-source-id: ce5fd83	2016-11-29 11:09:22 -08:00
Islam AbdelRahman	86eb2b9ad9	Fix src.mk	2016-11-16 18:05:19 -08:00
Islam AbdelRahman	869ae5d786	Support IngestExternalFile (remove AddFile restrictions) Summary: Changes in the diff API changes: - Introduce IngestExternalFile to replace AddFile (I think this make the API more clear) - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file) - Deprecate AddFile() API Logic changes: - If our file overlap with the memtable we will flush the memtable - We will find the first level in the LSM tree that our file key range overlap with the keys in it - We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it - We will assign a global sequence number to our new file - Remove AddFile restrictions by using global sequence numbers Other changes: - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob Test Plan: unit tests (still need to add more) addfile_stress (https://reviews.facebook.net/D65037) Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong Reviewed By: sdong Subscribers: jkedgar, hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65061	2016-10-20 17:05:32 -07:00
Andrew Kryczka	6fbe96baf8	Compaction Support for Range Deletion Summary: This diff introduces RangeDelAggregator, which takes ownership of iterators provided to it via AddTombstones(). The tombstones are organized in a two-level map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data copy by holding Slices returned by the iterator, which remain valid thanks to pinning. For compaction, we create a hierarchical range tombstone iterator with structure matching the iterator over compaction input data. An aggregator based on that iterator is used by CompactionIterator to determine which keys are covered by range tombstones. In case of merge operand, the same aggregator is used by MergeHelper. Upon finishing each file in the compaction, relevant range tombstones are added to the output file's range tombstone metablock and file boundaries are updated accordingly. To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete() considers tombstones in the key's snapshot stripe. When this function is used outside of compaction, it also checks newer stripes, which can contain covering tombstones. Currently the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges within a stripe such that binary search can be used. RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range to a new table's range tombstone meta-block. Since range tombstones may fall in the gap between files, we may need to extend some files' key-ranges. The strategy is (1) first file extends as far left as possible and other files do not extend left, (2) all files extend right until either the start of the next file or the end of the last range tombstone in the gap, whichever comes first. One other notable change is adding release/move semantics to ScopedArenaIterator such that it can be used to transfer ownership of an arena-allocated iterator, similar to how unique_ptr is used for malloc'd data. Depends on D61473 Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927 Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62205	2016-10-18 12:04:56 -07:00
Injun Song	7260662b39	Add Java API for SstFileWriter Add Java API for SstFileWriter. Closes https://github.com/facebook/rocksdb/issues/1248	2016-09-29 17:04:41 -04:00
Yi Wu	9ed928e7a9	Split DBOptions into ImmutableDBOptions and MutableDBOptions Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs. Test Plan: make all check Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64065	2016-09-23 16:34:04 -07:00
Islam AbdelRahman	52ee07b021	Move AddFile() tests to external_sst_file_test.cc Summary: Simply move the tests Test Plan: make check -j64 Reviewers: andrewkr, lightmark, yiwu, yhchiang, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62529	2016-09-07 15:41:54 -07:00
Injun Song	ce1be2ce37	Fix build error on Windows (AppVeyor) (#1315 ) Add 'cf_options' to source list and db_imple.cc fix casting	2016-09-06 08:41:43 -07:00
Yi Wu	a88677d2cf	Remove ImmutableCFOptions from public API Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (`5665e5e285`) but later its reference in /include/rocksdb/table.h is removed. Test Plan: make all check Reviewers: IslamAbdelRahman, sdong, yhchiang Reviewed By: yhchiang Subscribers: yhchiang, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63177	2016-09-02 14:16:31 -07:00
Islam AbdelRahman	e9b2af87f8	Expose ThreadPool under include/rocksdb/threadpool.h Summary: This diff split ThreadPool to -ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h) -ThreadPoolImpl (actual implementation in util/threadpool_imp.h) This allow us to expose ThreadPool to the user so we can use it as an option later Test Plan: existing unit tests Reviewers: andrewkr, yiwu, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62085	2016-08-26 10:41:35 -07:00
Adam Retter	ffdf6eee19	Add Status to RocksDBException so that meaningful function result Status from the C++ API isn't lost (#1273 )	2016-08-22 11:01:42 -07:00
Yi Wu	4cc37f59e5	Introduce ClockCache Summary: Clock-based cache implemenetation aim to have better concurreny than default LRU cache. See inline comments for implementation details. Test Plan: Update cache_test to run on both LRUCache and ClockCache. Adding some new tests to catch some of the bugs that I fixed while implementing the cache. Reviewers: kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61647	2016-08-19 12:28:19 -07:00
sdong	8b79422b52	[Proof-Of-Concept] RocksDB Blob Storage with a blob log file. Summary: This is a proof of concept of a RocksDB blob log file. The actual value of the Put() is appended to a blob log using normal data block format, and the handle of the block is written as the value of the key in RocksDB. The prototype only supports Put() and Get(). It doesn't support DB restart, garbage collection, Write() call, iterator, snapshots, etc. Test Plan: Add unit tests. Reviewers: arahut Reviewed By: arahut Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61485	2016-08-10 17:05:17 -07:00
omegaga	44f5cc57a5	Add time series database (resubmitted) Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows). Test Plan: Add `date_tiered_test`. Reviewers: dhruba, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61653	2016-08-05 15:56:22 -07:00
sdong	7c4615cf1f	A utility function to help users migrate DB after options change Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options. Test Plan: Add unit tests for it. Reviewers: andrewkr, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60783	2016-08-05 15:39:55 -07:00
omegaga	d51dc96a79	Experiments on column-aware encodings Summary: Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format. There is still on-going work on this diff. More refactoring is necessary. Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added. Reviewers: sdong Reviewed By: sdong Subscribers: arahut, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60027	2016-08-01 14:50:19 -07:00
krad	c116b47804	Persistent Read Cache (part 6) Block Cache Tier Implementation Summary: The patch is a continuation of part 5. It glues the abstraction for file layout and metadata, and flush out the implementation of the API. It adds unit tests for the implementation. Test Plan: Run unit tests Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57549	2016-08-01 14:15:14 -07:00
Islam AbdelRahman	f8061a237e	Fix Statistics TickersNameMap miss match with Tickers enum Summary: TickersNameMap is not consistent with Tickers enum. this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics) Test Plan: added new unit test Reviewers: sdong, kradhakrishnan Reviewed By: kradhakrishnan Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61083	2016-07-25 16:05:50 -07:00
Yi Wu	32604e6601	Fix flush not being commit while writing manifest Summary: Fix flush not being commit while writing manifest, which is a recent bug introduced by D60075. The issue: # Options.max_background_flushes > 1 # Background thread A pick up a flush job, flush, then commit to manifest. (Note that mutex is released before writing manifest.) # Background thread B pick up another flush job, flush. When it gets to `MemTableList::InstallMemtableFlushResults`, it notices another thread is commiting, so it quit. # After the first commit, thread A doesn't double check if there are more flush result need to commit, leaving the second flush uncommitted. Test Plan: run the test. Also verify the new test hit deadlock without the fix. Reviewers: sdong, igor, lightmark Reviewed By: lightmark Subscribers: andrewkr, omegaga, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60969	2016-07-21 10:10:41 -07:00
krad	d9cfaa2b16	Persistent Read Cache (6) Persistent cache tier implentation - File layout Summary: Persistent cache tier is the tier abstraction that can work for any block device based device mounted on a file system. The design/implementation can handle any generic block device. Any generic block support is achieved by generalizing the access patten as {io-size, q-depth, direct-io/buffered}. We have specifically tested and adapted the IO path for NVM and SSD. Persistent cache tier consists of there parts : 1) File layout Provides the implementation for handling IO path for reading and writing data (key/value pair). 2) Meta-data Provides the implementation for handling the index for persistent read cache. 3) Implementation It binds (1) and (2) and flushed out the PersistentCacheTier interface This patch provides implementation for (1)(2). Follow up patch will provide (3) and tests. Test Plan: Compile and run check Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57117	2016-07-19 12:01:46 -07:00
Yi Wu	4b95253587	Refactor cache.cc Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc. Test Plan: make check -j64 Reviewers: lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D59655	2016-07-15 10:41:36 -07:00
Yi Wu	6ea41f8527	Fix deadlock when trying update options when write stalls Summary: When write stalls because of auto compaction is disabled, or stop write trigger is reached, user may change these two options to unblock writes. Unfortunately we had issue where the write thread will block the attempt to persist the options, thus creating a deadlock. This diff fix the issue and add two test cases to detect such deadlock. Test Plan: Run unit tests. Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60627	2016-07-12 15:30:38 -07:00
omegaga	e6f68faf99	Update Makefile to fix dependency Summary: In D33849 we updated Makefile to generate .d files for all .cc sources. Since we have more types of source files now, this needs to be updated so that this mechanism can work for new files. Test Plan: change a dependent .h file, re-make and see if .o file is recompiled. Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60591	2016-07-11 13:55:46 -07:00
Islam AbdelRahman	fa813f7478	Update DB::AddFile() to ingest the file to the lowest possible level Summary: DB::AddFile() right now always add the ingested file to L0 update the logic to add the file to the lowest possible level Test Plan: unit tests Reviewers: jkedgar, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59637	2016-06-21 17:57:59 -07:00
sdong	6faddd7c55	Merge db/slice.cc into util/slice.cc Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them. Test Plan: Run existing tests Reviewers: andrewkr, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59409	2016-06-10 16:37:36 -07:00
sdong	b2973eaaeb	Remove options builder Summary: AFIK, options builder is not used by anyone. Remove it. Test Plan: Run all existing tests. Reviewers: IslamAbdelRahman, andrewkr, igor Reviewed By: igor Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59319	2016-06-09 16:56:17 -07:00
krad	d755c62f92	Persistent Read Cache (5) Volatile cache tier implementation Summary: This provides provides an implementation of PersistentCacheTier that is specialized for RAM. This tier does not persist data though. Why do we need this tier ? This is ideal as tier 0. This tier can host data that is too hot. Why can't we use Cache variants ? Yes you can use them instead. This tier can potentially outperform BlockCache in RAW mode by virtue of compression and compressed cache in block cache doesn't seem very popular. Potentially this tier can be modified to under stand the disadvantage of the tier below and retain data that the tier below is bad at handling (for example index and bloom data that is huge in size) Test Plan: Run unit tests added Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57069	2016-06-07 11:10:44 -07:00
krad	3070ed9021	Persistent Read Cache (4) Interface definitions Summary: This diff provides the basic interface definitions of persistent read cache system PersistentCacheOptions captures the persistent read cache options used to configure and control the system PersistentCacheTier provides the basic building block for constructing tiered cache PersistentTieredCache provides a logical abstraction of tiers of cache layered over one another Test Plan: Compile Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57051	2016-06-06 13:17:09 -07:00
Andrew Kryczka	6e6622abb9	Create env_basic_test [pluggable Env part 2] Summary: Extracted basic Env-related tests from mock_env_test and memenv_test into a parameterized test for Envs: env_basic_test. Depends on D58449. (The dependency is here only so I can keep this series of diffs in a chain -- there is no dependency on that diff's code.) Test Plan: ran tests Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58635	2016-06-03 15:13:03 -07:00
Andrew Kryczka	af0c9ac01d	Env registry for URI-based Env selection [pluggable Env part 1] Summary: This enables configurable Envs without recompiling. For example, my next diff will make env_test test an Env created by NewEnvFromUri(). Then, users can determine which Env is tested simply by providing the URI for NewEnvFromUri() (e.g., through a CLI argument or environment variable). The registration process allows us to register any Env that is linked with the RocksDB library, so we can register our internal Envs as well. The registration code is inspired by our internal InitRegistry. Test Plan: new unit test Reviewers: IslamAbdelRahman, lightmark, ldemailly, sdong Reviewed By: sdong Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D58449	2016-06-03 08:15:16 -07:00
Yueh-Hsuan Chiang	88acd932f6	Allows db_bench to take an options file Summary: This patch allows db_bench to initialize it's RocksDB Options via a options file, specified by the --options_file flag. Note that if --options_file flag is set, then it has higher priority than the command-line argument. Test Plan: db_bench_tool_test Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58533	2016-06-02 16:24:14 -07:00
Aaron Gao	5d660258e7	add simulator Cache as class SimCache/SimLRUCache(with test) Summary: add class SimCache(base class with instrumentation api) and SimLRUCache(derived class with detailed implementation) which is used as an instrumented block cache that can predict hit rate for different cache size Test Plan: Add a test case in `db_block_cache_test.cc` called `SimCacheTest` to test basic logic of SimCache. Also add option `-simcache_size` in db_bench. if set with a value other than -1, then the benchmark will use this value as the size of the simulator cache and finally output the simulation result. ``` [gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 1000000 RocksDB: version 4.8 Date: Tue May 17 16:56:16 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 0 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 6.809 micros/op 146874 ops/sec; 16.2 MB/s DB path: [/tmp/rocksdbtest-112628/dbbench] readrandom : 6.343 micros/op 157665 ops/sec; 17.4 MB/s (1000000 of 1000000 found) SIMULATOR CACHE STATISTICS: SimCache LOOKUPs: 986559 SimCache HITs: 264760 SimCache HITRATE: 26.84% [gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 10000000 RocksDB: version 4.8 Date: Tue May 17 16:57:10 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 0 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 5.066 micros/op 197394 ops/sec; 21.8 MB/s DB path: [/tmp/rocksdbtest-112628/dbbench] readrandom : 6.457 micros/op 154870 ops/sec; 17.1 MB/s (1000000 of 1000000 found) SIMULATOR CACHE STATISTICS: SimCache LOOKUPs: 1059764 SimCache HITs: 374501 SimCache HITRATE: 35.34% [gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 100000000 RocksDB: version 4.8 Date: Tue May 17 16:57:32 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 0 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 5.632 micros/op 177572 ops/sec; 19.6 MB/s DB path: [/tmp/rocksdbtest-112628/dbbench] readrandom : 6.892 micros/op 145094 ops/sec; 16.1 MB/s (1000000 of 1000000 found) SIMULATOR CACHE STATISTICS: SimCache LOOKUPs: 1150767 SimCache HITs: 1034535 SimCache HITRATE: 89.90% ``` Reviewers: IslamAbdelRahman, andrewkr, sdong Reviewed By: sdong Subscribers: MarkCallaghan, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57999	2016-05-23 23:35:23 -07:00
sdong	1d725ca51d	Deprecate BlockBasedTableOptions.hash_index_allow_collision=false. Summary: Deprecate this one option and delete code and tests that are now superfluous. Test Plan: all tests pass Reviewers: igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: msalib, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55317	2016-05-20 17:52:27 -07:00
Islam AbdelRahman	1f2dca0eaa	Add MaxOperator to utilities/merge_operators/ Summary: Introduce MaxOperator a simple merge operator that return the max of all operands. This merge operand help me in benchmarking Test Plan: Add new unitttests Reviewers: sdong, andrewkr, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57873	2016-05-19 15:51:29 -07:00
omegaga	3c69f77c67	Move IO failure test to separate file Summary: This is a part of effort to reduce the size of db_test.cc. We move the following tests to a separate file `db_io_failure_test.cc`: * DropWrites * DropWritesFlush * NoSpaceCompactRange * NonWritableFileSystem * ManifestWriteError * PutFailsParanoid Test Plan: Run `make check` to see if the tests are working properly. Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58341	2016-05-18 17:09:20 -07:00
krad	a08c8c851a	Added PersistentCache abstraction Summary: Added a new abstraction to cache page to RocksDB designed for the read cache use. RocksDB current block cache is more of an object cache. For the persistent read cache project, what we need is a page cache equivalent. This changes adds a cache abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache uncompressed pages or raw pages (content as in filesystem). The user can choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode. Blame Rev: Test Plan: Run unit tests Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55707	2016-05-15 22:17:18 -07:00
Dmitri Smirnov	aab91b8d8f	Use generic threadpool for Windows environment (#1120 ) Conditionally retrofit thread_posix for use with std::thread and reuse the same logic. Posix users continue using Posix interfaces. Enable XPRESS compression in test runs. Fix master introduced signed/unsigned mismatch.	2016-05-12 18:34:04 -07:00
Reid Horuff	a657ee9a9c	[rocksdb] Recovery path sequence miscount fix Summary: Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert. [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)] [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)] [4: COMMIT(a)] [7: COMMIT(b)] The first two batches do not consume any sequence numbers so are both prefixed with seq=1. For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL. We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries. This is a valid WAL. Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains. We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b)) So, now upon recovery we must track the actual consumption of sequence numbers. In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct? Test Plan: provided test. Reviewers: sdong Subscribers: andrewkr, leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D57645	2016-05-10 14:06:07 -07:00
Andrew Kryczka	3f16a836a4	Introduce chroot Env Summary: For testing backups, we needed an Env that is fully isolated from other Envs on the same machine. Our in-memory Envs (MockEnv and InMemoryEnv) were insufficient because they don't implement most directory operations. This diff introduces a new Env, "ChrootEnv", that translates paths such that the chroot directory appears to be the root directory. This way, multiple Envs can be isolated in the filesystem by using different chroot directories. Since we use the filesystem, all directory operations are trivially supported. Test Plan: I parameterized the existing EnvPosixTest so it runs tests on ChrootEnv except the ioctl-related cases. Reviewers: sdong, lightmark, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57543	2016-05-06 17:42:50 -07:00
Yi Wu	792762c42c	Split db_test.cc Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase. Test Plan: make check Reviewers: sdong, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong Differential Revision: https://reviews.facebook.net/D56715	2016-04-18 09:42:50 -07:00
Siying Dong	774922c680	Merge pull request #1026 from SherlockNoMad/Hist Histogram Concurrency Improvement and Time-Windowing Support	2016-03-15 11:27:54 -07:00
SherlockNoMad	54f6b9e162	Histogram Concurrency Improvement and Time-Windowing Support	2016-03-11 16:54:25 -08:00
agiardullo	790252805d	Add multithreaded transaction test Summary: Refactored db_bench transaction stress tests so that they can be called from unit tests as well. Test Plan: run new unit test as well as db_bench Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55203	2016-03-11 15:16:52 -08:00
Yi Wu	2568985ab3	IOStatsContext::ToString() add option to exclude zero counters Summary: similar to D52809 add option to exclude zero counters. Test Plan: [yiwu@dev4504.prn1 ~/rocksdb] ./iostats_context_test [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from IOStatsContextTest [ RUN ] IOStatsContextTest.ToString [ OK ] IOStatsContextTest.ToString (0 ms) [----------] 1 test from IOStatsContextTest (0 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (0 ms total) [ PASSED ] 1 test. Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54591	2016-02-23 10:26:24 -08:00
Jonathan Wiepert	7bd284c374	Separeate main from bench functionality to allow cusomizations Summary: Isolate db_bench functionality from main so custom benchmark code can be written and managed Test Plan: Tested commands ./build_tools/regression_build_test.sh ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --reads=500 --writes=500 ./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --merge_keys=100 --numdistinct=100 --num_column_families=3 --num_hot_column_families=1 ./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5 Test Results - https://phabricator.fb.com/P56130387 Additional tests for: ./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 --key_size=8 --merge_operator=put ./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 --merge_operator=uint64add Results: https://phabricator.fb.com/P56130607 Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53991	2016-02-16 06:17:31 -08:00
Jonathan Wiepert	14a322033f	Remove references to files deleted in commit `abb4052278` Summary: Remove obolete references to files in src.mk Fix incorrect path for reference in source.mk Test Plan: Ran build to ensure changes do not break anything. Reviewers: leveldb, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53733	2016-02-03 16:47:45 -08:00
Islam AbdelRahman	d6c838f1e1	Add SstFileManager (component tracking all SST file in DBs and control the deletion rate) Summary: Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler SstFileTracker can be used later to abort writes when we exceed a specific size Test Plan: unit tests Reviewers: rven, anthony, yhchiang, sdong Reviewed By: sdong Subscribers: igor, lovro, march, dhruba Differential Revision: https://reviews.facebook.net/D50469	2016-01-28 18:35:01 -08:00
Andrew Kryczka	167bd8856d	[directory includes cleanup] Finish removing util->db dependencies	2016-01-26 10:49:24 -08:00
Andrew Kryczka	46f9cd46af	[directory includes cleanup] Move cross-function test points Summary: I split the db-specific test points out into a separate file under db/ directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it from compiling previously; see https://reviews.facebook.net/D36825. Test Plan: compilation works now, below command works, will also run "make xfunc". $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32 Reviewers: sdong, yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53343	2016-01-26 10:49:05 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
Igor Canadi	64fa43843b	Merge pull request #862 from ceph/wip-env implement EnvMirror	2015-12-10 18:45:07 -08:00
Sage Weil	2074ddd625	env: add EnvMirror This is an Env implementation that mirrors all storage-related methods on two different backend Env's and verifies that they return the same results (return status and read results). This is useful for implementing a new Env and verifying its correctness. Signed-off-by: Sage Weil <sage@redhat.com>	2015-12-10 21:32:45 -05:00
Javier González	b2863017b1	Move posix threads into a library Summary: This patch moves all posix thread logic to a separate library. The motivation is to allow another environments to easily reuse posix threads. HDFS wraps already posix threads; this split would simplify this code. Test Plan: No new functionality is added to posix Env or the threading library, thus the current tests should suffice.	2015-12-07 12:03:38 +01:00
Nathan Bronson	78812ec6bf	InlineSkipList - part 1/3 Summary: This diff is 1/3 in a sequence that introduces a skip list optimized for a key that is a freshly-allocated const char*. The diff is broken into pieces to make it easier to review. This piece only introduces the new type by copying the existing SkipList, with mechanical naming changes and reformatting. Test Plan: new unit test Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D51279	2015-11-24 14:30:22 -08:00
sdong	c4ebb66d61	Not to build forward_iterator_bench now Summary: forward_iterator_bench is not stable enough for build. Remove it for now. Test Plan: Build it with both of CLANG and non-CLANG and make sure it builds. Reviewers: rven, kradhakrishnan, anthony, yhchiang, igor, IslamAbdelRahman Reviewed By: igor, IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D50991	2015-11-18 15:42:06 -08:00
Venkatesh Radhakrishnan	7824444bfc	Reuse file iterators in tailing iterator when memtable is flushed Summary: Under a tailing workload, there were increased block cache misses when a memtable was flushed because we were rebuilding iterators in that case since the version set changed. This was exacerbated in the case of iterate_upper_bound, since file iterators which were over the iterate_upper_bound would have been deleted and are now brought back as part of the Rebuild, only to be deleted again. We now renew the iterators and only build iterators for files which are added and delete file iterators for files which are deleted. Refer to https://reviews.facebook.net/D50463 for previous version Test Plan: DBTestTailingIterator.TailingIteratorTrimSeekToNext Reviewers: anthony, IslamAbdelRahman, igor, tnovak, yhchiang, sdong Reviewed By: sdong Subscribers: yhchiang, march, dhruba, leveldb, lovro Differential Revision: https://reviews.facebook.net/D50679	2015-11-13 15:50:59 -08:00
Yueh-Hsuan Chiang	e11f676e34	Add OptionsUtil::LoadOptionsFromFile() API Summary: This patch adds OptionsUtil::LoadOptionsFromFile() and OptionsUtil::LoadLatestOptionsFromDB(), which allow developers to construct DBOptions and ColumnFamilyOptions from a RocksDB options file. Note that most pointer-typed options such as merge_operator will not be constructed. With this API, developers no longer need to remember all the options in order to reopen an existing rocksdb instance like the following: DBOptions db_options; std::vector<std::string> cf_names; std::vector<ColumnFamilyOptions> cf_opts; // Load primitive-typed options from an existing DB OptionsUtil::LoadLatestOptionsFromDB( dbname, &db_options, &cf_names, &cf_opts); // Initialize necessary pointer-typed options cf_opts[0].merge_operator.reset(new MyMergeOperator()); ... // Construct the vector of ColumnFamilyDescriptor std::vector<ColumnFamilyDescriptor> cf_descs; for (size_t i = 0; i < cf_opts.size(); ++i) { cf_descs.emplace_back(cf_names[i], cf_opts[i]); } // Open the DB DB* db = nullptr; std::vector<ColumnFamilyHandle*> cf_handles; auto s = DB::Open(db_options, dbname, cf_descs, &handles, &db); Test Plan: Augment existing tests in column_family_test options_test db_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49095	2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang	e114f0abb8	Enable RocksDB to persist Options file. Summary: This patch allows rocksdb to persist options into a file on DB::Open, SetOptions, and Create / Drop ColumnFamily. Options files are created under the same directory as the rocksdb instance. In addition, this patch also adds a fail_if_missing_options_file in DBOptions that makes any function call return non-ok status when it is not able to persist options properly. // If true, then DB::Open / CreateColumnFamily / DropColumnFamily // / SetOptions will fail if options file is not detected or properly // persisted. // // DEFAULT: false bool fail_if_missing_options_file; Options file names are formatted as OPTIONS-<number>, and RocksDB will always keep the latest two options files. Test Plan: Add options_file_test. options_test column_family_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48285	2015-11-10 22:58:01 -08:00
Nathan Bronson	b81b430987	Switch to thread-local random for skiplist Summary: Using a TLS random instance for skiplist makes it smaller (useful for hash_skiplist_rep) and prepares skiplist for concurrent adds. This diff also modifies the branching factor math to avoid an unnecessary division. This diff has the effect of changing the sequence of skip list node height choices made by tests, so it has the potential to cause unit test failures for tests that implicitly rely on the exact structure of the skip list. Tests that try to exactly trigger a compaction are likely suspects for this problem (these tests have always been brittle to changes in the skiplist details). I've minimizes this risk by reseeding the main thread's Random at the beginning of each test, increasing the universal compaction size_ratio limit from 101% to 105% for some tests, and verifying that the tests pass many times. Test Plan: for i in `seq 0 9`; do make check; done Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50439	2015-11-09 19:25:22 -08:00
Yueh-Hsuan Chiang	183cadfc87	Add OptionsSanityCheckLevel Summary: This patch introduces OptionsSanityCheckLevel internally to enable sanity check rocksdb options. Utilities API will be added in the follow-up diffs. Test Plan: Added more tests in options_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49515	2015-11-04 18:53:30 -08:00
Yueh-Hsuan Chiang	7d7ee2b654	Add Memory Insight support to utilities Summary: This patch introduces utilities/memory, which currently includes GetApproximateMemoryUsageByType that reports different types of rocksdb memory usage given a list of input DBs. The API also take care of the case where Cache could be shared across multiple column families / multiple db instances. Currently, it reports memory usage of memtable, table-readers and cache. Test Plan: utilities/memory/memory_test.cc Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49257	2015-11-03 17:52:17 -08:00
Javier González	6e6dd5f6f9	Split posix storage backend into Env and library Summary: This patch splits the posix storage backend into Env and the actual *File implementations. The motivation is to allow other Envs to use posix as a library. This enables a storage backend different from posix to split its secondary storage between a normal file system partition managed by posix, and it own media. Test Plan: No new functionality is added to posix Env or the library, thus the current tests should suffice.	2015-10-22 17:31:31 +02:00
Alexey Maykov	e1a09a7703	Implementation for GetPropertiesOfTablesInRange Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB. Test Plan: ran the GetPropertiesOfTablesInRange Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48651	2015-10-17 13:34:43 -07:00
Venkatesh Radhakrishnan	a98fbacfa0	Moving memtable related files from util to a new directory memtable Summary: We are cleaning up dependencies. This diff takes a first step at moving memtable files to their own directory called memtable. In future diffs, we will move other memtable files from db to memtable. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48915	2015-10-16 14:10:33 -07:00
Venkatesh Radhakrishnan	63e507c59c	Move ldb and sst_dump from utils to tools. Summary: As part of cleaning up dependencies for tech debt week, we are moving ldb and sst_dump tools from util to tools, since they are tools. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48747	2015-10-14 17:08:28 -07:00
Venkatesh Radhakrishnan	e587dbe03a	Move manual_compaction_test.cc from util to db Summary: manual_compaction_test.cc incorrectly in util. Moved to db. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48687	2015-10-14 11:06:27 -07:00
Venkatesh Radhakrishnan	f1fdf5205b	Clean up dependency: Move db_test_util.* to db directory Summary: As part of tech debt week, we are cleaning up dependencies. This diff moves db_test_util.[h,cc] from util to db directory. Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48543	2015-10-12 13:05:42 -07:00
Islam AbdelRahman	9babaeed16	Update dump_tool and undump_tool to accept Options Summary: Refactor dump_tool and undump_tool so that it's possible to use them with customized options for example setting a specific comparator similar to what Dragon is doing with the LdbTool https://phabricator.fb.com/diffusion/FBCODE/browse/master/dragon/tools/Ldb.cpp Test Plan: compiles used it to dump / undump a dragon shard Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba, adsharma Differential Revision: https://reviews.facebook.net/D47853	2015-10-05 19:49:48 -07:00
Evan Shaw	7a23e4d8ca	New amalgamation target This commit adds two new targets to the Makefile: rocksdb.cc and rocksdb.h These files, when combined with the c.h header, are a self-contained RocksDB source distribution called an amalgamation. (The name comes from SQLite's, which is similar in concept.) The main benefit of an amalgamation is that it's very easy to drop into a new project. It also compiles faster compared to compiling individual source files and potentially gives the compiler more opportunity to make optimizations since it can see all functions at once. rocksdb.cc and rocksdb.h are generated by a new script, amalgamate.py. A detailed description of how amalgamate.py works is in a comment at the top of the file. There are also some small changes to existing files to enable the amalgamation: * Use quotes for includes in unity build * Fix an old header inclusion in util/xfunc.cc * Move some includes outside ifdef in util/env_hdfs.cc * Separate out tool sources in Makefile so they won't be included in unity.cc * Unity build now produces a static library Closes #733	2015-10-01 08:29:31 +13:00
Yueh-Hsuan Chiang	74b100ac17	RocksDB Options file format and its serialization / deserialization. Summary: This patch defines the format of RocksDB options file, which follows the INI file format, and implements functions for its serialization and deserialization. An example RocksDB options file can be found in examples/rocksdb_option_file_example.ini. A typical RocksDB options file has three sections, which are Version, DBOptions, and more than one CFOptions. The RocksDB options file in general follows the basic INI file format with the following extensions / modifications: * Escaped characters We escaped the following characters: - \n -- line feed - new line - \r -- carriage return - \\ -- backslash \ - \: -- colon symbol : - \# -- hash tag # * Comments We support # style comments. Comments can appear at the ending part of a line. * Statements A statement is of the form option_name = value. Each statement contains a '=', where extra white-spaces are supported. However, we don't support multi-lined statement. Furthermore, each line can only contain at most one statement. * Section Sections are of the form [SecitonTitle "SectionArgument"], where section argument is optional. * List We use colon-separated string to represent a list. For instance, n1:n2:n3:n4 is a list containing four values. Below is an example of a RocksDB options file: [Version] rocksdb_version=4.0.0 options_file_version=1.0 [DBOptions] max_open_files=12345 max_background_flushes=301 [CFOptions "default"] [CFOptions "the second column family"] [CFOptions "the third column family"] Test Plan: Added many tests in options_test.cc Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46059	2015-09-29 14:42:40 -07:00
Assaf Sela	4805fa0eae	Remove ldb HexToString method's usage of sscanf Summary: Fix hex2String performance issues by removing sscanf dependency. Also fixed some edge case handling (odd length, bad input). Test Plan: Created a test file which called old and new implementation, and validated results are the same. I'll paste results in the phabricator diff. Reviewers: igor, rven, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: thatsafunnyname, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D46785	2015-09-23 14:25:46 -07:00
Islam AbdelRahman	f03b5c987b	Add experimental DB::AddFile() to plug sst files into empty DB Summary: This is an initial version of bulk load feature This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are (1) Memtables are empty (2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0 (3) Added sst files values are not overlapping Test Plan: unit testing Reviewers: igor, ott, sdong Reviewed By: sdong Subscribers: leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D39081	2015-09-23 12:42:43 -07:00
Andres Noetzli	8aa1f15197	Refactored common code of Builder/CompactionJob out into a CompactionIterator Summary: Builder and CompactionJob share a lot of fairly complex code. This patch refactors this code into a separate class, the CompactionIterator. Because the shared code is fairly complex, this patch hopefully improves maintainability. While there are is a lot of potential for further improvements, the patch is intentionally pretty close to the original structure because the change is already complex enough. Test Plan: make clean all check && ./db_stress Reviewers: rven, anthony, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46197	2015-09-10 14:35:25 -07:00
agiardullo	5e94f68f35	TransactionDB Custom Locking API Summary: Prototype of API to allow MyRocks to override default Mutex/CondVar used by transactions with their own implementations. They would simply need to pass their own implementations of Mutex/CondVar to the templated TransactionDB::Open(). Default implementation of TransactionDBMutex/TransactionDBCondVar provided (but the code is not currently changed to use this). Let me know if this API makes sense or if it should be changed Test Plan: n/a Reviewers: yhchiang, rven, igor, sdong, spetrunia Reviewed By: spetrunia Subscribers: maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43761	2015-09-08 17:03:57 -07:00
agiardullo	77a28615ec	Support static Status messages Summary: Provide a way to specify a detailed static error message for a Status without incurring a memcpy. Let me know what people think of this approach. Test Plan: added simple test Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D44259	2015-08-31 16:13:29 -07:00
agiardullo	20d1e547d1	Common base class for transactions Summary: As I keep adding new features to transactions, I keep creating more duplicate code. This diff cleans this up by creating a base implementation class for Transaction and OptimisticTransaction to inherit from. The code in TransactionBase.h/.cc is all just copied from elsewhere. The only entertaining part of this class worth looking at is the virtual TryLock method which allows OptimisticTransactions and Transactions to share the same common code for Put/Get/etc. The rest of this diff is mostly red and easy on the eyes. Test Plan: No functionality change. existing tests pass. Reviewers: sdong, jkedgar, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45135	2015-08-24 19:09:43 -07:00
agiardullo	c2f2cb0214	Pessimistic Transactions Summary: Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it. MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues. Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable. Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing. Test Plan: Unit tests, db_bench parallel testing. Reviewers: igor, rven, sdong, yhchiang, yoshinorim Reviewed By: sdong Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40869	2015-08-11 17:52:23 -07:00
agiardullo	16ea1c7d1c	simple ManagedSnapshot wrapper Summary: Implemented this simple wrapper for something else I was working on. Seemed like it makes sense to expose it instead of burying it in some random code. Test Plan: added test Reviewers: rven, kradhakrishnan, sdong, yhchiang Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43293	2015-08-06 17:59:05 -07:00
Poornima Chozhiyath Raman	960d936e83	Add function 'GetInfoLogList()' Summary: The list of info log files of a db can be obtained using the new function. Test Plan: New test in db_test.cc passed. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: IslamAbdelRahman, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41715	2015-08-05 16:16:46 -07:00
sdong	7ccd1c80a7	Add two unit tests for SyncWAL() Summary: Add two unit tests for SyncWAL(). One makes sure SyncWAL() doesn't block writes in the other thread. Another one makes sure SyncWAL() doesn't wait ongoing writes to finish before being executed. Create a new test file db_wal_test and move two WAL related tests from db_test to here. Test Plan: Run the new tests Reviewers: IslamAbdelRahman, rven, kradhakrishnan, kolmike, tnovak, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43605	2015-08-05 14:27:02 -07:00
Islam AbdelRahman	c45a57b41e	Support delete rate limiting Summary: Introduce DeleteScheduler that allow enforcing a rate limit on file deletion Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed. I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile Test Plan: added delete_scheduler_test existing unit tests Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43221	2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang	f39cbcb0a5	Merge pull request #654 from adamretter/remove-emptyvalue-compactionfilter RemoveEmptyValueCompactionFilter	2015-08-04 16:54:14 -07:00
Yueh-Hsuan Chiang	ce21afd205	Expose the BackupEngine from the Java API Summary: Merge pull request #665 by adamretter Exposes BackupEngine from C++ to the Java API. Previously only BackupableDB was available Test Plan: BackupEngineTest.java Reviewers: fyrz, igor, ankgup87, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D42873	2015-08-04 16:08:54 -07:00
Yueh-Hsuan Chiang	26894303c1	Add CompactOnDeletionCollector in utilities/table_properties_collectors. Summary: This diff adds CompactOnDeletionCollector in utilities/table_properties_collectors, which applies a sliding window to a sst file and mark this file as need-compaction when it observe enough deletion entries within the consecutive keys covered by the sliding window. Test Plan: compact_on_deletion_collector_test Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yoshinorim, sdong Reviewed By: sdong Subscribers: maykov, dhruba Differential Revision: https://reviews.facebook.net/D41175	2015-08-03 20:42:55 -07:00
Yueh-Hsuan Chiang	7219088cda	Move general compaction tests from db_test.cc to db_compaction_test.cc Summary: Move general compaction tests from db_test.cc to db_compaction_test.cc Test Plan: db_test db_compaction_test Reviewers: igor, sdong, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42651	2015-07-21 03:05:57 -07:00
Yueh-Hsuan Chiang	7462286d33	Move in-place-update related tests from db_test.cc to db_inplace_update_test.cc Summary: Move in-place-update related tests from db_test.cc to db_inplace_update_test.cc Test Plan: db_test db_inplace_update_test Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D42657	2015-07-20 16:05:28 -07:00
sdong	6e9fbeb27c	Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future. Test Plan: Run all existing unit tests. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42321	2015-07-17 16:58:18 -07:00
Adam Retter	91bf1b80ef	Java facility to use the RemoveEmptyValueCompactionFilter	2015-07-16 11:50:11 +01:00
Adam Retter	3d00271e40	The ability to specify a compaction filter via the Java API	2015-07-16 11:50:10 +01:00
Adam Retter	62dec0e2b7	RemoveEmptyValueCompactionFilter - A compaction filter which removes entries which have an empty value	2015-07-16 11:50:10 +01:00
agiardullo	81d072623c	move convenience.h out of utilities Summary: Moved convenience.h out of utilities to remove a dependency on utilities in db. Test Plan: unit tests. Also compiled a link to the old location to verify the _Pragma works. Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42201	2015-07-15 14:51:51 -07:00
Yueh-Hsuan Chiang	801df912af	Move UniversalCompaction related db-tests to db_universal_compaction_test.cc Summary: Move UniversalCompaction related db-tests to db_universal_compaction_test.cc Test Plan: db_test db_universal_compaction_test Reviewers: igor, sdong, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42225	2015-07-14 18:24:45 -07:00
Yueh-Hsuan Chiang	3ca6b2541e	Move TailingIterator tests from db_test.cc to db_test_tailing_iterator.cc Summary: Move TailingIterator tests from db_test.cc to db_test_tailing_iterator.cc Test Plan: db_test db_test_tailing_iterator Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D42021	2015-07-14 16:41:08 -07:00
Yueh-Hsuan Chiang	ce829c77e3	Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc Summary: Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc Test Plan: db_test db_log_iter_test Reviewers: sdong, IslamAbdelRahman, igor, anthony Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42045	2015-07-14 16:08:21 -07:00
Yueh-Hsuan Chiang	c3f98bb89b	Move CompactionFilter tests in db_test.cc to db_compaction_filter_test.cc Summary: Move CompactionFilter tests in db_test.cc to db_compaction_filter_test.cc Test Plan: db_test db_compaction_filter_test Reviewers: igor, sdong, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42207	2015-07-14 16:03:47 -07:00
agiardullo	18d5e1bf88	Remove db_impl_readonly dependency on utilities Summary: Seems like the cleanest way to resolve this is to move CompactedDBImpl into db/. CompactedDBImpl should probably live in the same place as DBImplReadonly since Opening the latter could end up instantiating the former. Both DBImplReadonly and CompactedDBImpl inherit from DBImpl access protected members of DBImpl( and the latter access friendly private methods). Test Plan: unit tests Reviewers: sdong, yhchiang, kradhakrishnan, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42027	2015-07-14 11:32:54 -07:00
Yueh-Hsuan Chiang	b10cf4e2e0	Move DynamicLevel related db-tests to db_dynamic_level_test.cc Summary: Move DynamicLevel related db-tests to db_dynamic_level_test.cc Test Plan: db_dynamic_level_test db_test Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42039	2015-07-13 19:00:30 -07:00
Yueh-Hsuan Chiang	625467a08a	Move reusable part of db_test.cc to util/db_test_util.h Summary: Move reusable part of db_test.cc to util/db_test_util.h. This makes it more possible to partition db_test.cc into multiple smaller test files. Also, fixed many old lint errors in db_test. Test Plan: db_test Reviewers: igor, anthony, IslamAbdelRahman, sdong, kradhakrishnan Reviewed By: sdong, kradhakrishnan Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41973	2015-07-13 16:53:38 -07:00
Venkatesh Radhakrishnan	04251e1e3a	Add wal files to Checkpoint for multiple column families. Summary: When there are multiple column families, the flush in GetLiveFiles is not atomic, so that there are entries in the wal files which are needed to get a consisten RocksDB. We now add the log files to the checkpoint. Test Plan: CheckpointCF - This test forces more data to be written to the other column families after the flush of the first column family but before the second. Reviewers: igor, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40323	2015-06-19 16:08:31 -07:00
Yueh-Hsuan Chiang	fe5c6321cb	Allow EventListener::OnCompactionCompleted to return CompactionJobStats. Summary: Allow EventListener::OnCompactionCompleted to return CompactionJobStats, which contains useful information about a compaction. Example CompactionJobStats returned by OnCompactionCompleted(): smallest_output_key_prefix 05000000 largest_output_key_prefix 06990000 elapsed_time 42419 num_input_records 300 num_input_files 3 num_input_files_at_output_level 2 num_output_records 200 num_output_files 1 actual_bytes_input 167200 actual_bytes_output 110688 total_input_raw_key_bytes 5400 total_input_raw_value_bytes 300000 num_records_replaced 100 is_manual_compaction 1 Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats. Reviewers: rven, igor, anthony, sdong Reviewed By: sdong Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38463	2015-06-02 17:07:16 -07:00
Mike Kolupaev	ec7a944360	more times in perf_context and iostats_context Summary: We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls: - perf_context.db_mutex_lock_nanos - perf_context.db_condition_wait_nanos - iostats_context.open_time - iostats_context.allocate_time - iostats_context.write_time - iostats_context.range_sync_time - iostats_context.logger_time In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long. Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took. Reviewers: rven, yhchiang, sdong Reviewed By: sdong Subscribers: march, dhruba, tnovak Differential Revision: https://reviews.facebook.net/D39177	2015-06-02 02:07:58 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
Yueh-Hsuan Chiang	ec4ff4e99c	Rename EventLoggerHelpers EventHelpers Summary: Rename EventLoggerHelpers EventHelpers, as it's going to include all event-related helper functions instead of EventLogger only stuffs. Test Plan: make Reviewers: sdong, rven, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39093	2015-05-28 13:37:47 -07:00
Igor Canadi	dbd95b7532	Add more table properties to EventLogger Summary: Example output: {"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}} Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter. Test Plan: make check + check out the output in the log Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38343	2015-05-12 15:53:55 -07:00
agiardullo	711465ccec	API to fetch from both a WriteBatchWithIndex and the db Summary: Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch. (This is needed for transactions). I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep). Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc. WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)). Test Plan: Added new unit tests. Reviewers: rven, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38037	2015-05-11 14:51:51 -07:00
Igor Canadi	6059bdf86a	Add experimental API MarkForCompaction() Summary: Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys). All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart. MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction. Test Plan: added a new unit test Reviewers: yhchiang, rven, MarkCallaghan, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37083	2015-04-17 16:44:45 -07:00
Venkatesh Radhakrishnan	0a0501c8d5	Add Xfunc to makefile Summary: Make target for running all xfunc tests Test Plan: make xfunc Reviewers: igor, sdong, meyering Reviewed By: meyering Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36873	2015-04-13 14:21:32 -07:00
Jim Meyering	cba5920011	build: don't use a glob for java/rocksjni/* Summary: * src.mk (JNI_NATIVE_SOURCES): New variable, so we don't have to use a glob in Makefile * Makefile (JNI_NATIVE_SOURCES): Remove glob-using definition, now that the explicit list of sources is in src.mk. Test Plan: Run this: JAVA_HOME=/usr/local/jdk-7u67-64 PATH=$JAVA_HOME/bin:$PATH \ make rocksdbjava Reviewers: yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36633	2015-04-07 15:19:25 -07:00
Igor Canadi	2511b7d947	Makefile minor cleanup Summary: Just couple of small changes: 1. removed signal_test, since it doesn't seem useful and we don't even run it as part of `make check` 2. moved perf_context_test to TESTS instead of PROGRAMS 3. `make release` probably shouldn't compile benchmarks. We currently rely on `make release` building db_bench (via Jenkins), so I left db_bench there. This is just a minor cleanup. We need to rethink our targets since they are a bit messy right now. We can do this during our tech debt week. Test Plan: make release Reviewers: anthony, rven, yhchiang, sdong, meyering Reviewed By: meyering Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36171	2015-03-30 16:05:35 -04:00
Igor Canadi	d61cb0b9de	db_bench can now disable flashcache for background threads Summary: Most of the approach is copied from WebSQL's MySQL branch. It's nice that we can do this without touching core RocksDB code. Test Plan: Compiles and runs. Didn't test flashback code, as I don't have flashback device and most if it is c/p Reviewers: MarkCallaghan, sdong Reviewed By: sdong Subscribers: rven, lgalanis, kradhakrishnan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35391	2015-03-30 09:51:11 -07:00
agiardullo	81345b90f9	Create an abstract interface for write batches Summary: WriteBatch and WriteBatchWithIndex now both inherit from a common abstract base class. This makes it easier to write code that is agnostic toward the implementation of the particular write batch. In particular, I plan on utilizing this abstraction to allow transactions to support using either implementation of a write batch. Test Plan: modified existing WriteBatchWithIndex tests to test new functions. Running all tests. Reviewers: igor, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34017	2015-03-17 19:23:08 -07:00
Igor Sugak	a7aba2ef6b	rocksdb: Add gtest Summary: Adds gtest fused source code into `third-party` directory. No manual changes. gtest latest released 1.7 has clang dev compilation errors. Trunk version requires only one disabled warning (-Wno-missing-field-initializers) Fused code is made as described here https://fburl.com/90806322 Details about why we need gtest source code instead of precompiled library https://fburl.com/90805763 Source used from http://googletest.googlecode.com/svn/trunk Test Plan: Build and notice no errors. Also check in logs that gtest-all.o being compiled gtest-all.o. ```lang=bash % USE_CLANG=1 make all ``` Reviewers: lgalanis, yufei.zhu, rven, sdong, igor, meyering Reviewed By: meyering Subscribers: meyering, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33345	2015-03-16 18:27:30 -07:00
Igor Canadi	52d8347a91	EventLogger Summary: Here's my proposal for making our LOGs easier to read by machines. The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize. I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc) Test Plan: Ran db_bench. Observed: 2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699} 2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12} Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31647	2015-03-13 10:15:54 -07:00
Venkatesh Radhakrishnan	05d92efa75	Add convenience.cc to src.mk Summary: The build process now requires new source files to be added to src.mk. Adding convenience.cc to src.mk Test Plan: Build rocksdb Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34815	2015-03-11 12:39:53 -07:00
Jim Meyering	34c75e984b	fix-up patch: avoid new link error Summary: * src.mk (LIB_SOURCES): Add this file: utilities/document/json_document_builder.cc to avoid link errors. Blame Rev: D33849 Test Plan: run "make" Reviewers: ljin, igor.sugak, rven, sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34593	2015-03-06 11:15:18 -08:00
Jim Meyering	ebc647de87	build: fix missing dependency problems Summary: Any time one would modify a dependent of any test.cc file, "make" would fail to rebuild the affected test binaries, e.g., db_test. That was due to the fact that we deliberately excluded those test-related files from the definition of SOURCES and only $(SOURCES) was used to create the automatically-generated .d dependency files. The fix is to generate a .d file for every source file. * src.mk: New file. Defines LIB_SOURCES, MOCK_SOURCES and TEST_BENCH_SOURCES. * Makefile: Include src.mk. Reflect s/SOURCES/LIB_SOURCES/ renaming. * build_tools/build_detect_platform: Remove the code that was used to generate SOURCES= and MOCK_SOURCES= definitions in make_config.mk. Those lists of files are now hard-coded in src.mk. Hard-coding this list of sources is desirable, because without that, one risks including stray .cc files in a build. Not reproducible. Test Plan: Touch a file used by db_test's dependent .o files and ensure that they are all recompiled. Before, none would be: $ touch db/db_impl.h && make db_test CC db/db_test.o CC db/column_family.o CC db/db_filesnapshot.o CC db/db_impl.o CC db/db_impl_debug.o CC db/db_impl_readonly.o CC db/forward_iterator.o CC db/internal_stats.o CC db/managed_iterator.o CC db/repair.o CC db/write_batch.o CC utilities/compacted_db/compacted_db_impl.o CC utilities/ttl/db_ttl_impl.o CC util/ldb_cmd.o CC util/ldb_tool.o CC util/sst_dump_tool.o CC util/xfunc.o CCLD db_test Reviewers: ljin, igor.sugak, igor, rven, sdong Reviewed By: sdong Subscribers: yhchiang, adamretter, fyrz, dhruba Differential Revision: https://reviews.facebook.net/D33849	2015-03-06 10:55:11 -08:00

... 3 4 5 6 7 ...

399 Commits