rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-26 07:30:54 +00:00

Author	SHA1	Message	Date
Maysam Yabandeh	638d239507	Charge block cache for cache internal usage (#5797 ) Summary: For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor. This PR builds up on https://github.com/facebook/rocksdb/issues/4258 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797 Test Plan: - Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata. - The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata. Differential Revision: D17396833 Pulled By: maysamyabandeh fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f	2019-09-16 15:26:21 -07:00
Wilfried Goesgens	fbab9913e2	upgrade gtest 1.7.0 => 1.8.1 for json result writing Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332 Differential Revision: D17242232 fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64	2019-09-09 11:24:11 -07:00
Zhongyi Xie	2f41ecfe75	Refactor trimming logic for immutable memtables (#5022 ) Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5	2019-08-23 13:55:34 -07:00
Vijay Nadimpalli	d150e01474	New API to get all merge operands for a Key (#5604 ) Summary: This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases: 1. Update subset of columns and read subset of columns - Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU. 2. Updating very few attributes in a value which is a JSON-like document - Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge. ---------------------------------------------------------------------------------------------------- API : Status GetMergeOperands( const ReadOptions& options, ColumnFamilyHandle* column_family, const Slice& key, PinnableSlice* merge_operands, GetMergeOperandsOptions* get_merge_operands_options, int* number_of_operands) Example usage : int size = 100; int number_of_operands = 0; std::vector<PinnableSlice> values(size); GetMergeOperandsOptions merge_operands_info; db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands); Description : Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion. merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604 Test Plan: Added unit test and perf test in db_bench that can be run using the command: ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist Differential Revision: D16657366 Pulled By: vjnadimpalli fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf	2019-08-06 14:26:44 -07:00
Levi Tamasi	092f417037	Move the uncompression dictionary object out of the block cache (#5584 ) Summary: RocksDB has historically stored uncompression dictionary objects in the block cache as opposed to storing just the block contents. This neccesitated evicting the object upon table close. With the new code, only the raw blocks are stored in the cache, eliminating the need for eviction. In addition, the patch makes the following improvements: 1) Compression dictionary blocks are now prefetched/pinned similarly to index/filter blocks. 2) A copy operation got eliminated when the uncompression dictionary is retrieved. 3) Errors related to retrieving the uncompression dictionary are propagated as opposed to silently ignored. Note: the patch temporarily breaks the compression dictionary evicition stats. They will be fixed in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584 Test Plan: make asan_check Differential Revision: D16344151 Pulled By: ltamasi fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b	2019-07-23 16:01:44 -07:00
haoyuhuang	8a008d4170	Block access tracing: Trace referenced key for Get on non-data blocks. (#5548 ) Summary: This PR traces the referenced key for Get for all types of blocks. This is useful when evaluating hybrid row-block caches. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5548 Test Plan: make clean && USE_CLANG=1 make check -j32 Differential Revision: D16157979 Pulled By: HaoyuHuang fbshipit-source-id: f6327411c9deb74e35e22a35f66cdbae09ab9d87	2019-07-17 13:05:58 -07:00
Levi Tamasi	3bde41b5a3	Move the filter readers out of the block cache (#5504 ) Summary: Currently, when the block cache is used for the filter block, it is not really the block itself that is stored in the cache but a FilterBlockReader object. Since this object is not pure data (it has, for instance, pointers that might dangle, including in one case a back pointer to the TableReader), it's not really sharable. To avoid the issues around this, the current code erases the cache entries when the TableReader is closed (which, BTW, is not sufficient since a concurrent TableReader might have picked up the object in the meantime). Instead of doing this, the patch moves the FilterBlockReader out of the cache altogether, and decouples the filter reader object from the filter block. In particular, instead of the TableReader owning, or caching/pinning the FilterBlockReader (based on the customer's settings), with the change the TableReader unconditionally owns the FilterBlockReader, which in turn owns/caches/pins the filter block. This change also enables us to reuse the code paths historically used for data blocks for filters as well. Note: Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504 Test Plan: make asan_check Differential Revision: D16036974 Pulled By: ltamasi fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091	2019-07-16 13:14:58 -07:00
haoyuhuang	6edc5d0719	Block cache tracing: Associate a unique id with Get and MultiGet (#5514 ) Summary: This PR associates a unique id with Get and MultiGet. This enables us to track how many blocks a Get/MultiGet request accesses. We can also measure the impact of row cache vs block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5514 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32 Differential Revision: D16032681 Pulled By: HaoyuHuang fbshipit-source-id: 775b05f4440badd58de6667e3ec9f4fc87a0af4c	2019-07-03 19:35:41 -07:00
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
haoyuhuang	705b8eecb4	Add more callers for table reader. (#5454 ) Summary: This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block. 1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name. 2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable. This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32. Differential Revision: D15819451 Pulled By: HaoyuHuang fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d	2019-06-20 14:31:48 -07:00
Vijay Nadimpalli	24b118ad98	Combine the read-ahead logic for user reads and compaction reads (#5431 ) Summary: Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431 Test Plan: make check Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5 Differential Revision: D15772533 Pulled By: vjnadimpalli fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9	2019-06-19 14:10:46 -07:00
Levi Tamasi	5355e527d9	Make the 'block read count' performance counters consistent (#5484 ) Summary: The patch brings the semantics of per-block-type read performance context counters in sync with the generic block_read_count by only incrementing the counter if the block was actually read from the file. It also fixes index_block_read_count, which fell victim to the refactoring in PR https://github.com/facebook/rocksdb/issues/5298. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5484 Test Plan: Extended the unit tests. Differential Revision: D15887431 Pulled By: ltamasi fbshipit-source-id: a3889759d0ac5759d56625d692cd828d1b9207a6	2019-06-18 19:03:24 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Levi Tamasi	1e35584251	Move the index readers out of the block cache (#5298 ) Summary: Currently, when the block cache is used for index blocks as well, it is not really the index block that is stored in the cache but an IndexReader object. Since this object is not pure data (it has, for instance, pointers that might dangle), it's not really sharable. To avoid the issues around this, the current code uses a dummy unique cache key for each TableReader to store the IndexReader, and erases the IndexReader entry when the TableReader is closed. Instead of doing this, the new code moves the IndexReader out of the cache altogether. In particular, instead of the TableReader owning, or caching/pinning the IndexReader based on the customer's settings, the TableReader unconditionally owns the IndexReader, which in turn owns/caches/pins the index block (which is itself sharable and thus can be safely put in the cache without any hacks). Note: the change has two side effects: 1) Partitions of partitioned indexes no longer affect the read amplification statistics. 2) Eviction statistics for index blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298 Differential Revision: D15303203 Pulled By: ltamasi fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47	2019-05-30 11:53:27 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo ve them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Mike Kolupaev	6a6aef25c1	Fix crash in BlockBasedTableIterator::Seek() (#5291 ) Summary: https://github.com/facebook/rocksdb/pull/5256 broke it: `block_iter_.user_key()` may not be valid even if `block_iter_points_to_real_block_` is true. E.g. if there was an IO error or Status::Incomplete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5291 Differential Revision: D15273324 Pulled By: al13n321 fbshipit-source-id: 442e5b09f9884a58f92a6ac1ca93af719c219886	2019-05-10 12:40:57 -07:00
yiwu-arbug	f1239d5f10	Avoid per-key upper bound check in BlockBasedTableIterator (#5142 ) Summary: This is second attempt for #5101. Original commit message: `BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons. This patch come with two fixes: Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter. Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142 Differential Revision: D14907113 Pulled By: siying fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb	2019-04-16 11:37:47 -07:00
Levi Tamasi	59ef2ba559	Evict the uncompression dictionary from the block cache upon table close (#5150 ) Summary: The uncompression dictionary object has a Statistics pointer that might dangle if the database closed. This patch evicts the dictionary from the block cache when a table is closed, similarly to how index and filter readers are handled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5150 Differential Revision: D14782422 Pulled By: ltamasi fbshipit-source-id: 0cec9336c742c479aa92206e04521767f1aa9622	2019-04-04 16:21:12 -07:00
Siying Dong	ebcc8ae1d3	Revert "Avoid per-key upper bound check in BlockBasedTableIterator (#5101 )" (#5132 ) Summary: This reverts commit `f29dc1b906`. In BlockBasedTableIterator, index_iter_->key() is sometimes a user key, so it is wrong to call ExtractUserKey() against it. This is a bug introduced by #5101. Temporarily revert the diff to keep the branch clean. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5132 Differential Revision: D14718584 Pulled By: siying fbshipit-source-id: 0ac55dc9b5dbc18c7809092146bdf7eb9364b9ad	2019-04-02 10:00:38 -07:00
Yi Wu	f29dc1b906	Avoid per-key upper bound check in BlockBasedTableIterator (#5101 ) Summary: `BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5101 Differential Revision: D14678707 Pulled By: siying fbshipit-source-id: 2372446116753c7892ea4cec7b4b49ef87ba463e	2019-03-29 13:11:46 -07:00
Shobhit Dayal	b45b1cde3e	Feature for sampling and reporting compressibility (#4842 ) Summary: This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms: 1. lz4 or snappy for fast compression 2. zstd or zlib for slow but higher compression. The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType. The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842 Differential Revision: D13629011 Pulled By: shobhitdayal fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa	2019-03-18 12:15:34 -07:00
Siying Dong	aef763b6d6	Make statistics's stats_level change thread-safe (#5030 ) Summary: Right now, users can change statistics.stats_level while DB is running, but TSAN may report data race. We make stats_level_ to be atomic, and access them using accessors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030 Differential Revision: D14267519 Pulled By: siying fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73	2019-03-01 10:42:09 -08:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Andrew Kryczka	62f70f6d14	Reduce scope of compression dictionary to single SST (#4952 ) Summary: Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio. So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include: - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called. - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up. - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952 Differential Revision: D13967980 Pulled By: ajkr fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f	2019-02-11 19:47:32 -08:00
Andrew Kryczka	8ec3e72551	Cache dictionary used for decompressing data blocks (#4881 ) Summary: - If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`. - If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle. - New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count" Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881 Differential Revision: D13663801 Pulled By: ajkr fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f	2019-01-23 18:15:47 -08:00
Yi Wu	512a5e3ef8	Fix BlockBasedTable not always using memory allocator if available (#4678 ) Summary: Fix block based table reader not using memory_allocator when allocating index blocks and compression dictionary blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4678 Differential Revision: D13054594 Pulled By: yiwu-arbug fbshipit-source-id: 379f25bcc665395662511c4f873f4b7b55104ce2	2018-11-28 18:01:24 -08:00
Abhishek Madan	8fe1e06ca0	Clean up FragmentedRangeTombstoneList (#4692 ) Summary: Removed `one_time_use` flag, which removed the need for some tests, and changed all `NewRangeTombstoneIterator` methods to return `FragmentedRangeTombstoneIterators`. These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones` and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692 Differential Revision: D13106570 Pulled By: abhimadan fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845	2018-11-28 15:29:02 -08:00
Yi Wu	05d9d82181	Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697 ) Summary: …ons (#4676)" This reverts commit `b32d087dbb`. `MemoryAllocator` needs to be with `Cache`, since cache entry can outlive DB and block based table. The cache needs to hold reference to memory allocator when deleting cache entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697 Differential Revision: D13133490 Pulled By: yiwu-arbug fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a	2018-11-21 11:29:57 -08:00
Siying Dong	b82e57d425	Remove two variables from BlockContents class and don't use class Block for compressed block (#4650 ) Summary: We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed. The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there. Some code is slightly refactored so that the cachable information can flow better. Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650 Differential Revision: D12969070 Pulled By: siying fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed	2018-11-13 17:02:55 -08:00
Yi Wu	b32d087dbb	Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676 ) Summary: Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy. It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676 Differential Revision: D13047662 Pulled By: yiwu-arbug fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd	2018-11-13 13:48:38 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Siying Dong	566fc8b994	Black list some valgrind tests (#4642 ) Summary: valgrind tests with 1 thread run too long. To make it shorter, black list some long tests. These are already blacklisted in parallel valgrind tests, but they are not in non-parallel mode Pull Request resolved: https://github.com/facebook/rocksdb/pull/4642 Differential Revision: D12945237 Pulled By: siying fbshipit-source-id: 04cf977d435996480fe87aa09f14b17975b74f7d	2018-11-06 14:22:36 -08:00
Bo Hou	cd9404bb77	xxhash 64 support Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607 Reviewed By: siying Differential Revision: D12836696 Pulled By: jsjhoubo fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870	2018-11-01 15:44:06 -07:00
Yi Wu	f560c8f5c8	s/CacheAllocator/MemoryAllocator/g (#4590 ) Summary: Rename the interface, as it is mean to be a generic interface for memory allocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590 Differential Revision: D10866340 Pulled By: yiwu-arbug fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16	2018-10-26 14:30:30 -07:00
Abhishek Madan	7528130e38	Cache fragmented range tombstones in BlockBasedTableReader (#4493 ) Summary: This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses. On the same DB used in #4449, running `readrandom` results in the following: ``` readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found) ``` Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results): ``` Tombstones? \| avg micros/op \| stddev micros/op \| avg ops/s \| stddev ops/s ----------------- \| ------------- \| ---------------- \| ------------ \| ------------ None \| 0.6186 \| 0.04637 \| 1,625,252.90 \| 124,679.41 500 Expanded \| 0.6019 \| 0.03628 \| 1,666,670.40 \| 101,142.65 500 Unexpanded \| 0.6435 \| 0.03994 \| 1,559,979.40 \| 104,090.52 1k Expanded \| 0.6034 \| 0.04349 \| 1,665,128.10 \| 125,144.57 1k Unexpanded \| 0.6261 \| 0.03093 \| 1,600,457.50 \| 79,024.94 5k Expanded \| 0.6163 \| 0.05926 \| 1,636,668.80 \| 154,888.85 5k Unexpanded \| 0.6402 \| 0.04002 \| 1,567,804.70 \| 100,965.55 10k Expanded \| 0.6036 \| 0.05105 \| 1,667,237.70 \| 142,830.36 10k Unexpanded \| 0.6128 \| 0.02598 \| 1,634,633.40 \| 72,161.82 25k Expanded \| 0.6198 \| 0.04542 \| 1,620,980.50 \| 116,662.93 25k Unexpanded \| 0.5478 \| 0.0362 \| 1,833,059.10 \| 121,233.81 50k Expanded \| 0.5104 \| 0.04347 \| 1,973,107.90 \| 184,073.49 50k Unexpanded \| 0.4528 \| 0.03387 \| 2,219,034.50 \| 170,984.32 ``` After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493 Differential Revision: D10842844 Pulled By: abhimadan fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509	2018-10-25 19:26:44 -07:00
Neil Mayhew	43dbd4411e	Adapt three unit tests with newer compiler/libraries (#4562 ) Summary: This fixes three tests that fail with relatively recent tools and libraries: The tests are: * `spatial_db_test` * `table_test` * `db_universal_compaction_test` I'm using: * `gcc` 7.3.0 * `glibc` 2.27 * `snappy` 1.1.7 * `gflags` 2.2.1 * `zlib` 1.2.11 * `bzip2` 1.0.6.0.1 * `lz4` 1.8.2 * `jemalloc` 5.0.1 The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`. However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`: ``` db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms) ``` I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562 Differential Revision: D10510917 Pulled By: maysamyabandeh fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189	2018-10-24 08:17:56 -07:00
Igor Canadi	1cf5deb8fd	Introduce CacheAllocator, a custom allocator for cache blocks (#4437 ) Summary: This is a conceptually simple change, but it touches many files to pass the allocator through function calls. We introduce CacheAllocator, which can be used by clients to configure custom allocator for cache blocks. Our motivation is to hook this up with folly's `JemallocNodumpAllocator` (`f43ce6d686/folly/experimental/JemallocNodumpAllocator.h`), but there are many other possible use cases. Additionally, this commit cleans up memory allocation in `util/compression.h`, making sure that all allocations are wrapped in a unique_ptr as soon as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437 Differential Revision: D10132814 Pulled By: yiwu-arbug fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564	2018-10-02 17:24:58 -07:00
Yanqin Jin	bb5dcea98e	Add path to WritableFileWriter. (#4039 ) Summary: We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039 Differential Revision: D8670178 Pulled By: riversand963 fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a	2018-08-23 10:12:58 -07:00
Fenggang Wu	9d646a6311	Add db_bench options of data block hash index (#4281 ) Summary: Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`. `--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`; `--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281 Differential Revision: D9361476 Pulled By: fgwu fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10	2018-08-16 18:42:46 -07:00
Fenggang Wu	19ec44fd39	Improve point-lookup performance using a data block hash index (#4174 ) Summary: Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.` The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174 Differential Revision: D8965914 Pulled By: fgwu fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198	2018-08-15 14:30:03 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	2018-07-20 14:43:38 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Andrew Kryczka	ab35505e21	Write properties metablock last in block-based tables (#4158 ) Summary: The properties meta-block should come at the end since we always need to read it when opening a file, unlike index/filter/other meta-blocks, which are sometimes read depending on the user's configuration. This ordering will allow us to (in a future PR) do a small readahead on the end of the file to read properties and meta-index blocks with one I/O. The bulk of this PR is a refactoring of the `BlockBasedTableBuilder::Finish` function. It was previously too large with inconsistent error handling, which made it difficult to change. So I broke it up into one function per meta-block write, and tried to make error handling consistent within those functions. Then reordering the metablocks was trivial -- just reorder the calls to these helper functions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4158 Differential Revision: D8921705 Pulled By: ajkr fbshipit-source-id: 96c9cc3182eb1adf11af46adab79dbeba7b12fcc	2018-07-20 09:11:59 -07:00
Siying Dong	8f06b4fa01	Separate some IndexBlockIter logic from BlockIter (#4136 ) Summary: Some logic only related to IndexBlockIter is separated from BlockIter to IndexBlockIter. This is done by writing an exclusive Seek() and SeekForPrev() for DataBlockIter, and all metadata block iter and tombstone block iter now use data block iter. Dealing with the BinarySeek() sharing problem by passing in the comparator to use. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136 Reviewed By: maysamyabandeh Differential Revision: D8859673 Pulled By: siying fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e	2018-07-16 10:13:18 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	2018-07-13 10:58:06 -07:00
Maysam Yabandeh	d4ad32d7bd	Refactor BlockIter (#4121 ) Summary: BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121 Differential Revision: D8816832 Pulled By: maysamyabandeh fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197	2018-07-12 17:27:31 -07:00
Yanqin Jin	39218a72a4	Increase the size of LRU cache. (#4090 ) Summary: Increase the size of each shard so that the number of cache hit/miss match expectation. Otherwise FilterBlockInBlockCache test will fail. Closes https://github.com/facebook/rocksdb/pull/4090 Differential Revision: D8736158 Pulled By: riversand963 fbshipit-source-id: 5cdbc06b02390389fd5b72a6d251d88949ad3d91	2018-07-05 11:45:11 -07:00
Maysam Yabandeh	29ffbb8a50	Charging block cache more accurately (#4073 ) Summary: Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it. Closes https://github.com/facebook/rocksdb/pull/4073 Differential Revision: D8686552 Pulled By: maysamyabandeh fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f	2018-06-29 08:57:20 -07:00
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	2018-06-27 17:13:34 -07:00
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	2018-06-26 15:57:26 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
Zhongyi Xie	80bc35927c	Should only decode restart points for uncompressed blocks (#3996 ) Summary: The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843. This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy. Closes https://github.com/facebook/rocksdb/pull/3996 Differential Revision: D8416186 Pulled By: miasantreble fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28	2018-06-15 19:26:58 -07:00
Fenggang Wu	3593275357	Remove restart point from the properties_block (#3970 ) Summary: Property block will be read sequentially and cached in a heap located object, so there's no need for restart points. Thus we set the restart interval to infinity to save space. Closes https://github.com/facebook/rocksdb/pull/3970 Differential Revision: D8332586 Pulled By: fgwu fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134	2018-06-12 12:57:37 -07:00
Fenggang Wu	f4502944c3	Change db path for BlockBasedTableTest.BadOptions (#3965 ) Summary: BadOptions test creates a temporary db path changed to table_block_based_bad_options_test to avoid collide with that created by the PrefixAndWholeKeyTest Closes https://github.com/facebook/rocksdb/pull/3965 Differential Revision: D8316080 Pulled By: fgwu fbshipit-source-id: bb8e0fdfdb9abf0e5ce94494b4388cd1622ee032	2018-06-08 12:57:14 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Maysam Yabandeh	402b7aa07f	Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd	2018-05-25 18:42:43 -07:00
Zhongyi Xie	c3ebc75843	Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c	2018-05-21 14:43:11 -07:00
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	2018-05-17 02:56:56 -07:00
Maysam Yabandeh	fc522bdb3e	Evenly split HarnessTest.Randomized Summary: Currently HarnessTest.Randomized is already split but some of the splits are faster than the others. The reason is that each split takes a continuous range of the generated args and the test with later args takes longer to finish. The patch evenly split the args among splits in a round robin fashion. Before: ``` [ OK ] HarnessTest.Randomized1n2 (2278 ms) [ OK ] HarnessTest.Randomized3n4 (1095 ms) [ OK ] HarnessTest.Randomized5 (658 ms) [ OK ] HarnessTest.Randomized6 (1258 ms) [ OK ] HarnessTest.Randomized7 (6476 ms) [ OK ] HarnessTest.Randomized8 (8182 ms) ``` After ``` [ OK ] HarnessTest.Randomized1 (2649 ms) [ OK ] HarnessTest.Randomized2 (2645 ms) [ OK ] HarnessTest.Randomized3 (2577 ms) [ OK ] HarnessTest.Randomized4 (2490 ms) [ OK ] HarnessTest.Randomized5 (2553 ms) [ OK ] HarnessTest.Randomized6 (2560 ms) [ OK ] HarnessTest.Randomized7 (2501 ms) [ OK ] HarnessTest.Randomized8 (2574 ms) ``` Closes https://github.com/facebook/rocksdb/pull/3808 Differential Revision: D7882663 Pulled By: maysamyabandeh fbshipit-source-id: 09b749a9684b6d7d65466aa4b00c5334a49e833e	2018-05-04 15:28:06 -07:00
Maysam Yabandeh	d2bcd7611f	Fix the memory leak with pinned partitioned filters Summary: The existing unit test did not set the level so the check for pinned partitioned filter/index being properly released from the block cache was not properly exercised as they only take effect in level 0. As a result a memory leak in pinned partitioned filters was hidden. The patch fix the test as well as the bug. Closes https://github.com/facebook/rocksdb/pull/3692 Differential Revision: D7559763 Pulled By: maysamyabandeh fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a	2018-04-09 16:28:19 -07:00
Anand Ananthabhotla	f9f4d40f93	Align SST file data blocks to avoid spanning multiple pages Summary: Provide a block_align option in BlockBasedTableOptions to allow alignment of SST file data blocks. This will avoid higher IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages. When this option is set to true, the block alignment is set to lower of block size and 4KB. Closes https://github.com/facebook/rocksdb/pull/3502 Differential Revision: D7400897 Pulled By: anand1976 fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44	2018-03-26 20:26:10 -07:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Zhongyi Xie	2f29991701	split RandomizedHarnessTest more ways Summary: RandomizedHarnessTest enumerates different combinations of test type, compression type, restart interval, etc. For some combinations it takes very long to finish, causing the test to time out in test infrastructure. This PR split the test input into smaller trunks in the hope that they will fit in the timeout window. Another possibility is to reduce `num_entries` of course Closes https://github.com/facebook/rocksdb/pull/3467 Differential Revision: D6910235 Pulled By: miasantreble fbshipit-source-id: 717246ee5d21a8a48ad82d4d9c04f9051a66f07f	2018-02-06 13:58:18 -08:00
Fosco Marotto	77dc069eb9	Change size_t cast in table_test Summary: Fixes this build error on master (macOS): ``` table/table_test.cc:972:27: error: implicit conversion loses integer precision: 'size_t' (aka 'unsigned long') to 'unsigned int' [-Werror,-Wshorten-64-to-32] ``` Closes https://github.com/facebook/rocksdb/pull/3434 Reviewed By: maysamyabandeh Differential Revision: D6840354 Pulled By: gfosco fbshipit-source-id: fffac6aefbbdd134ce1299453c5590aa855a5fc8	2018-01-30 11:12:51 -08:00
Maysam Yabandeh	46acdc9883	Split HarnessTest_Randomized to avoid timeout Summary: Split HarnessTest_Randomized to two tests Closes https://github.com/facebook/rocksdb/pull/3424 Differential Revision: D6826006 Pulled By: maysamyabandeh fbshipit-source-id: 59c9a11c7da092206effce6e4fa3792f9c66bef2	2018-01-29 07:41:44 -08:00
Yi Wu	dc360df81e	Fix multiple build failures Summary: * Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure * Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366 * Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled * Fix ASAN failure with DBBasicTest::DBClose test Closes https://github.com/facebook/rocksdb/pull/3373 Differential Revision: D6732313 Pulled By: yiwu-arbug fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5	2018-01-16 17:30:39 -08:00
Anand Ananthabhotla	199405192d	Add a BlockBasedTableOption to turn off index block compression. Summary: Add a new bool option index_uncompressed in BlockBasedTableOptions. Closes https://github.com/facebook/rocksdb/pull/3303 Differential Revision: D6686161 Pulled By: anand1976 fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054	2018-01-10 15:11:59 -08:00
Maysam Yabandeh	1efc600ddf	Preload l0 index partitions Summary: This fixes the existing logic for pinning l0 index partitions. The patch preloads the partitions into block cache and pin them if they belong to level 0 and pin_l0 is set. The drawback is that it does many small IOs when preloading all the partitions into the cache is direct io is enabled. Working for a solution for that. Closes https://github.com/facebook/rocksdb/pull/2661 Differential Revision: D5554010 Pulled By: maysamyabandeh fbshipit-source-id: 1e6f32a3524d71355c77d4138516dcfb601ca7b2	2017-08-18 10:56:20 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Aaron Gao	7f6c02dda1	using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… Summary: … headers https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB. We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term. make check -j64 Closes https://github.com/facebook/rocksdb/pull/2380 Differential Revision: D5177896 Pulled By: lightmark fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390	2017-06-02 17:26:19 -07:00
Andrew Kryczka	a4d9c02511	Pass CF ID to MemTableRepFactory Summary: Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff: - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload. - updates MemTable to create MemTableRep's using the new overload. Closes https://github.com/facebook/rocksdb/pull/2346 Differential Revision: D5108061 Pulled By: ajkr fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616	2017-06-02 12:12:06 -07:00
Maysam Yabandeh	40af2381ec	Object lifetime in cache Summary: Any non-raw-data dependent object must be destructed before the table closes. There was a bug of not doing that for filter object. This patch fixes the bug and adds a unit test to prevent such bugs in future. Closes https://github.com/facebook/rocksdb/pull/2246 Differential Revision: D5001318 Pulled By: maysamyabandeh fbshipit-source-id: 6d8772e58765485868094b92964da82ef9730b6d	2017-05-05 23:20:01 -07:00
Maysam Yabandeh	6798d1f3be	Revert "Delete filter before closing the table" Summary: This reverts commit `89833577a8`. Closes https://github.com/facebook/rocksdb/pull/2240 Differential Revision: D4986982 Pulled By: maysamyabandeh fbshipit-source-id: 56c4c07b7b5b7c6fe122d5c2f2199d221c8510c0	2017-05-02 13:46:39 -07:00
Maysam Yabandeh	89833577a8	Delete filter before closing the table Summary: Some filters such as partitioned filter have pointers to the table for which they are created. Therefore is they are stored in the block cache, the should be forcibly erased from block cache before closing the table, which would result into deleting the object. Otherwise the destructor will be called later when the cache is lazily erasing the object, which having the parent table no longer existent it could result into undefined behavior. Update: there will be still cases the filter is not removed from the cache since the table has not kept a pointer to the cache handle to be able to forcibly release it later. We make sure that the filter destructor does not access the table pointer to get around such cases. Closes https://github.com/facebook/rocksdb/pull/2207 Differential Revision: D4941591 Pulled By: maysamyabandeh fbshipit-source-id: 56fbab2a11cf447e1aa67caa30b58d7bd7ce5bbd	2017-05-01 19:19:37 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Maysam Yabandeh	e7731d119a	Configure index partition size Summary: Allow the users to specify the target index partition size. With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut. Closes https://github.com/facebook/rocksdb/pull/2041 Differential Revision: D4780216 Pulled By: maysamyabandeh fbshipit-source-id: 95a0831	2017-03-28 12:09:12 -07:00
Maysam Yabandeh	11526252cc	Pinnableslice (2nd attempt) Summary: PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: value 100 byte: 1.8% regular, 1.2% merge values value 1k byte: 11.5% regular, 7.5% merge values value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The d Closes https://github.com/facebook/rocksdb/pull/1756 Differential Revision: D4391738 Pulled By: maysamyabandeh fbshipit-source-id: 6f3edd3	2017-03-13 11:54:10 -07:00
Andrew Kryczka	f2817fb7f9	avoid ASSERT_EQ(false, ...); Summary: lately it fails on travis due to a compiler bug (see https://github.com/google/googletest/issues/322#issuecomment-125645145). interestingly it seems to affect occurrences of `ASSERT_EQ(false, ...);` but not `ASSERT_EQ(true, ...);`. Closes https://github.com/facebook/rocksdb/pull/1958 Differential Revision: D4680742 Pulled By: ajkr fbshipit-source-id: 291fe41	2017-03-08 22:24:16 -08:00
Maysam Yabandeh	69d5262c81	Two-level Indexes Summary: Partition Index blocks and use a Partition-index as a 2nd level index. The two-level index can be used by setting BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and configuring BlockBasedTableOptions::index_per_partition t15539501 Closes https://github.com/facebook/rocksdb/pull/1814 Differential Revision: D4473535 Pulled By: maysamyabandeh fbshipit-source-id: bffb87e	2017-02-06 16:39:12 -08:00
Dmitri Smirnov	0a4cdde50a	Windows thread Summary: introduce new methods into a public threadpool interface, - allow submission of std::functions as they allow greater flexibility. - add Joining methods to the implementation to join scheduled and submitted jobs with an option to cancel jobs that did not start executing. - Remove ugly `#ifdefs` between pthread and std implementation, make it uniform. - introduce pimpl for a drop in replacement of the implementation - Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread. - Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation. - should be no functionality changes. Closes https://github.com/facebook/rocksdb/pull/1823 Differential Revision: D4492902 Pulled By: siying fbshipit-source-id: c74cb11	2017-02-06 14:54:18 -08:00
Maysam Yabandeh	d0ba8ec8f9	Revert "PinnableSlice" Summary: This reverts commit `54d94e9c2c`. The pull request was landed by mistake. Closes https://github.com/facebook/rocksdb/pull/1755 Differential Revision: D4391678 Pulled By: maysamyabandeh fbshipit-source-id: 36d5149	2017-01-08 14:24:12 -08:00
Maysam Yabandeh	54d94e9c2c	PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: 1. value 100 byte: 1.8% regular, 1.2% merge values 2. value 1k byte: 11.5% regular, 7.5% merge values 3. value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled Closes https://github.com/facebook/rocksdb/pull/1732 Differential Revision: D4374613 Pulled By: maysamyabandeh fbshipit-source-id: a077f1a	2017-01-08 13:54:13 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Andrew Kryczka	815f54afad	Insert range deletion meta-block into block cache Summary: This handles two issues: (1) range deletion iterator sometimes outlives the table reader that created it, in which case the block must not be destroyed during table reader destruction; and (2) we prefer to read these range tombstone meta-blocks from file fewer times. - Extracted cache-populating logic from NewDataBlockIterator() into a separate function: MaybeLoadDataBlockToCache() - Use MaybeLoadDataBlockToCache() to load range deletion meta-block and pin it through the reader's lifetime. This code reuse works since range deletion meta-block has same format as data blocks. - Use NewDataBlockIterator() to create range deletion iterators, which uses block cache if enabled, otherwise reads the block from file. Either way, the underlying block won't disappear until after the iterator is destroyed. Closes https://github.com/facebook/rocksdb/pull/1459 Differential Revision: D4123175 Pulled By: ajkr fbshipit-source-id: 8f64281	2016-11-05 09:24:26 -07:00
Andrew Kryczka	f998c9790f	DeleteRange Get support Summary: During Get()/MultiGet(), build up a RangeDelAggregator with range tombstones as we search through live memtable, immutable memtables, and SST files. This aggregator is then used by memtable.cc's SaveValue() and GetContext::SaveValue() to check whether keys are covered. added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761 Closes https://github.com/facebook/rocksdb/pull/1456 Differential Revision: D4111271 Pulled By: ajkr fbshipit-source-id: 6e388d4	2016-11-03 18:54:20 -07:00
Andrew Kryczka	a0ba0aa877	Fix uninitialized variable gcc error for MyRocks Summary: make sure seq_ is properly initialized even if ParseInternalKey() fails. Test Plan: run myrocks release tests Reviewers: lightmark, mung, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65199	2016-10-19 10:59:46 -07:00
Islam AbdelRahman	b88f8e87c5	Support SST files with Global sequence numbers [reland] Summary: reland https://reviews.facebook.net/D62523 - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: sdong, yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65211	2016-10-18 16:59:37 -07:00
Islam AbdelRahman	d062328977	Revert "Support SST files with Global sequence numbers" This reverts commit `ab01da5437`.	2016-10-07 14:05:12 -07:00
Islam AbdelRahman	ab01da5437	Support SST files with Global sequence numbers Summary: - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: andrewkr, yhchiang, yiwu, sdong Reviewed By: sdong Subscribers: hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62523	2016-10-03 16:12:39 -07:00
Andrew Kryczka	6009c473c7	Store range tombstones in memtable Summary: - Store range tombstones in a separate MemTableRep instantiated with ColumnFamilyOptions::memtable_factory - MemTable::NewRangeTombstoneIterator() returns a MemTableIterator over the separate MemTableRep - Part of the read path is not implemented yet (i.e., MemTable::Get()) Test Plan: see unit tests Reviewers: wanning Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62217	2016-09-30 09:06:43 -07:00
Aaron Gao	f517d9dd09	Add SeekForPrev() to Iterator Summary: Add new Iterator API, `SeekForPrev`: find the last key that <= target key support prefix_extractor support prefix_same_as_start support upper_bound not supported in iterators without Prev() Also add tests in db_iter_test and db_iterator_test Pass all tests Cheers! Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64149	2016-09-27 18:20:57 -07:00
rockeet	4c3f4496b5	Add TableBuilderOptions::level and relevant changes (#1335 )	2016-09-17 22:30:43 -07:00
Yi Wu	81747f1be6	Refactor MutableCFOptions Summary: * Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions. * Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63945	2016-09-13 21:11:59 -07:00
sdong	607628d349	Support ZSTD with finalized format Summary: ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility. Still keep ZSTDNotFinal for compatibility reason. Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63141	2016-09-06 12:22:16 -07:00
Aaron Gao	c7004840d2	store prefix_extractor_name in table Summary: Make sure prefix extractor name is stored in SST files and if DB is opened with a prefix extractor of a different name, prefix bloom is skipped when read the file. Also add unit tests for that. Test Plan: before change: ``` Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from BlockBasedTableTest [ RUN ] BlockBasedTableTest.SkipPrefixBloomFilter table/table_test.cc:1421: Failure Value of: db_iter->Valid() Actual: false Expected: true [ FAILED ] BlockBasedTableTest.SkipPrefixBloomFilter (1 ms) [----------] 1 test from BlockBasedTableTest (1 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (1 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] BlockBasedTableTest.SkipPrefixBloomFilter 1 FAILED TEST ``` after: ``` Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from BlockBasedTableTest [ RUN ] BlockBasedTableTest.SkipPrefixBloomFilter [ OK ] BlockBasedTableTest.SkipPrefixBloomFilter (0 ms) [----------] 1 test from BlockBasedTableTest (0 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (0 ms total) [ PASSED ] 1 test. ``` Reviewers: sdong, andrewkr, yiwu, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61215	2016-08-26 11:46:32 -07:00
Aaron Gao	cec2c6436b	fix data race in NewIndexIterator() in block_based_table_reader.cc Summary: fixed data race described in https://github.com/facebook/rocksdb/issues/1267 and add regression test Test Plan: ./table_test --gtest_filter=BlockBasedTableTest.NewIndexIteratorLeak make all check -j64 core dump before fix. ok after fix. Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: igor, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62361	2016-08-23 18:20:41 -07:00
Wanning Jiang	78837f5d61	TableBuilder / TableReader support for range deletion Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test ) Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61473	2016-08-19 15:10:31 -07:00
John Alexander	9430333f84	New Statistics to track Compression/Decompression (#1197 ) * Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer). * Prevent incrementing compression counter when compression is turned off in options. * Prevent incrementing compression counter when compression is turned off in options. * Added two more supported compression types to test code in db_test.cc * Prevent incrementing compression counter when compression is turned off in options. * Added new StatsLevel that excludes compression timing. * Fixed casting error in coding.h * Fixed CompressionStatsTest for new StatsLevel. * Removed unused variable that was breaking the Linux build	2016-07-19 09:44:03 -07:00
Yi Wu	296545a2c7	Fix clang analyzer errors Summary: Fixing erros reported by clang static analyzer. * Removing some unused variables. * Adding assertions to fix false positives reported by clang analyzer. * Adding `__clang_analyzer__` macro to suppress false positive warnings. Test Plan: USE_CLANG=1 OPT=-g make analyze -j64 Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60549	2016-07-08 17:50:51 -07:00
sdong	32df9733d1	Add options.write_buffer_manager: control total memtable size across DB instances Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances. Test Plan: Add a new unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59925	2016-07-05 18:11:25 -07:00
Ashish Shenoy	fa3536d202	Store SST file compression algorithm as a TableProperty Summary: Store SST file compression algorithm as a TableProperty. Test Plan: Modified and ran the table_test UT that checks for TableProperties Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: lgalanis, andrewkr, dhruba, IslamAbdelRahman Differential Revision: https://reviews.facebook.net/D58017	2016-05-12 09:47:16 -07:00
Andrew Kryczka	843d2e3137	Shared dictionary compression using reference block Summary: This adds a new metablock containing a shared dictionary that is used to compress all data blocks in the SST file. The size of the shared dictionary is configurable in CompressionOptions and defaults to 0. It's currently only used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of the compression type if the user chooses a nonzero dictionary size. During compaction, computes the dictionary by randomly sampling the first output file in each subcompaction. It pre-computes the intervals to sample by assuming the output file will have the maximum allowable length. In case the file is smaller, some of the pre-computed sampling intervals can be beyond end-of-file, in which case we skip over those samples and the dictionary will be a bit smaller. After the dictionary is generated using the first file in a subcompaction, it is loaded into the compression library before writing each block in each subsequent file of that subcompaction. On the read path, gets the dictionary from the metablock, if it exists. Then, loads that dictionary into the compression library before reading each block. Test Plan: new unit test Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52287	2016-04-27 17:36:03 -07:00
Islam AbdelRahman	5bd4022fec	Add comparator, merge operator, property collectors to SST file properties (again) Summary: This is the original diff that I have landed and reverted and now I want to land again https://reviews.facebook.net/D34269 For old SST files we will show ``` comparator name: N/A merge operator name: N/A property collectors names: N/A ``` For new SST files with no merge operator name and with no property collectors ``` comparator name: leveldb.BytewiseComparator merge operator name: nullptr property collectors names: [] ``` for new SST files with these properties ``` comparator name: leveldb.BytewiseComparator merge operator name: UInt64AddOperator property collectors names: [DummyPropertiesCollector1,DummyPropertiesCollector2] ``` Test Plan: unittests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56487	2016-04-21 10:16:28 -07:00
Dmitri Smirnov	ee221d2de0	Introduce XPRESS compresssion on Windows. (#1081 ) Comparable with Snappy on comp ratio. Implemented using Windows API, does not require external package. Avaiable since Windows 8 and server 2012. Use -DXPRESS=1 with CMake to enable.	2016-04-19 22:54:24 -07:00
sdong	30d72ee43c	PrefixTest.PrefixAndWholeKeyTest should run against a different directory from prefix_test Summary: PrefixTest.PrefixAndWholeKeyTest runs against the same directory as prefix_test, which sometimes fail parallel tests. Fix it. Test Plan: Run it in parallel and see it doesn't fail anymore. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56541	2016-04-11 13:02:56 -07:00
sdong	1518b733eb	Change default number of cache shard bit to be 6 and max_file_opening_threads to be 16. Summary: Cache shard bit 4 is sometimes too small and 6 is a more common value picked by users. Make that default. It shouldn't hurt much to change options.max_file_opening_threads default to be 16, which will reduce the worst case DB open time. Test Plan: Run all existing tests. Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, igor, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55047	2016-04-07 13:55:10 -07:00
Andrew Kryczka	2391ef7214	Embed column family name in SST file Summary: Added the column family name to the properties block. This property is omitted only if the property is unavailable, such as when RepairDB() writes SST files. In a next diff, I will change RepairDB to use this new property for deciding to which column family an existing SST file belongs. If this property is missing, it will add it to the "unknown" column family (same as its existing behavior). Test Plan: New unit test: $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55605	2016-04-06 23:10:32 -07:00
Marton Trencseni	9b51987521	Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. Test Plan: 'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK. I didn't run the Java tests, I don't have Java set up on my devserver. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56133	2016-04-01 10:42:39 -07:00
sdong	b1fafcaca6	Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature." This reverts commit `522de4f59e`. It has bug of index block cleaning up.	2016-03-21 11:50:42 -07:00
Marton Trencseni	522de4f59e	Adding pin_l0_filter_and_index_blocks_in_cache feature. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction. Test Plan: Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false). DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32 Mac: OK. Linux: with D55287 patched in it's OK. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54801	2016-03-17 22:40:01 +00:00
Yueh-Hsuan Chiang	291ae4c206	Revert "Revert "Fixed the bug when both whole_key_filtering and prefix_extractor are set."" Summary: This reverts commit `73c31377bb`, which mistakenly reverts `73c31377bb` that fixes a bug when both whole_key_filtering and prefix_extractor are set Test Plan: revert the patch Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52707	2016-02-22 16:33:26 -08:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Islam AbdelRahman	8e6172bc57	Add BlockBasedTableOptions::index_block_restart_interval Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block Test Plan: unit tests Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: march, dhruba Differential Revision: https://reviews.facebook.net/D53721	2016-02-05 10:22:37 -08:00
Islam AbdelRahman	f3fb39814d	Fix BlockBasedTableTest.NoopTransformSeek failure Summary: table_test is failing because we are creating a temp InternalComparator 14:27:28 [ RUN ] BlockBasedTableTest.NoopTransformSeek 14:27:28 pure virtual method called 14:27:28 terminate called without an active exception 14:27:28 /bin/sh: line 7: 2346261 Aborted (core dumped) ./$t Test Plan: make table_test -j64 && ./table_test --gtest_filter="BlockBasedTableTest.NoopTransformSeek" Reviewers: igor, sdong, anthony, rven Reviewed By: rven Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52671	2016-01-07 09:48:29 -08:00
Yueh-Hsuan Chiang	73c31377bb	Revert "Fixed the bug when both whole_key_filtering and prefix_extractor are set." Summary: This patch reverts commit `57605d7ef3` as it will cause BlockBasedTableTest.NoopTransformSeek test crashes in some environment. Test Plan: revert the patch Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52623	2016-01-06 23:33:41 -08:00
Yueh-Hsuan Chiang	57605d7ef3	Fixed the bug when both whole_key_filtering and prefix_extractor are set. Summary: When both whole_key_filtering and prefix_extractor are set, RocksDB will mistakenly encode prefix + whole key into the database instead of simply whole key when BlockBasedTable is used. This patch fixes this bug. Test Plan: Add a test in table_test Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52233	2016-01-06 23:01:23 -08:00
Peter Mattis	260c29762a	Fix index seeking in BlockTableReader::PrefixMayMatch. PrefixMayMatch previously seeked in the prefix index using an internal key with a sequence number of 0. This would cause the prefix index seek to fall off the end if the last key in the index had a user-key greater than or equal to the key being looked for. Falling off the end of the index in turn results in PrefixMayMatch returning false if the index is in memory.	2016-01-06 16:39:56 -05:00
Jay Edgar	7699439b7c	Prevent the user from setting block_restart_interval to less than 1 Summary: If block_restart_interval gets set to less than 1 an assert will be triggered in BlockBuilder::BlockBuilder(). This prevents the user from doing this by silently setting any value less than 1 to 1. Test Plan: Added a test (in BlockBasedTableTest in table_test) that checks invalid values to make sure that they are reset to the expected values. The block_restart_interval value is checked along with block_size_deviation which also silently sets the value if it is outside a specific range. Reviewers: yoshinorim, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52509	2016-01-04 14:13:18 -08:00
Nathan Bronson	ac16663bd6	use -Werror=missing-field-initializers, to closer match MyRocks build Summary: myrocks seems to build rocksdb using -Wmissing-field-initializers (and treats warnings as errors). This diff adds that flag to the rocksdb build, and fixes the compilation failures that result. I have not checked for any other differences in the build flags for rocksdb build as part of myrocks. Test Plan: make check Reviewers: sdong, rven Reviewed By: rven Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52443	2015-12-30 14:56:18 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
Islam AbdelRahman	521da3abb3	Fix BlockBasedTableTest.BlockCacheLeak valgrind failure Summary: I added this line in my previous patch D48999 (which is incorrect) We should not release the iterator since releasing it will evict the blocks from cache Test Plan: Run the test under valgrind make check Reviewers: rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52161	2015-12-18 11:17:21 -08:00
Islam AbdelRahman	aececc209e	Introduce ReadOptions::pin_data (support zero copy for keys) Summary: This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed. Benchmark results (using https://phabricator.fb.com/P20083553) ``` // $ du -h /home/tec/local/normal.4K.Snappy/db10077 // 6.1G /home/tec/local/normal.4K.Snappy/db10077 // $ du -h /home/tec/local/zero.8K.LZ4/db10077 // 6.4G /home/tec/local/zero.8K.LZ4/db10077 // Benchmarks for shard db10077 // _build/opt/rocks/benchmark/rocks_copy_benchmark \ // --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \ // --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077" // First run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 1.73s 576.97m // BM_StringPiece 103.74% 1.67s 598.55m // ============================================================================ // Match rate : 1000000 / 1000000 // Second run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 611.99ms 1.63 // BM_StringPiece 203.76% 300.35ms 3.33 // ============================================================================ // Match rate : 1000000 / 1000000 ``` Test Plan: Unit tests Reviewers: sdong, igor, anthony, yhchiang, rven Reviewed By: rven Subscribers: dhruba, lovro, adsharma Differential Revision: https://reviews.facebook.net/D48999	2015-12-16 12:08:30 -08:00
Islam AbdelRahman	838676c17b	Revert "Adding new table properties" Summary: Reverting https://reviews.facebook.net/D34269 for now after I landed it a flaky test started continuously failing, I am almost sure this patch is not related to the test but I will revert it until I figure out why it's failing Test Plan: make check Reviewers: kradhakrishnan Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50385	2015-11-06 16:49:38 -08:00
Islam AbdelRahman	8be568a9c2	Adding new table properties Summary: This diff introduce new table properties that will be written for block based tables These properties are - comparator name - merge operator name - property collectors names Test Plan: - Added a new unit test to verify that these tests are written/read correctly - Running all other tests right now (wont land until all tests finish) Reviewers: rven, kradhakrishnan, igor, sdong, anthony, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34269	2015-11-06 11:19:01 -08:00
Venkatesh Radhakrishnan	a98fbacfa0	Moving memtable related files from util to a new directory memtable Summary: We are cleaning up dependencies. This diff takes a first step at moving memtable files to their own directory called memtable. In future diffs, we will move other memtable files from db to memtable. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48915	2015-10-16 14:10:33 -07:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Islam AbdelRahman	1fe78a4073	Fix tests failing in ROCKSDB_LITE Summary: Fix tests that compile under ROCKSDB_LITE but currently failing. table_test: RandomizedLongDB test is using internal stats which is not supported in ROCKSDB_LITE compaction_job_test: Using CompactionJobStats which is not supported perf_context_test: KeyComparisonCount test try to open DB in ReadOnly mode which is not supported Test Plan: run the tests under ROCKSDB_LITE Reviewers: yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48585	2015-10-13 10:32:05 -07:00
sdong	776bd8d5eb	Pass column family ID to table property collector Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently. Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct. Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48411	2015-10-09 14:36:51 -07:00
Islam AbdelRahman	51fa7ecec5	Bytes read/written from cache statistics Summary: Add 2 new counters BLOCK_CACHE_BYTES_WRITE, BLOCK_CACHE_BYTES_READ to keep track of how many bytes were written to the cache and how many bytes that we read from cache Test Plan: make check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48195	2015-10-07 15:17:20 -07:00
sdong	df34aea331	PlainTableReader to support non-mmap mode Summary: PlainTableReader now only allows mmap-mode. Add the support to non-mmap mode for more flexibility. Refactor the codes to move all logic of reading data to PlainTableKeyDecoder, and consolidate the calls to Read() call and ReadVarint32() call. Implement the calls for both of mmap and non-mmap case seperately. For non-mmap mode, make copy of keys in several places when we need to move the buffer after reading the keys. Test Plan: Add the mode of non-mmap case in plain_table_db_test. Run it in valgrind mode too. Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47187	2015-09-23 11:41:07 -07:00
Islam AbdelRahman	45e9e4f0bb	Refactor NewTableReader to accept TableReaderOptions Summary: Refactoring NewTableReader to accept TableReaderOptions This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071 Test Plan: run existing tests Reviewers: igor, yhchiang, anthony, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D46179	2015-09-11 11:36:33 -07:00
Igor Canadi	14456aea52	Fix compile Summary: There was a merge conflict with https://reviews.facebook.net/D45993 Test Plan: make check Reviewers: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46065	2015-09-02 16:05:53 -07:00
Igor Canadi	76f286cc82	Optimize bloom filter cache misses Summary: This optimizes the case when (cache_index_and_filter_blocks=1) and bloom filter is not present in the cache. Previously we did: 1. Read meta block from file 2. Read the filter position from the meta block 3. Read the filter Now, we pre-load the filter position on Table::Open(), so we can skip steps (1) and (2) on bloom filter cache miss. Instead of 2 IOs, we do only 1. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46047	2015-09-02 15:36:47 -07:00
Andres Noetzli	3c9cef1eed	Unified maps with Comparator for sorting, other cleanup Summary: This diff is a collection of cleanups that were initially part of D43179. Additionally it adds a unified way of defining key-value maps that use a Comparator for sorting (this was previously implemented in four different places). Test Plan: make clean check all Reviewers: rven, anthony, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45993	2015-09-02 13:58:22 -07:00
sdong	7a0dbdf3ac	Add ZSTD (not final format) compression type Summary: Add ZSTD compression type. The same way as adding LZ4. Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests. Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor Reviewed By: igor Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D45747	2015-08-28 11:01:13 -07:00
Andres Notzli	c465071029	Removing duplicate code Summary: While working on https://reviews.facebook.net/D43179 , I found duplicate code in the tests. This patch removes it. Test Plan: make clean all check Reviewers: igor, sdong, rven, anthony, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43263	2015-08-05 07:33:27 -07:00
Islam AbdelRahman	4853e228ef	Make table_test runnable in ROCKSDB_LITE Summary: Remove plain table tests from table_test since plain table is not supported in ROCKSDB_LITE Test Plan: table_test Reviewers: igor, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D42153	2015-07-20 11:09:14 -07:00
sdong	6e9fbeb27c	Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future. Test Plan: Run all existing unit tests. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42321	2015-07-17 16:58:18 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
Igor Canadi	5e067a7b19	Clean up compression logging Summary: Now we add warnings when user configures compression and the compression is not supported. Test Plan: Configured compression to non-supported values. Observed messages in my log: 2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2. 2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data. Reviewers: rven, sdong, yhchiang Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35979	2015-04-06 12:50:44 -07:00

1 2 3 4 5 ...

381 commits