rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-12-02 20:52:55 +00:00

Author	SHA1	Message	Date
Yanqin Jin	ce52274640	Replace 'string' with 'const string&' in FileOperationInfo (#4491 ) Summary: Using const string& can avoid one extra string copy. This PR addresses a recent comment made by siying on #3933. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4491 Differential Revision: D10381211 Pulled By: riversand963 fbshipit-source-id: 27fc2d65d84bc7cd07833c77cdc47f06dcfaeb31	2018-10-15 13:46:01 -07:00
Yanqin Jin	729a617b5b	Add listener to sample file io (#3933 ) Summary: We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires adding callbacks to intercept file IO function calls when RocksDB is running. To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener`. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933 Differential Revision: D10219571 Pulled By: riversand963 fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15	2018-10-12 18:36:11 -07:00
Peter Pei	09814f2cfc	support OnCompactionBegin (#4431 ) Summary: fix #4288 Add `OnCompactionBegin` support to `rocksdb::EventListener`. Currently, we only have these three callbacks: - OnFlushBegin - OnFlushCompleted - OnCompactionCompleted As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`. This PR is a try to implement the support of `OnCompactionBegin`. Hope it is useful to you. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431 Differential Revision: D10055515 Pulled By: yiwu-arbug fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334	2018-10-10 17:32:27 -07:00
Fosco Marotto	35f26beca5	Update version macro for 5.17 (#4472 ) Summary: Forgot this in previous commit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4472 Differential Revision: D10244227 Pulled By: gfosco fbshipit-source-id: ba0cf7a2f5271f0d9f9443004e2620887cd5fd11	2018-10-08 16:22:17 -07:00
DorianZheng	e0f05754ba	Expose column family id to OnCompactionCompleted (#4466 ) Summary: PTAL Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466 Differential Revision: D10241358 Pulled By: yiwu-arbug fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e	2018-10-08 14:24:16 -07:00
Andrew Gallagher	897fe6a4a3	rocksdb: put `#pragma once` before `#ifdef` Summary: Work around upstream bug with modules: https://bugs.llvm.org/show_bug.cgi?id=39184. Reviewed By: yiwu-arbug Differential Revision: D10209569 fbshipit-source-id: 696853a02a3869e9c33d0e61168ad4b0436fa3c0	2018-10-04 17:10:21 -07:00
Zhongyi Xie	ce1fc5af09	fix unused param `allocator` in compression.h (#4453 ) Summary: this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453 Differential Revision: D10202626 Pulled By: miasantreble fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355	2018-10-04 13:24:22 -07:00
Igor Canadi	1cf5deb8fd	Introduce CacheAllocator, a custom allocator for cache blocks (#4437 ) Summary: This is a conceptually simple change, but it touches many files to pass the allocator through function calls. We introduce CacheAllocator, which can be used by clients to configure custom allocator for cache blocks. Our motivation is to hook this up with folly's `JemallocNodumpAllocator` (`f43ce6d686/folly/experimental/JemallocNodumpAllocator.h`), but there are many other possible use cases. Additionally, this commit cleans up memory allocation in `util/compression.h`, making sure that all allocations are wrapped in a unique_ptr as soon as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437 Differential Revision: D10132814 Pulled By: yiwu-arbug fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564	2018-10-02 17:24:58 -07:00
Andrew Kryczka	ac6f435a9a	Fix CompactFiles support for kDisableCompressionOption (#4438 ) Summary: Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused program to crash on assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438 Differential Revision: D10115761 Pulled By: ajkr fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122	2018-10-01 01:18:10 -07:00
Anand Ananthabhotla	72712f4e28	Allow dynamic modification of window size and deletion trigger (#4403 ) Summary: Make the CompactOnDeletionCollectorFactory class public, and provide methods to update the window size and deletion trigger params. These will take effect on subsequent created SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4403 Differential Revision: D9976857 Pulled By: anand1976 fbshipit-source-id: 31dbf0511c12fa2bb9b2a7ba620079e0ee09cf48	2018-09-20 15:15:28 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accommodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition. 2. Errors during write callback and flush are treated as hard errors, i.e. the database is put in read-only mode and goes back to read-write only after certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurrence of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes.
b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occurred. c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in read-write mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM. 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recovery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
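
The space accounting described in the commit above (#4164) can be sketched as a small standalone model. This is not RocksDB code: `CompactionInput`, `DbInstance`, `RequiredCompactionSpace`, `RecoveryHeadroom`, and `CanScheduleCompaction` are hypothetical names introduced here, and the `>=` comparison is an assumption about what "enough available space" means.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; these are not RocksDB types.
struct CompactionInput {
  uint64_t file_size;  // size in bytes of one compaction input file
};
struct DbInstance {
  uint64_t write_buffer_size;  // per-DB memtable size in bytes
};

// Soft-error path: the space a compaction needs is estimated as the
// sum of its input file sizes, as the commit message states.
uint64_t RequiredCompactionSpace(const std::vector<CompactionInput>& inputs) {
  uint64_t total = 0;
  for (const auto& in : inputs) total += in.file_size;
  return total;
}

// Hard-error path: headroom is the sum of write_buffer_size over all
// DB instances that share one SstFileManager.
uint64_t RecoveryHeadroom(const std::vector<DbInstance>& dbs) {
  uint64_t total = 0;
  for (const auto& db : dbs) total += db.write_buffer_size;
  return total;
}

// A compaction is delayed until free space covers the estimate.
// Assumption: "enough" is modeled as free_bytes >= estimate.
bool CanScheduleCompaction(uint64_t free_bytes,
                           const std::vector<CompactionInput>& inputs) {
  return free_bytes >= RequiredCompactionSpace(inputs);
}
```

In this model a caller would poll `CanScheduleCompaction` with the current free space and retry the delayed compaction once it returns true.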
Vitaly Isaev	0bd2ede10e	Memory usage stats in C API (#4340 ) Summary: Please consider this small PR providing access to the `MemoryUsage::GetApproximateMemoryUsageByType` function in plain C API. Actually I'm working on Go application and now trying to investigate the reasons of high memory consumption (#4313). Go [wrappers](https://github.com/tecbot/gorocksdb) are built on the top of Rocksdb C API. According to the #706, `MemoryUsage::GetApproximateMemoryUsageByType` is considered as the best option to get database internal memory usage stats, but it wasn't supported in C API yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4340 Differential Revision: D9655135 Pulled By: ajkr fbshipit-source-id: a3d2f3f47c143ae75862fbcca2f571ea1b49e14a	2018-09-13 14:27:31 -07:00
Maysam Yabandeh	3f5282268f	Skip concurrency control during recovery of pessimistic txn (#4346 ) Summary: TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be used as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/committed before new transactions start. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346 Differential Revision: D9759149 Pulled By: maysamyabandeh fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b	2018-09-10 16:57:53 -07:00
Kefu Chai	faf529fd7c	env_librados.h: drop redundant #endif (#4354 ) Summary: without this change, rocksdb_env_librados_test fails to build. it's a regression introduced by `64324e32` Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4354 Differential Revision: D9702665 Pulled By: riversand963 fbshipit-source-id: 65134eaff0543733210edfc77f89c96709da7a3f	2018-09-07 11:12:44 -07:00
Maysam Yabandeh	655ef7d77f	Inline doc for format_version 4 (#4350 ) Summary: Fixes #4337 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4350 Differential Revision: D9700871 Pulled By: maysamyabandeh fbshipit-source-id: fe1e07803783f34588dc14aba66d51117ca4a180	2018-09-07 07:57:30 -07:00
cngzhnp	64324e329e	Support pragma once in all header files and cleanup some warnings (#4339 ) Summary: As you know, almost all compilers support the "pragma once" directive instead of using include guards. To keep header files consistent, all header files are edited. Besides this, try to fix some warnings about loss of data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339 Differential Revision: D9654990 Pulled By: ajkr fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848	2018-09-05 18:13:31 -07:00
Yi Wu	462ed70d64	BlobDB: GetLiveFiles and GetLiveFilesMetadata return relative path (#4326 ) Summary: `GetLiveFiles` and `GetLiveFilesMetadata` should return path relative to db path. It is a separate issue when `path_relative` is false how can we return relative path. But `DBImpl::GetLiveFiles` don't handle it as well when there are multiple `db_paths`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4326 Differential Revision: D9545904 Pulled By: yiwu-arbug fbshipit-source-id: 6762d879fcb561df2b612e6fdfb4a6b51db03f5d	2018-08-31 12:12:49 -07:00
Mikhail Antonov	927f274939	Avoiding write stall caused by manual flushes (#4297 ) Summary: Basically at the moment it seems it's possible to cause write stall by calling flush (either manually via DB::Flush(), or from Backup Engine directly calling FlushMemTable()) while background flush may be already happening. One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to a separate method and call it from FlushMemTable. This is a draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297 Differential Revision: D9420705 Pulled By: mikhail-antonov fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc	2018-08-29 12:12:55 -07:00
Gauresh Rane	ad789e4e0d	Adding a method for memtable class for memtable getting flushed. (#4304 ) Summary: Memtables are selected for flushing by the flush job. Currently we have a listener which is invoked when memtables for a column family are flushed. That listener does not indicate which memtable was flushed in the notification. If clients want to know if particular data in the memtable was retired, there is no straightforward way to know this. This method will help users who implement memtablerep factory and extend interface for memtablerep, to know if the data in the memtable was retired. Another option that was tried was to depend on the memtable destructor being called after flush to mark that data was persisted. This works all the time, but sometimes there can be huge delays between the actual flush happening and the memtable getting destroyed. Hence, anyone who is waiting for data to persist will have to wait that much longer. It is expected that anyone who implements this method returns quickly, as it blocks RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4304 Reviewed By: riversand963 Differential Revision: D9472312 Pulled By: gdrane fbshipit-source-id: 8e693308dee749586af3a4c5d4fcf1fa5276ea4d	2018-08-23 17:14:25 -07:00
Yanqin Jin	bb5dcea98e	Add path to WritableFileWriter. (#4039 ) Summary: We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039 Differential Revision: D8670178 Pulled By: riversand963 fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a	2018-08-23 10:12:58 -07:00
jsteemann	90f744941d	adds missing PopSavePoint method to Transaction (#4256 ) Summary: Transaction has had methods to deal with SavePoints already, but was missing the PopSavePoint method provided by WriteBatch and WriteBatchWithIndex. This PR adds PopSavePoint to Transaction as well. Having the method on Transaction-level too is useful for applications that repeatedly execute a sequence of operations that normally succeed, but infrequently need to get rolled back. Using SavePoints here is sensible, but as operations normally succeed the application may pile up a lot of useless SavePoints inside a Transaction, leading to slightly increased memory usage for managing the unneeded SavePoints. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4256 Differential Revision: D9326932 Pulled By: yiwu-arbug fbshipit-source-id: 53a0af18a6c7e87feff8a56f1f3eab9df7f371d6	2018-08-17 11:57:30 -07:00
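
The difference between rolling back to a save point and popping one, as described in the commit above (#4256), can be sketched with a toy batch. This is an illustrative model, not `rocksdb::Transaction` or `rocksdb::WriteBatch`: `ToyBatch` and its members are hypothetical, and it returns `bool` where the real API reports a status.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy batch with a save-point stack; hypothetical, not a RocksDB class.
class ToyBatch {
 public:
  void Put(const std::string& key) { ops_.push_back(key); }

  // Remember the current number of operations.
  void SetSavePoint() { save_points_.push_back(ops_.size()); }

  // Undo every operation since the most recent save point, then drop
  // that save point. Returns false if there is no save point.
  bool RollbackToSavePoint() {
    if (save_points_.empty()) return false;
    ops_.resize(save_points_.back());
    save_points_.pop_back();
    return true;
  }

  // Drop the most recent save point but keep the operations made
  // since it. Returns false if there is no save point.
  bool PopSavePoint() {
    if (save_points_.empty()) return false;
    save_points_.pop_back();
    return true;
  }

  size_t NumOps() const { return ops_.size(); }
  size_t NumSavePoints() const { return save_points_.size(); }

 private:
  std::vector<std::string> ops_;
  std::vector<size_t> save_points_;
};
```

An application that sets a save point before each step and pops it when the step succeeds keeps the save-point stack from growing, which is the memory concern the commit message raises.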
Fenggang Wu	9d646a6311	Add db_bench options of data block hash index (#4281 ) Summary: Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`. `--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`; `--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281 Differential Revision: D9361476 Pulled By: fgwu fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10	2018-08-16 18:42:46 -07:00
Siying Dong	9c0c8f5ff6	GetAllKeyVersions() to take an extra argument of `max_num_ikeys`. (#4271 ) Summary: Right now, `ldb idump` may have memory out of control if there is a big range of tombstones. Add an option to cap the maximum number of keys in GetAllKeyVersions(), and push down --max_num_ikeys from ldb. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4271 Differential Revision: D9369149 Pulled By: siying fbshipit-source-id: 7cbb797b7d2fa16573495a7e84937456d3ff25bf	2018-08-16 15:57:08 -07:00
Andrey Zagrebin	aeed4f0749	#3865 fix performance regression introduced by MergeOperator.ShouldMerge (#4266 ) Summary: This PR addresses issue #3865 and implements the following approach to fix it: - adds `MergeContext::GetOperandsDirectionForward` and `MergeContext::GetOperandsDirectionBackward` to query merge operands in a specific order - `MergeContext::GetOperands` becomes a shortcut for `MergeContext::GetOperandsDirectionForward` - pass `MergeContext::GetOperandsDirectionBackward` to `MergeOperator::ShouldMerge` and document the order Pull Request resolved: https://github.com/facebook/rocksdb/pull/4266 Differential Revision: D9360750 Pulled By: sagar0 fbshipit-source-id: 20cb73ff017760b062ecdcf4382560767086e092	2018-08-16 10:58:05 -07:00
Fenggang Wu	19ec44fd39	Improve point-lookup performance using a data block hash index (#4174 ) Summary: Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `kDataBlockBinaryAndHash`. The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio improves point-lookup efficiency more, but takes more space too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174 Differential Revision: D8965914 Pulled By: fgwu fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198	2018-08-15 14:30:03 -07:00
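
The space trade-off stated in the commit above (#4174) can be made concrete with a back-of-the-envelope helper. This is a sketch under assumptions, not the RocksDB implementation: `HashIndexBuckets` is a hypothetical name, each bucket is taken to cost one byte (which matches the "one byte per key at 1:1" statement), and any fixed per-block footer is ignored.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Estimated number of one-byte hash buckets appended to a data block.
// util_ratio is keys per bucket, so a lower ratio means more buckets
// (fewer collisions, more space). Hypothetical helper for illustration.
size_t HashIndexBuckets(size_t num_keys, double util_ratio) {
  assert(util_ratio > 0.0);
  return static_cast<size_t>(std::ceil(num_keys / util_ratio));
}
```

Under these assumptions, a block with 75 keys needs about 75 extra bytes at a ratio of 1.0 and about 100 extra bytes at a ratio of 0.75.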
Huachao Huang	d916a1105a	c-api: add some missing options Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4267 Differential Revision: D9309505 Pulled By: anand1976 fbshipit-source-id: eb9fee8037f4ff24dc1cdd5cc5ef41c231a03e1f	2018-08-13 18:42:30 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding on the values. The first value at each index restart interval encodes the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size, which helps using the block cache more efficiently. The feature is enabled by using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
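
The idea in the commit above (#3983), a full handle at each restart point and only the size in between, can be sketched as follows. This is a simplified model rather than the actual encoding in `IndexBlockIter::DecodeCurrentValue`: `Handle`, `Encode`, `Decode`, and `kAssumedTrailer` are hypothetical, values are kept as plain integers instead of varints, and the sketch assumes consecutive blocks are laid out back to back with a fixed 5-byte trailer so that each offset can be rebuilt from the previous handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a block handle: an <offset, size> pair.
struct Handle {
  uint64_t offset;
  uint64_t size;
};

// Assumption for this sketch: blocks are contiguous and each is
// followed by a fixed-size trailer of this many bytes.
constexpr uint64_t kAssumedTrailer = 5;

// At each restart point emit [offset, size]; otherwise emit [size].
std::vector<uint64_t> Encode(const std::vector<Handle>& handles,
                             size_t restart_interval) {
  assert(restart_interval > 0);
  std::vector<uint64_t> out;
  for (size_t i = 0; i < handles.size(); ++i) {
    if (i % restart_interval == 0) out.push_back(handles[i].offset);
    out.push_back(handles[i].size);
  }
  return out;
}

// Rebuild `count` handles; offsets between restart points are derived
// from the previous handle under the contiguity assumption above.
std::vector<Handle> Decode(const std::vector<uint64_t>& in, size_t count,
                           size_t restart_interval) {
  assert(restart_interval > 0);
  std::vector<Handle> handles;
  size_t pos = 0;
  for (size_t i = 0; i < count; ++i) {
    uint64_t offset;
    if (i % restart_interval == 0) {
      offset = in[pos++];
    } else {
      offset = handles.back().offset + handles.back().size + kAssumedTrailer;
    }
    const uint64_t size = in[pos++];
    handles.push_back({offset, size});
  }
  return handles;
}
```

With a restart interval of 16, three handles encode to four integers instead of six, which is where the index size reduction comes from in this model.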
Georgios Bitzes	1b813a9b2e	Make rocksdb::Slice more interoperable with std::string_view (#4242 ) Summary: This change allows using std::string_view objects directly in the DB API: db->Get(some_string_view_object, ...); The conversion from std::string_view to rocksdb::Slice is done automatically, thanks to the added constructor. I'm stopping short of adding an implicit conversion operator from rocksdb::Slice to std::string_view, as I don't think that's a good idea for PinnableSlices. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4242 Differential Revision: D9224134 Pulled By: anand1976 fbshipit-source-id: f50aad04dd0b01737907c0fb88d495c83a81f4e4	2018-08-09 14:43:34 -07:00
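
The one-way conversion described in the commit above (#4242) can be illustrated with a toy slice type. This is a sketch, not `rocksdb::Slice`: `ToySlice` and `EchoKey` are hypothetical names, and the code requires C++17 for `std::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Toy non-owning slice; hypothetical, not rocksdb::Slice.
class ToySlice {
 public:
  ToySlice(const char* d, size_t n) : data_(d), size_(n) {}

  // Deliberately implicit so a std::string_view can be passed wherever
  // a ToySlice is expected. There is no conversion operator back to
  // std::string_view, mirroring the choice described in the commit.
  ToySlice(std::string_view sv) : data_(sv.data()), size_(sv.size()) {}

  const char* data() const { return data_; }
  size_t size() const { return size_; }
  std::string ToString() const { return std::string(data_, size_); }

 private:
  const char* data_;
  size_t size_;
};

// Stand-in for an API that takes a slice key, such as a Get() call.
std::string EchoKey(const ToySlice& key) { return key.ToString(); }
```

A caller holding a `std::string_view` can then write `EchoKey(view)` directly, with no explicit conversion at the call site.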
Yanqin Jin	de7f423a82	Add SST ingestion to ldb (#4205 ) Summary: We add two subcommands `write_extern_sst` and `ingest_extern_sst` to ldb. This PR avoids changing existing code because we hope to cherry-pick to earlier releases to support compatibility check for external SST file ingestion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4205 Differential Revision: D9112711 Pulled By: riversand963 fbshipit-source-id: 7cae88380d4de86da8440230e87eca66755648e4	2018-08-09 14:29:11 -07:00
Huachao Huang	badfd70a3e	types: add kEntryBlobIndex for TablePropertiesCollector (#4233 ) Summary: So that we can act accordingly on blob index entries Pull Request resolved: https://github.com/facebook/rocksdb/pull/4233 Differential Revision: D9190205 Pulled By: yiwu-arbug fbshipit-source-id: e5b84d5b41e44fa7a76762f1f7b0305369bb3a0c	2018-08-06 18:27:44 -07:00
Gustav Davidsson	a15354d04e	Expose GetTotalTrashSize in SstFileManager interface (#4206 ) Summary: Hi, it would be great if we could expose this API, so that LogDevice can use it to track the total size of trash files and alarm if it grows too large in relation to disk size. There's probably other customers that would be interested in this as well. :) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4206 Differential Revision: D9115516 Pulled By: gdavidsson fbshipit-source-id: f34993a940e39cb0a0b544ae8298546499b7e047	2018-08-04 17:57:48 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Fenggang Wu	ee7617167f	DataBlockHashIndex: Specify that DataBlockHashIndex is not yet implemented in the comment Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4203 Differential Revision: D9090912 Pulled By: fgwu fbshipit-source-id: 6a68be83693ddf2a5c060290382141f0d2fb400b	2018-07-31 11:43:08 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
Fenggang Wu	a11df583ec	Add DataBlockIndexType option in BlockBasedTableOptions (#4150 ) Summary: Added DataBlockIndexType option in BlockBasedTableOptions. ``` enum DataBlockIndexType : char { kDataBlockBinarySearch = 0, // traditional block type kDataBlockHashIndex = 1, // additional hash index appended to the end. }; ``` The default type is the traditional binary seek option: `kDataBlockBinarySearch`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150 Differential Revision: D8895958 Pulled By: fgwu fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10	2018-07-27 15:42:27 -07:00
Maysam Yabandeh	c33b32671e	Correct description of GetColumnFamilyMetaData (#4196 ) Summary: The inline doc incorrectly mentioned a return status while the function does not return a value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4196 Differential Revision: D9030927 Pulled By: maysamyabandeh fbshipit-source-id: 07c34dc6bf521021bf790ac1bfedb676171129ec	2018-07-27 11:42:37 -07:00
Maysam Yabandeh	e0906eb785	Clarify max_total_wal_size's scope (#4194 ) Summary: max_total_wal_size takes effect only when there is more than one column family. The patch clarifies that in the inline docs. Closes https://github.com/facebook/rocksdb/issues/4180 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4194 Differential Revision: D9028767 Pulled By: maysamyabandeh fbshipit-source-id: 8d730ca7f15e76e7ee9ff88b2b48030b2d1b7078	2018-07-27 09:29:44 -07:00
Yanqin Jin	18f538038a	Increase version number to 5.16 (#4176 ) Summary: Given that we have cut 5.15, we should bump the version number to the next version, i.e. 5.16. Also update HISTORY.md cc sagar0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4176 Differential Revision: D8977965 Pulled By: riversand963 fbshipit-source-id: 481d75d2f446946f0eb2afb7e94ef894c8c87e1e	2018-07-24 13:43:33 -07:00
Manuel Ung	ea212e5316	WriteUnPrepared: Implement unprepared batches for transactions (#4104 ) Summary: This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch. Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`. A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104 Differential Revision: D8785717 Pulled By: lth fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91	2018-07-24 00:13:18 -07:00
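
The threshold check described in the commit above (#4104) can be sketched with a toy transaction. This is an illustrative model only: `ToyTxn` and its members are hypothetical, byte accounting is reduced to key plus value length, and treating a threshold of 0 as "disabled" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy transaction that spills its write batch as an "unprepared batch"
// once the batch has grown past a size threshold. Hypothetical model,
// not the WriteUnprepared transaction class.
class ToyTxn {
 public:
  explicit ToyTxn(size_t max_write_batch_size)
      : max_write_batch_size_(max_write_batch_size) {}

  void Put(const std::string& key, const std::string& value) {
    // Check first whether the current batch has exceeded the threshold,
    // and only then apply the new write, as the commit describes.
    MaybeFlushUnprepared();
    batch_bytes_ += key.size() + value.size();
  }

  size_t unprepared_batches() const { return unprepared_batches_; }
  size_t pending_bytes() const { return batch_bytes_; }

 private:
  void MaybeFlushUnprepared() {
    // Assumption: a threshold of 0 disables spilling in this sketch.
    if (max_write_batch_size_ > 0 && batch_bytes_ > max_write_batch_size_) {
      ++unprepared_batches_;  // stands in for writing the batch to the DB
      batch_bytes_ = 0;
    }
  }

  size_t max_write_batch_size_;
  size_t batch_bytes_ = 0;
  size_t unprepared_batches_ = 0;
};
```

Because the check runs before each modification, a batch can exceed the threshold by the size of one write before it is spilled.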
Chang Su	374c37da5b	move static msgs out of Status class (#4144 ) Summary: The member msgs of class Status contains all types of status messages. When users dump a Status object, msgs will confuse users. So move it out of class Status by making it a file-local static variable. Closes #3831 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/4144 Differential Revision: D8941419 Pulled By: sagar0 fbshipit-source-id: 56b0510258465ff26db15aa6b04e01532e053e3d	2018-07-23 15:44:16 -07:00
Yanqin Jin	79f009f22e	Release 5.15. (#4148 ) Summary: Cut 5.15.fb Pull Request resolved: https://github.com/facebook/rocksdb/pull/4148 Differential Revision: D8886802 Pulled By: riversand963 fbshipit-source-id: 6b6427ce97f5b323a7eebf92458fda8b24b0cece	2018-07-17 21:44:51 -07:00
Siying Dong	ddc07b40fc	Remove managed iterator Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124 Differential Revision: D8829910 Pulled By: siying fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560	2018-07-17 14:43:18 -07:00
Sagar Vemuri	991120fa10	Allow ttl to be changed dynamically (#4133 ) Summary: Allow ttl to be changed dynamically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4133 Differential Revision: D8845440 Pulled By: sagar0 fbshipit-source-id: c8c87ae643b3a8c4123e4c037c4645efc094a2d3	2018-07-16 14:27:53 -07:00
Nathan VanBenschoten	ef7815b803	Support range deletion tombstones in IngestExternalFile SSTs (#3778 ) Summary: Fixes #3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778 Differential Revision: D8821836 Pulled By: anand1976 fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844	2018-07-13 22:43:09 -07:00
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	2018-07-13 10:58:06 -07:00
Siying Dong	35b38a232c	Update comments of WriteBatchWithIndex Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4113 Differential Revision: D8814172 Pulled By: siying fbshipit-source-id: cabc31db2c74803af9b2f99329155a1086eb1b22	2018-07-11 17:42:50 -07:00
Siying Dong	926f3a78a6	In delete scheduler, before ftruncate file for slow delete, check whether there are other hard links (#4093 ) Summary: Right now slow deletion with ftruncate doesn't work well with checkpoints because it ruins hard-linked files in checkpoints. To fix it, check that the file has no other hard links before calling ftruncate on it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4093 Differential Revision: D8730360 Pulled By: siying fbshipit-source-id: 756eea5bce8a87b9a2ea3a5bfa190b2cab6f75df	2018-07-09 15:28:12 -07:00
Manuel Ung	b9846370e9	WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078 ) Summary: This adds support for recovering WriteUnprepared transactions through the following changes: - The information in `RecoveredTransaction` is extended so that it can reference multiple batches. - `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not. - `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway. Commit/Rollback of live transactions is still unimplemented and will come later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078 Differential Revision: D8703382 Pulled By: lth fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a	2018-07-06 17:59:13 -07:00
Yanqin Jin	d4d9fe8e57	Fix a bug caused by not copying the block trailer. (#4096 ) Summary: This was caught by crash test, and the following is a simple way to reproduce it and verify the fix. One way to trigger this code path is to use the following configuration: - Compress SST file - Enable direct IO and prefetch buffer - Do NOT use compressed block cache Closes https://github.com/facebook/rocksdb/pull/4096 Differential Revision: D8742009 Pulled By: riversand963 fbshipit-source-id: f13381078bbb0dce92f60bd313a78ab602bcacd2	2018-07-06 13:12:39 -07:00
Siying Dong	17027aeffc	Change default value of `bytes_max_delete_chunk` to 0 in NewSstFileManager() (#4092 ) Summary: Now by default, with NewSstFileManager, checkpoints may be corrupted. Disable this feature to avoid this issue. Closes https://github.com/facebook/rocksdb/pull/4092 Differential Revision: D8729856 Pulled By: siying fbshipit-source-id: 914c321d6eaf52d8c5981171322d85dd29088307	2018-07-03 17:57:37 -07:00
Manuel Ung	8ad63a4b86	WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069 ) Summary: This adds a new WAL marker of type kTypeBeginUnprepareXID. Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy. Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not. Closes https://github.com/facebook/rocksdb/pull/4069 Differential Revision: D8675099 Pulled By: lth fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749	2018-06-28 18:58:29 -07:00
Anand Ananthabhotla	52d4c9b7f6	Allow DB resume after background errors (#3997 ) Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd	2018-06-28 12:34:40 -07:00
Yanqin Jin	26d67e357e	Support group commits of version edits (#3944 ) Summary: This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Closes https://github.com/facebook/rocksdb/pull/3944 Differential Revision: D8432536 Pulled By: riversand963 fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb	2018-06-28 12:34:39 -07:00
Zhichao Cao	1f6efabe23	Add bottommost_compression_opts to for bottommost_compression (#3985 ) Summary: For `CompressionType` we have the options `compression` and `bottommost_compression`. Thus, to make the compression options consistent with the compression type when bottommost_compression is enabled, we add bottommost_compression_opts. Closes https://github.com/facebook/rocksdb/pull/3985 Reviewed By: riversand963 Differential Revision: D8385911 Pulled By: zhichao-cao fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc	2018-06-27 17:42:38 -07:00
chouxi	818c84e116	Store timestamp in deadlock detection (#4060 ) Summary: Add a timestamp to DeadlockInfo to record when the deadlock was detected on the RocksDB side. Test plan: `make check -j64` Closes https://github.com/facebook/rocksdb/pull/4060 Differential Revision: D8655380 Pulled By: chouxi fbshipit-source-id: f58e1aa5e09eb1d1eed0a181d4e2304aaf01efe8	2018-06-27 12:27:58 -07:00
Daniel Black	e5ae1bb465	Remove bogus gcc-8.1 warning (#3870 ) Summary: Various rearrangements of the cch maths, and replacing = '\0' with memset, failed to convince the compiler that the string was nul terminated. So took the perverse option of changing strncpy to strcpy. Return null if memory couldn't be allocated. util/status.cc: In static member function ‘static const char* rocksdb::Status::CopyState(const char*)’: util/status.cc:28:15: error: ‘char* strncpy(char*, const char*, size_t)’ output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation] std::strncpy(result, state, cch - 1); ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~ util/status.cc:19:18: note: length computed here std::strlen(state) + 1; // +1 for the null terminator ~~~~~~~~~~~^~~~~~~ cc1plus: all warnings being treated as errors make: *** [Makefile:645: shared-objects/util/status.o] Error 1 closes #2705 Closes https://github.com/facebook/rocksdb/pull/3870 Differential Revision: D8594114 Pulled By: anand1976 fbshipit-source-id: ab20f3a456a711e4d29144ebe630e4fe3c99ec25	2018-06-27 12:23:07 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. These are the places where reads had to be modified. - Get and Seek/Next were just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Nikhil Benesch	17339dc2f3	Add table property tracking number of range deletions (#4016 ) Summary: Add a new table property, rocksdb.num.range-deletions, which tracks the number of range deletions in a block-based table. Range deletions are no longer counted in rocksdb.num.entries; as discovered in PR #3778, there are various code paths that implicitly assume that rocksdb.num.entries counts only true keys, not range deletions. /cc ajkr nvanbenschoten Closes https://github.com/facebook/rocksdb/pull/4016 Differential Revision: D8527575 Pulled By: ajkr fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7	2018-06-26 20:27:35 -07:00
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	2018-06-26 15:57:26 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: The top-level index in partitioned index/filter blocks is small and could be pinned in memory. So far we have achieved that by setting cache_index_and_filter_blocks to false. This, however, makes it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter, which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pins it to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
zhichao-cao	3fbc865cd5	Add kOptionsStatistics to GetProperty() (#3966 ) Summary: Add a new DB property to DB::GetProperty(), which returns the option.statistics. Test is updated to pass. Closes https://github.com/facebook/rocksdb/pull/3966 Differential Revision: D8311139 Pulled By: zhichao-cao fbshipit-source-id: ea78f4727358c807b0e5a0ea62e09defb10ad9ac	2018-06-15 17:28:01 -07:00
straw	89b37081a1	add c api rocksdb_sstfilewriter_file_size Summary: Closes https://github.com/facebook/rocksdb/pull/3922 Differential Revision: D8208528 Pulled By: ajkr fbshipit-source-id: d384fe53cf526f2aadc7b79a423ce36dbd3ff224	2018-06-01 09:43:59 -07:00
QingpingWang	2807678b11	c api set bottommost level compaction Summary: Closes https://github.com/facebook/rocksdb/pull/3928 Differential Revision: D8224962 Pulled By: ajkr fbshipit-source-id: 3caf463509a935bff46530f27232a85ae7e4e484	2018-05-31 17:30:50 -07:00
Maysam Yabandeh	402b7aa07f	Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys, therefore, like the keys in the data blocks, are internal keys, which means that in addition to the user key they also have 8 bytes that encode the sequence number and value type. These extra 8 bytes, however, are not necessary in index blocks, since the index keys act as a separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which case the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index blocks. Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd	2018-05-25 18:42:43 -07:00
QingpingWang	070319f7bb	add flush_before_backup parameter to c api rocksdb_backup_engine_create_new_backup Summary: Add flush_before_backup to rocksdb_backup_engine_create_new_backup. make c api able to control the flush before backup behavior. Closes https://github.com/facebook/rocksdb/pull/3897 Differential Revision: D8157676 Pulled By: ajkr fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505	2018-05-24 22:28:52 -07:00
Yi Wu	bc7e8d472e	LRUCache midpoint insertion Summary: Implement a midpoint insertion strategy where new blocks are inserted into the middle of the LRU list, then moved to the head on the first cache hit. Closes https://github.com/facebook/rocksdb/pull/3877 Differential Revision: D8100895 Pulled By: yiwu-arbug fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da	2018-05-24 15:57:33 -07:00
Dmitri Smirnov	3db8504cde	Catchup with posix features Summary: Catch up with Posix features: - NewWritableRWFile must fail when the file does not exist - Implement Env::Truncate() - Adjust Env options optimization functions - Implement MemoryMappedBuffer on Windows. Closes https://github.com/facebook/rocksdb/pull/3857 Differential Revision: D8053610 Pulled By: ajkr fbshipit-source-id: ccd0d46c29648a9f6f496873bc1c9d6c5547487e	2018-05-24 15:13:04 -07:00
Andrew Kryczka	01bcc34896	Introduce library-independent default compression level Summary: Previously we were using -1 as the default for every library, which was legacy from our zlib options. That worked for a while, but after zstd introduced `a146ee04ae`, it started giving poor compression ratios by default in zstd. This PR adds a constant to RocksDB public API, `CompressionOptions::kDefaultCompressionLevel`, which will get translated to the default value specific to the compression library being used in "util/compression.h". The constant uses a number that appears to be larger than any library's maximum compression level. Closes https://github.com/facebook/rocksdb/pull/3895 Differential Revision: D8125780 Pulled By: ajkr fbshipit-source-id: 2db157a89118cd4f94577c2f4a0a5ff31c8391c6	2018-05-23 18:42:08 -07:00
Jacquin Mininger	4420cb49da	Fix Issue #3771 : Slice ctor checks for nullptr and creates empty string Summary: Fix Issue #3771 : check for nullptr in the Slice constructor. The Slice ctor checks for nullptr and creates an empty string if the string does not exist. Closes https://github.com/facebook/rocksdb/pull/3887 Differential Revision: D8098852 Pulled By: ajkr fbshipit-source-id: 04471077defa9776ce7b8c389a61312ce31002fb	2018-05-22 13:41:56 -07:00
Yanqin Jin	263ef52b65	Update ColumnFamilyTest for multi-CF verification Summary: Change `keys_` from `set<string>` to `vector<set<string>>` so that each column family's keys are stored in one set. ajkr When you have a chance, can you PTAL? Thanks! Closes https://github.com/facebook/rocksdb/pull/3871 Differential Revision: D8056447 Pulled By: riversand963 fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386	2018-05-21 11:57:42 -07:00
Andrew Kryczka	508a09fd62	Print histogram count and sum in statistics string Summary: Previously it only printed percentiles, even though our histogram keeps track of count and sum (and more). There have been many times we want to know more than the percentiles. For example, we currently want sum of "rocksdb.compression.times.nanos" and sum of "rocksdb.decompression.times.nanos", which would allow us to know the relative cost of compression vs decompression. This PR adds count and sum to the string printed by `StatisticsImpl::ToString`. This is a bit risky as there are definitely parsers assuming the old format. I will mention it in HISTORY.md and hope for the best... Closes https://github.com/facebook/rocksdb/pull/3863 Differential Revision: D8038831 Pulled By: ajkr fbshipit-source-id: 0465b72e4b0cbf18ef965f4efe402601d16d5b5c	2018-05-21 11:12:47 -07:00
Yanqin Jin	a0c7b4d526	Set the default value of max_manifest_file_size. Summary: In the past, the default value of max_manifest_file_size was uint64_t::MAX, allowing a long-running RocksDB process to grow its MANIFEST file until it took up the entire disk, as reported in [issue 3851](https://github.com/facebook/rocksdb/issues/3851). It is reasonable and common to provide a default non-max value for this option. Therefore, I set the value to 1GB. siying miasantreble Please let me know whether this looks good to you. Thanks! Closes https://github.com/facebook/rocksdb/pull/3867 Differential Revision: D8051524 Pulled By: riversand963 fbshipit-source-id: 50251f0804b1fa933a19a30d19d261ea8b9d2b72	2018-05-18 08:11:55 -07:00
Fosco Marotto	fa43948cbc	Update HISTORY and version for upcoming 5.14 Summary: Closes https://github.com/facebook/rocksdb/pull/3866 Differential Revision: D8043563 Pulled By: gfosco fbshipit-source-id: da4af20e604534602ac0e07943135513fd9a9f53	2018-05-17 14:27:17 -07:00
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. 
It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name in anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. 
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	2018-05-17 02:56:56 -07:00
Maysam Yabandeh	66c7aa32fb	Clarify the ownership of root db after TransactionDB::Open Summary: The patch clarifies the ownership of the root db after TransactionDB::Open. If it is a success, the ownership is with the TransactionDB, and the root db will be deleted when the destructor of the base class, StackableDB, is called. If it is a failure, the temporarily created root db will also be deleted properly. The patch also includes lots of useful formatting changes. Closes https://github.com/facebook/rocksdb/pull/3714 upon which this patch is built. Closes https://github.com/facebook/rocksdb/pull/3806 Differential Revision: D7878010 Pulled By: maysamyabandeh fbshipit-source-id: f54f3942e29434143ae5a2423ceec9c7072cd4c2	2018-05-11 15:14:03 -07:00
Andrew Kryczka	072ae671a7	Apply use_direct_io_for_flush_and_compaction to writes only Summary: Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used. This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously. Closes https://github.com/facebook/rocksdb/pull/3829 Differential Revision: D7915443 Pulled By: ajkr fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279	2018-05-09 19:42:58 -07:00
Andrew Kryczka	46152d53bf	Second attempt at db_stress crash-recovery verification Summary: - Original commit: `a4fb1f8c04` - Revert commit (we reverted as a quick fix to get crash tests passing): `6afe22db2e` This PR includes the contents of the original commit plus two bug fixes, which are: - In whitebox crash test, only set `--expected_values_path` for `db_stress` runs in the first half of the crash test's duration. In the second half, a fresh DB is created for each `db_stress` run, so we cannot maintain expected state across `db_stress` runs. - Made `Exists()` return true for `UNKNOWN_SENTINEL` values. I previously had an assert in `Exists()` that value was not `UNKNOWN_SENTINEL`. But it is possible for post-crash-recovery expected values to be `UNKNOWN_SENTINEL` (i.e., if the crash happens in the middle of an update), in which case this assertion would be tripped. The effect of returning true in this case is there may be cases where a `SingleDelete` deletes no data. But if we had returned false, the effect would be calling `SingleDelete` on a key with multiple older versions, which is not supported. Closes https://github.com/facebook/rocksdb/pull/3793 Differential Revision: D7811671 Pulled By: ajkr fbshipit-source-id: 67e0295bfb1695ff9674837f2e05bb29c50efc30	2018-04-30 12:27:34 -07:00
Vincent Lee	282099fc0f	fix missing perfcontext destroy declare in C API Summary: The declaration of `rocksdb_perfcontext_destroy` is missing from the C API. Closes https://github.com/facebook/rocksdb/pull/3787 Differential Revision: D7816490 Pulled By: ajkr fbshipit-source-id: 3a488607bfc897c7ce846a1b3c2b7af693134d0d	2018-04-30 11:43:09 -07:00
Victor Grishchenko	c9ace1d81b	expose WAL iterator in the C API Summary: A minor change: I wrapped TransactionLogIterator for the C API. I needed that for the golang binding. Closes https://github.com/facebook/rocksdb/pull/3304 Differential Revision: D6628736 Pulled By: miasantreble fbshipit-source-id: 3374f3c64b1d7b225696b8767090917761e2f30a	2018-04-27 16:56:59 -07:00
Andrew Kryczka	6afe22db2e	revert db_stress crash-recovery verification Summary: crash-recovery verification is failing in the whitebox testing, which may or may not be a valid correctness issue -- need more time to investigate. In the meantime, reverting so we don't mask other failures. Closes https://github.com/facebook/rocksdb/pull/3786 Differential Revision: D7794516 Pulled By: ajkr fbshipit-source-id: 28ccdfdb9ec9b3b0fb08c15cbf9d2e282201ff33	2018-04-27 12:57:01 -07:00
Huachao Huang	ed7a95b28c	Add max_subcompactions as a compaction option Summary: Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default. I add a `max_subcompactions` option to `CompactionOptions` so that we can choose a proper concurrency dynamically. Closes https://github.com/facebook/rocksdb/pull/3775 Differential Revision: D7792357 Pulled By: ajkr fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c	2018-04-27 11:57:39 -07:00
Nathan VanBenschoten	37cd617b6b	Add virtual Truncate method to Env Summary: This change adds a virtual `Truncate` method to `Env`, which truncates the named file to the specified size. At the moment, this is only supported for `MockEnv`, but other `Env`s could be extended to override the method too. This is the same approach that methods like `LinkFile` and `AreSameFile` have taken. This is useful for any user of the in-memory `Env`. The implementation's header is not exported, so before this change, it was impossible to access its already existing `Truncate` method. Closes https://github.com/facebook/rocksdb/pull/3779 Differential Revision: D7785789 Pulled By: ajkr fbshipit-source-id: 3bcdaeea7b7180529f7d9b496dc67b791a00bbf0	2018-04-26 21:12:51 -07:00
Anand Ananthabhotla	406b95197c	Fix clang build failure with -Wgnu-redeclared-enum Summary: In include/rocksdb/db.h, enum EntryType is redeclared even though the original declaration in types.h is included. Closes https://github.com/facebook/rocksdb/pull/3766 Differential Revision: D7765504 Pulled By: anand1976 fbshipit-source-id: 622a8ecb306993915be1b9dd5cdd79dbc6a4ea05	2018-04-25 15:42:46 -07:00
Andrew Kryczka	a4fb1f8c04	Add crash-recovery correctness check to db_stress Summary: Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values. In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`. For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class. `db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup. Closes https://github.com/facebook/rocksdb/pull/3629 Differential Revision: D7463144 Pulled By: ajkr fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b	2018-04-24 15:58:22 -07:00
Gabriel Wicke	090c78a0d7	Support lowering CPU priority of background threads Summary: Background activities like compaction can negatively affect latency of higher-priority tasks like request processing. To avoid this, rocksdb already lowers the IO priority of background threads on Linux systems. While this takes care of typical IO-bound systems, it does not help much when CPU (temporarily) becomes the bottleneck. This is especially likely when using more expensive compression settings. This patch adds an API to allow for lowering the CPU priority of background threads, modeled on the IO priority API. Benchmarks (see below) show significant latency and throughput improvements when CPU bound. As a result, workloads with some CPU usage bursts should benefit from lower latencies at a given utilization, or should be able to push utilization higher at a given request latency target. A useful side effect is that compaction CPU usage is now easily visible in common tools, allowing for an easier estimation of the contribution of compaction vs. request processing threads. As with IO priority, the implementation is limited to Linux, degrading to a no-op on other systems. Closes https://github.com/facebook/rocksdb/pull/3763 Differential Revision: D7740096 Pulled By: gwicke fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c	2018-04-24 08:41:51 -07:00
Mike Kolupaev	affe01b0d5	Improve write time breakdown stats Summary: There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes. These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing. Closes https://github.com/facebook/rocksdb/pull/3602 Differential Revision: D7251562 Pulled By: al13n321 fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d	2018-04-23 17:58:54 -07:00
Anand Ananthabhotla	dbdaa4662e	Add a stat for MultiGet keys found, update memtable hit/miss stats Summary: 1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the number of keys successfully read 2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in DBImpl::GetImpl(), but not MultiGet Closes https://github.com/facebook/rocksdb/pull/3730 Differential Revision: D7677364 Pulled By: anand1976 fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5	2018-04-20 15:28:19 -07:00
Maysam Yabandeh	17e04039dd	Propagate fill_cache config to partitioned index iterator Summary: Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set in the ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of the partitioned index iterator. It also clarifies the contract of fill_cache: i) it does not apply to filters, ii) it still charges the block cache for the size of the data block, and iii) it still pins the block if it is already in the block cache. Closes https://github.com/facebook/rocksdb/pull/3739 Differential Revision: D7678308 Pulled By: maysamyabandeh fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da	2018-04-20 15:13:05 -07:00
Yi Wu	ad511684b2	Add block cache related DB properties Summary: Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage. Closes https://github.com/facebook/rocksdb/pull/3734 Differential Revision: D7657180 Pulled By: yiwu-arbug fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd	2018-04-18 21:42:25 -07:00
Andrew Kryczka	3cea61392f	include thread-pool priority in thread names Summary: Previously threads were named "rocksdb:bg<index in thread pool>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to bottom-level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority. Closes https://github.com/facebook/rocksdb/pull/3702 Differential Revision: D7581415 Pulled By: ajkr fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1	2018-04-18 17:27:56 -07:00
Harry Wong	b4f333922a	Improve the comment on TableFactory::NewTableReader() Summary: `DBImpl::AddFile()` has been replaced by `DBImpl::IngestExternalFile()`. Closes https://github.com/facebook/rocksdb/pull/3726 Differential Revision: D7646875 Pulled By: ajkr fbshipit-source-id: 241eb7a8d88527fdc5c26b0c3f6faec3296451f8	2018-04-16 16:58:20 -07:00
Jingguo Yao	81d44f2bc5	fix-typo: add missing periods Summary: Closes https://github.com/facebook/rocksdb/pull/3720 Differential Revision: D7631525 Pulled By: ajkr fbshipit-source-id: 50cf4dc363b0d32b150d963011171a8a6f53a384	2018-04-15 13:12:23 -07:00
Xiaofei Du	a0102aa6d7	Make database files' permissions configurable Summary: Closes https://github.com/facebook/rocksdb/pull/3709 Differential Revision: D7610227 Pulled By: xiaofeidu008 fbshipit-source-id: 88a52f0f9f96e2195fccde995cf9760b785e9f07	2018-04-13 13:13:04 -07:00
zhangjinpeng1987	31ee4bf240	add kEntryRangeDeletion Summary: When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up reads. After this change, we can collect information about range deletions and store it in user properties, which can guide our manual compaction. Closes https://github.com/facebook/rocksdb/pull/3695 Differential Revision: D7570322 Pulled By: ajkr fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369	2018-04-13 11:27:17 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Maysam Yabandeh	d15397ba10	WritePrepared Txn: rollback_merge_operands hack Summary: This is a hack that serves as a temporary fix for MyRocks when rolling back merge operands. MyRocks uses merge operands without the protection of locks, which violates the assumption behind the rollback algorithm. They are ok with the operands not being rolled back, as it would just create a gap in the autoincrement column. The hack adds an option that disables the rollback of merge operands by default and only enables it to let the unit test pass. Closes https://github.com/facebook/rocksdb/pull/3711 Differential Revision: D7597177 Pulled By: maysamyabandeh fbshipit-source-id: 544be0f666c7e7abb7f651ec8b23124e05056728	2018-04-12 11:58:11 -07:00
Yanqin Jin	d42bd041c5	Improve visibility into the reasons for compaction. Summary: Add `compaction_reason` as part of event log for event `compaction started`. Add counters for each `CompactionReason`. Closes https://github.com/facebook/rocksdb/pull/3679 Differential Revision: D7550348 Pulled By: riversand963 fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d	2018-04-11 10:58:44 -07:00
Yanqin Jin	d95014b9df	fix some text in comments. Summary: 1. Remove redundant text. 2. Make terminology consistent across all comments and doc of RocksDB. Also do our best to conform to conventions. Specifically, use 'callback' instead of 'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)). Closes https://github.com/facebook/rocksdb/pull/3693 Differential Revision: D7560396 Pulled By: riversand963 fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8	2018-04-10 15:59:24 -07:00
Maysam Yabandeh	bde1c1a72a	WritePrepared Txn: add stats Summary: Adding some stats that would be helpful to monitor whether the DB has entered unlikely states that would hurt performance. These are mostly cases where we end up needing to acquire a mutex. Closes https://github.com/facebook/rocksdb/pull/3683 Differential Revision: D7529393 Pulled By: maysamyabandeh fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9	2018-04-07 21:56:42 -07:00
Maysam Yabandeh	eb5a295440	WritePrepared Txn: add write_committed option to dump_wal Summary: Currently dump_wal cannot print the prepared records from the WAL that is generated by WRITE_PREPARED write policy since the default reaction of the handler is to return NotSupported if markers of WRITE_PREPARED are encountered. This patch enables the admin to pass --write_committed=false option, which will be accordingly passed to the handler. Note that DBFileDumperCommand and DBDumperCommand are still not updated by this patch but firstly they are not urgent and secondly we need to revise this approach later when we also add WRITE_UNPREPARED markers so I leave it for future work. Tested by running it on a WAL generated by WRITE_PREPARED: $ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log \| grep BEGIN_PREARE \| head -1 1,2,70,0,BEGIN_PREARE $ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log --write_committed=false \| grep BEGIN_PREARE \| head -1 1,2,70,0,BEGIN_PREARE PUT(0) : 0x30303031313330313938 PUT(0) : 0x30303032353732313935 END_PREPARE(0x74786E31313535383434323738303738363938313335312D30) Closes https://github.com/facebook/rocksdb/pull/3682 Differential Revision: D7522090 Pulled By: maysamyabandeh fbshipit-source-id: a0332207261c61e18b2f9dfbe9feecd9a1339aca	2018-04-07 21:56:42 -07:00

1636 commits