rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-26 07:30:54 +00:00

Author	SHA1	Message	Date
Yi Wu	3cf562be31	Fix IOError on WAL write doesn't propagate to write group follower Summary: This is a simpler version of #3097 by removing all unrelated changes. Fixing the bug where concurrent writes may get Status::OK while it actually gets IOError on WAL write. This happens when multiple writes form a write batch group, and the leader get an IOError while writing to WAL. The leader failed to pass the error to followers in the group, and the followers end up returning Status::OK() while actually writing nothing. The bug only affect writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError. Closes https://github.com/facebook/rocksdb/pull/3201 Differential Revision: D6421644 Pulled By: yiwu-arbug fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e	2017-11-28 11:42:48 -08:00
Andrew Kryczka	1bdb44de95	optimize file ingestion checks for range deletion overlap Summary: Before we were checking every file in the level which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened. Closes https://github.com/facebook/rocksdb/pull/3179 Differential Revision: D6358125 Pulled By: ajkr fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279	2017-11-28 11:27:02 -08:00
Griffin Smith	2f09524762	Expose all remaining read and write options via the C API Summary: Expose read and write options via the C API Closes https://github.com/facebook/rocksdb/pull/3185 Differential Revision: D6389658 Pulled By: sagar0 fbshipit-source-id: 1848912750329a476805b3cb2f315e7b71f61472	2017-11-28 10:28:46 -08:00
Maysam Yabandeh	e59cb2a19b	Add seq_per_batch to WriteWithCallbackTest Summary: Augment WriteWithCallbackTest to also test when seq_per_batch is true. Closes https://github.com/facebook/rocksdb/pull/3195 Differential Revision: D6398143 Pulled By: maysamyabandeh fbshipit-source-id: 7bc4218609355ec20fed25df426a8455ec2390d3	2017-11-22 13:56:44 -08:00
Zhongyi Xie	5fac4729cc	make compaction_readahead_size_ thread safe Summary: this should fix the failing tsan_check Closes https://github.com/facebook/rocksdb/pull/3192 Differential Revision: D6390004 Pulled By: miasantreble fbshipit-source-id: 6cadfc6f68febb1a77b0abcdb5416570dad926a5	2017-11-21 20:11:38 -08:00
anand1976	d394a6bb48	Add a ticker stat for number of keys skipped during iteration Summary: This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the number of internal keys skipped during iteration. Keys can be skipped due to deletes, or lower sequence number, or higher sequence number than the one requested. Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary, padding becomes a zero size array, which the Windows compiler doesn't like. So add a cacheline worth of padding in that case to keep it happy. We cannot conditionally add padding as gcc doesn't allow using sizeof in preprocessor directives. Closes https://github.com/facebook/rocksdb/pull/3177 Differential Revision: D6353897 Pulled By: anand1976 fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2	2017-11-20 21:26:37 -08:00
Gustav Davidsson	2d04ed65e4	Make trash-to-DB size ratio limit configurable Summary: Allow users to configure the trash-to-DB size ratio limit, so that ratelimits for deletes can be enforced even when larger portions of the database are being deleted. Closes https://github.com/facebook/rocksdb/pull/3158 Differential Revision: D6304897 Pulled By: gdavidsson fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3	2017-11-17 11:58:17 -08:00
Zhongyi Xie	32e31d49d1	Make DBOption compaction_readahead_size dynamic Summary: Closes https://github.com/facebook/rocksdb/pull/3004 Differential Revision: D6056141 Pulled By: miasantreble fbshipit-source-id: 56df1630f464fd56b07d25d38161f699e0528b7f	2017-11-16 17:57:25 -08:00
Maysam Yabandeh	54b43563be	WritePrepared Txn: Refactoring WriteCallback Summary: Refactor the logic around WriteCallback in the write path to clarify when and how exactly we advance the sequence number and making sure it is consistent across the code. Closes https://github.com/facebook/rocksdb/pull/3168 Differential Revision: D6324312 Pulled By: maysamyabandeh fbshipit-source-id: 9a34f479561fdb2a5d01ef6d37a28908d03bbe33	2017-11-15 08:27:06 -08:00
Maysam Yabandeh	53863b76f9	WritePrepared Txn: fix bug with Rollback seq Summary: The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it. Closes https://github.com/facebook/rocksdb/pull/3157 Differential Revision: D6304291 Pulled By: maysamyabandeh fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401	2017-11-15 08:27:06 -08:00
Maysam Yabandeh	175d5d6a9e	Properly destruct rebuilding_trx_ Summary: When testing rebuilding_trx_ in MemTableInserter might still be set before the tests finishes which would cause ASAN alarms for leaks. This patch deletes the pointers in MemTableInserter destructor. Closes https://github.com/facebook/rocksdb/pull/3162 Differential Revision: D6317113 Pulled By: maysamyabandeh fbshipit-source-id: a68be70709a4fff7ac2b768660119311968f9c21	2017-11-14 08:56:50 -08:00
Maysam Yabandeh	2515266725	WritePrepared Txn: Refactoring TrackKeys Summary: This patch clarifies and refactors the logic around tracked keys in transactions. Closes https://github.com/facebook/rocksdb/pull/3140 Differential Revision: D6290258 Pulled By: maysamyabandeh fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1	2017-11-11 13:14:20 -08:00
Maysam Yabandeh	2edc92bc28	WritePrepared Txn: cross-compatibility test Summary: Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables Closes https://github.com/facebook/rocksdb/pull/3118 Differential Revision: D6227247 Pulled By: maysamyabandeh fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7	2017-11-11 11:28:37 -08:00
Maysam Yabandeh	857adf388f	WritePrepared Txn: Refactor conf params Summary: Summary of changes: - Move seq_per_batch out of Options - Rename concurrent_prepare to two_write_queues - Add allocate_seq_only_for_data_ Closes https://github.com/facebook/rocksdb/pull/3136 Differential Revision: D6304458 Pulled By: maysamyabandeh fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c	2017-11-10 17:28:12 -08:00
Dmitri Smirnov	f8e2db0717	Fix crashes, address test issues and adjust windows test script Summary: Add per-exe execution capability Add fix parsing of groups/tests Add timer test exclusion Fix unit tests Ifdef threadpool specific tests that do not pass on Vista threadpool. Remove spurious outout from prefix_test so test case listing works properly. Fix not using standard test directories results in file creation errors in sst_dump_test. BlobDb fixes: In C++ end() iterators can not be dereferenced. They are not valid. When deleting blob_db_ set it to nullptr before any other code executes. Not fixed:. On Windows you can not delete a file while it is open. [ RUN ] BlobDBTest.ReadWhileGC d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options) IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options) IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied write_batch Should not call front() if there is a chance the container is empty Closes https://github.com/facebook/rocksdb/pull/3152 Differential Revision: D6293274 Pulled By: sagar0 fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956	2017-11-10 10:41:57 -08:00
Shaohua Li	eefd75a228	Stream Summary: Add a simple policy for NVMe write time life hint Closes https://github.com/facebook/rocksdb/pull/3095 Differential Revision: D6298030 Pulled By: shligit fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d	2017-11-10 09:26:24 -08:00
kapitan-k	f1c5eaba56	updated c ingestexternalfileoptions for ingest behind Summary: Closes https://github.com/facebook/rocksdb/pull/3151 Differential Revision: D6293861 Pulled By: ajkr fbshipit-source-id: f8db0a71509d1cd8237f2d377bf9e1bb0464bdbf	2017-11-09 18:15:09 -08:00
Andrew Kryczka	93f69cb93a	use bottommost compression when base level is bottommost Summary: The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen: - full compaction generates files with `bottommost_compression` type - now base level is bottommost level since all files are in the same level - any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level I changed the code to make bottommost compression apply to base level. Closes https://github.com/facebook/rocksdb/pull/3141 Differential Revision: D6264614 Pulled By: ajkr fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82	2017-11-09 17:42:00 -08:00
Sagar Vemuri	a6d8e30c05	Remove unnecessary status check in TableCache::NewIterator Summary: While investigating the usage of `new_table_iterator_nanos` perf counter, I saw some code was wrapper around with unnecessary status check ... so removed it. Closes https://github.com/facebook/rocksdb/pull/3120 Differential Revision: D6229181 Pulled By: sagar0 fbshipit-source-id: f8a44fe67f5a05df94553fdb233b21e54e88cc34	2017-11-03 14:42:08 -07:00
Andrew Kryczka	24ad430600	pass key/value samples through zstd compression dictionary generator Summary: Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before. Note this is the first step of #2987, extracted into a separate PR per reviewer request. Closes https://github.com/facebook/rocksdb/pull/3057 Differential Revision: D6116891 Pulled By: ajkr fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497	2017-11-02 22:56:36 -07:00
Andrew Kryczka	c4c1f961e7	dynamically change current memtable size Summary: Previously setting `write_buffer_size` with `SetOptions` would only apply to new memtables. An internal user wanted it to take effect immediately, instead of at an arbitrary future point, to prevent OOM. This PR makes the memtable's size mutable, and makes `SetOptions()` mutate it. There is one case when we preserve the old behavior, which is when memtable prefix bloom filter is enabled and the user is increasing the memtable's capacity. That's because the prefix bloom filter's size is fixed and wouldn't work as well on a larger memtable. Closes https://github.com/facebook/rocksdb/pull/3119 Differential Revision: D6228304 Pulled By: ajkr fbshipit-source-id: e44bd9d10a5f8c9d8c464bf7436070bb3eafdfc9	2017-11-02 22:28:10 -07:00
Yi Wu	62578d80c1	Blob DB: Add compaction filter to remove expired blob index entries Summary: After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries. Closes https://github.com/facebook/rocksdb/pull/3090 Differential Revision: D6183812 Pulled By: yiwu-arbug fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83	2017-11-02 17:27:38 -07:00
Yi Wu	7bfa88037e	Blob DB: fix snapshot handling Summary: Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is 1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish 2. otherwise, it is the latest sequence of the file Closes https://github.com/facebook/rocksdb/pull/3087 Differential Revision: D6182519 Pulled By: yiwu-arbug fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45	2017-11-02 15:58:27 -07:00
Andrew Kryczka	6778690b51	fix duplicate definition of GetEntryType() Summary: It's also defined in db/dbformat.cc per `7fe3b32896` Closes https://github.com/facebook/rocksdb/pull/3111 Differential Revision: D6219140 Pulled By: ajkr fbshipit-source-id: 0f2b14e41457334a4665c6b7e3f42f1a060a0f35	2017-11-01 22:56:17 -07:00
Maysam Yabandeh	02693f64fc	WritePrepared Txn: ValidateSnapshot Summary: Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function. Closes https://github.com/facebook/rocksdb/pull/3101 Differential Revision: D6199405 Pulled By: maysamyabandeh fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240	2017-11-01 19:11:09 -07:00
Mikhail Antonov	7fe3b32896	Added support for differential snapshots Summary: The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2). This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages. From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff". This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR. For now, what's done here according to initial discussions: Preserving deletes: - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion. - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum. - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum. Iterator changes: - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum. TableCache changes: - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span. What's left: - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type. Closes https://github.com/facebook/rocksdb/pull/2999 Differential Revision: D6175602 Pulled By: mikhail-antonov fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900	2017-11-01 18:56:43 -07:00
Maysam Yabandeh	17731a43a6	WritePrepared Txn: Optimize for recoverable state Summary: GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit. This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush. Closes https://github.com/facebook/rocksdb/pull/3071 Differential Revision: D6148023 Pulled By: maysamyabandeh fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7	2017-11-01 17:26:46 -07:00
Shaohua Li	33c7d4ccd9	Make writable_file_max_buffer_size dynamic Summary: The DBOptions::writable_file_max_buffer_size can be changed dynamically. Closes https://github.com/facebook/rocksdb/pull/3053 Differential Revision: D6152720 Pulled By: shligit fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681	2017-10-31 13:56:35 -07:00
Andrew Kryczka	b7bc9cc038	fix tracking oldest snapshot for bottom-level compaction Summary: The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case. Closes https://github.com/facebook/rocksdb/pull/3080 Differential Revision: D6185257 Pulled By: ajkr fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd	2017-10-30 00:55:58 -07:00
Yi Wu	792ef10ca8	Return Status::InvalidArgument if user request sync write while disabling WAL Summary: write_options.sync = true and write_options.disableWAL is incompatible. When WAL is disabled, we are not able to persist the write immediately. Return an error in this case to avoid misuse of the options. Closes https://github.com/facebook/rocksdb/pull/3086 Differential Revision: D6176822 Pulled By: yiwu-arbug fbshipit-source-id: 1eb10028c14fe7d7c13c8bc12c0ef659f75aa071	2017-10-28 22:11:18 -07:00
Andrew Kryczka	6a9335dbbb	always drop tombstones compacted to bottommost level Summary: Problem was in bottommost compaction, when an L0->L0 compaction happened and L0 was bottommost. Then we'd preserve tombstones according to `Compaction::KeyNotExistsBeyondOutputLevel`, while zeroing seqnum according to `CompactionIterator::PrepareOutput`, thus triggering the assertion in `PrepareOutput`. To fix, we can just drop tombstones in L0->L0 when the output is "bottommost", i.e., the compaction includes the oldest L0 file and there's nothing at lower levels. Closes https://github.com/facebook/rocksdb/pull/3085 Differential Revision: D6175742 Pulled By: ajkr fbshipit-source-id: 8ab19a2e001496f362e9eb0a71757e2f6ecfdb3b	2017-10-27 15:56:35 -07:00
Yi Wu	84a04af9a9	TableProperty::oldest_key_time defaults to 0 Summary: We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0. Also revert db_sst_test back to before #2842. Closes https://github.com/facebook/rocksdb/pull/3079 Differential Revision: D6165702 Pulled By: yiwu-arbug fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0	2017-10-27 15:00:05 -07:00
Islam AbdelRahman	05993155ef	Mark files as trash by using .trash extension Summary: SstFileManager move files that need to be deleted into a trash directory. Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted Closes https://github.com/facebook/rocksdb/pull/2970 Differential Revision: D5976805 Pulled By: IslamAbdelRahman fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343	2017-10-27 13:27:12 -07:00
Prashant D	50e95a63dd	Fix coverity issues column_family, compaction_db/iterator Summary: db/column_family.h : 79 ColumnFamilyHandleInternal() CID 1322806 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 2. uninit_member: Non-static class member internal_cfd_ is not initialized in this constructor nor in any functions that it calls. 80 : ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {} db/compacted_db_impl.cc: 18CompactedDBImpl::CompactedDBImpl( 19 const DBOptions& options, const std::string& dbname) 20 : DBImpl(options, dbname) { 2. uninit_member: Non-static class member cfd_ is not initialized in this constructor nor in any functions that it calls. 4. uninit_member: Non-static class member version_ is not initialized in this constructor nor in any functions that it calls. CID 1396120 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 6. uninit_member: Non-static class member user_comparator_ is not initialized in this constructor nor in any functions that it calls. 21} db/compaction_iterator.cc: 9. uninit_member: Non-static class member current_user_key_sequence_ is not initialized in this constructor nor in any functions that it calls. 11. uninit_member: Non-static class member current_user_key_snapshot_ is not initialized in this constructor nor in any functions that it calls. CID 1419855 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 13. uninit_member: Non-static class member current_key_committed_ is not initialized in this constructor nor in any functions that it calls. Closes https://github.com/facebook/rocksdb/pull/3084 Differential Revision: D6172999 Pulled By: sagar0 fbshipit-source-id: 084d73393faf8022c01359cfb445807b6a782460	2017-10-27 11:26:42 -07:00
Prashant D	47166baeac	Fix coverity uninitialized fields warnings Pulled By: ajkr Differential Revision: D6170448 fbshipit-source-id: 5fd6d1608fc0df27c94d9f5059315ce7f79b8f5c	2017-10-26 21:11:50 -07:00
Andrew Kryczka	95667383db	implement lower bound for iterators Summary: - for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified - for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`. Closes https://github.com/facebook/rocksdb/pull/3074 Differential Revision: D6158654 Pulled By: ajkr fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518	2017-10-26 17:27:42 -07:00
Yi Wu	5a2a6483dc	Blob DB: Inline small values in base DB Summary: Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values. Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches. See blob_index.h for the new blob index format. There are 4 cases when writing a new key: * small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue) * small value w/ TTL: put (type, expiration, value) to base db. * large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db. * large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db. Closes https://github.com/facebook/rocksdb/pull/3066 Differential Revision: D6142115 Pulled By: yiwu-arbug fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0	2017-10-26 12:30:54 -07:00
Andrew Kryczka	9b18cc2363	single-file bottom-level compaction when snapshot released Summary: When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys. - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys. - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases. - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called. - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction. Closes https://github.com/facebook/rocksdb/pull/3009 Differential Revision: D6062044 Pulled By: ajkr fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21	2017-10-25 16:30:37 -07:00
Islam AbdelRahman	addfe1ef4b	Fix tombstone scans in SeekForPrev outside prefix Summary: When doing a Seek() or SeekForPrev() we should stop the moment we see a key with a different prefix as start if ReadOptions:: prefix_same_as_start was set to true Right now we don't stop if we encounter a tombstone outside the prefix while executing SeekForPrev() Closes https://github.com/facebook/rocksdb/pull/3067 Differential Revision: D6149638 Pulled By: IslamAbdelRahman fbshipit-source-id: 7f659862d2bf552d3c9104a360c79439ceba2f18	2017-10-25 15:12:00 -07:00
zach shipko	386a57e6ef	Fix build on OpenBSD Summary: A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed. Closes https://github.com/facebook/rocksdb/pull/3061 Differential Revision: D6138800 Pulled By: ajkr fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c	2017-10-24 13:27:38 -07:00
Yi Wu	66a2c44ef4	Add DB::Properties::kEstimateOldestKeyTime Summary: With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property. My plan is to override the property with a more accurate value with blob db, where we actually have timestamp. Closes https://github.com/facebook/rocksdb/pull/2842 Differential Revision: D5770600 Pulled By: yiwu-arbug fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a	2017-10-23 15:27:27 -07:00
Dmitri Smirnov	d2a65c59e1	Fix unused var warnings in Release mode Summary: MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var. Closes https://github.com/facebook/rocksdb/pull/3048 Differential Revision: D6126272 Pulled By: maysamyabandeh fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f	2017-10-23 14:27:04 -07:00
Maysam Yabandeh	63822eb761	Enable two write queues for transactions Summary: Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config. Closes https://github.com/facebook/rocksdb/pull/3046 Differential Revision: D6106534 Pulled By: maysamyabandeh fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c	2017-10-23 14:27:04 -07:00
Sagar Vemuri	a02ed12638	Exclude DBTest.DynamicFIFOCompactionOptions test under RocksDB Lite Summary: This test shouldn't be enabled under the lite version; and this fixes the failing contrun test due to #3006. Closes https://github.com/facebook/rocksdb/pull/3056 Differential Revision: D6114681 Pulled By: sagar0 fbshipit-source-id: dc5243549ae6b1353cec7edb820c771d95f66dda	2017-10-20 17:11:39 -07:00
Maysam Yabandeh	3ef55d2c7c	Split CompactionFilterWithValueChange Summary: The test currently times out when it is run under tsan. This patch split it into 4 tests. Closes https://github.com/facebook/rocksdb/pull/3047 Differential Revision: D6106515 Pulled By: maysamyabandeh fbshipit-source-id: 03a28cdf8b1c097be2361b1b0cc3dc1acf2b5d63	2017-10-20 15:42:07 -07:00
Andrew Kryczka	f8b5bb2fd8	remove unused code Summary: fixup `6a541afcc4`. This code didn't do anything because (1) `bytes_per_sync` is assigned in `EnvOptions`'s constructor; and (2) `OptimizeForCompactionTableWrite`'s return value was ignored, even though its only purpose is to return something. Closes https://github.com/facebook/rocksdb/pull/3055 Differential Revision: D6114132 Pulled By: ajkr fbshipit-source-id: ea4831770930e9cf83518e13eb2e1934d1f5487c	2017-10-20 14:11:52 -07:00
Sagar Vemuri	f0804db7f7	Make FIFO compaction options dynamically configurable Summary: ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now. Some of the ways in which the fifo compaction options can be set are: - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})` - `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})` - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})` - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})` Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes. Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`. The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now. Closes https://github.com/facebook/rocksdb/pull/3006 Differential Revision: D6058619 Pulled By: sagar0 fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92	2017-10-19 15:26:36 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Gihwan Oh	7deed2b43c	Fix a typo in a comment Summary: instad of for specific level -> instead of a specific level Closes https://github.com/facebook/rocksdb/pull/3040 Differential Revision: D6090811 Pulled By: sagar0 fbshipit-source-id: 499edef0a6f596c448f61791e6aca8f5cce08e9c	2017-10-18 12:32:28 -07:00
Maysam Yabandeh	7e38238981	WritePrepared Txn: Disable GC during recovery Summary: Disables GC during recovery of a WritePrepared txn db to avoid GCing uncommitted key values. Closes https://github.com/facebook/rocksdb/pull/2980 Differential Revision: D6000191 Pulled By: maysamyabandeh fbshipit-source-id: fc4d522c643d24ebf043f811fe4ecd0dd0294675	2017-10-18 09:11:50 -07:00

1 2 3 4 5 ...

2978 commits