rocksdb

Commit Graph

Author	SHA1	Message	Date
Changli Gao	0a7ba0e548	Fix memleak when DB::DeleteFile() Summary: Because the corresponding read_first_record_cache_ item wasn't erased, memory leaked. Closes https://github.com/facebook/rocksdb/pull/1712 Differential Revision: D4363654 Pulled By: ajkr fbshipit-source-id: 7da1adcfc8c380e4ffe05b8769fc2221ad17a225	2018-01-11 18:57:33 -08:00
Yi Wu	06149429d9	WritePrepared Txn: Return NotSupported on iterator refresh Summary: A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require release and acquire another snapshot. Since MyRocks don't make use of Iterator::Refresh(), we just simply mark it as not supported. Closes https://github.com/facebook/rocksdb/pull/3290 Differential Revision: D6599931 Pulled By: yiwu-arbug fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf	2017-12-18 22:29:30 -08:00
Yi Wu	237b292515	BlobDB: Remove the need to get sequence number per write Summary: Previously we store sequence number range of each blob files, and use the sequence number range to check if the file can be possibly visible by a snapshot. But it adds complexity to the code, since the sequence number is only available after a write. (The current implementation get sequence number by calling GetLatestSequenceNumber(), which is wrong.) With the patch, we are not storing sequence number range, and check if snapshot_sequence < obsolete_sequence to decide if the file is visible by a snapshot (previously we check if first_sequence <= snapshot_sequence < obsolete_sequence). Closes https://github.com/facebook/rocksdb/pull/3274 Differential Revision: D6571497 Pulled By: yiwu-arbug fbshipit-source-id: ca06479dc1fcd8782f6525b62b7762cd47d61909	2017-12-15 13:27:30 -08:00
Zhongyi Xie	fcc8a6574d	Make Universal compaction options dynamic Summary: Let me know if more test coverage is needed Closes https://github.com/facebook/rocksdb/pull/3213 Differential Revision: D6457165 Pulled By: miasantreble fbshipit-source-id: 3f944abff28aa7775237f1c4f61c64ccbad4eea9	2017-12-11 13:27:06 -08:00
Andrew Kryczka	78d1a5ec72	Preserve overlapping file endpoint invariant Summary: Fix for #2833. - In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete. - In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`. - In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`). Closes https://github.com/facebook/rocksdb/pull/2843 Differential Revision: D5772387 Pulled By: ajkr fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881	2017-12-06 18:56:54 -08:00
Maysam Yabandeh	18dcf7f98d	WritePrepared Txn: PreReleaseCallback Summary: Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue. These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number. Closes https://github.com/facebook/rocksdb/pull/3205 Differential Revision: D6438959 Pulled By: maysamyabandeh fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9	2017-11-30 23:50:45 -08:00
Zhongyi Xie	32e31d49d1	Make DBOption compaction_readahead_size dynamic Summary: Closes https://github.com/facebook/rocksdb/pull/3004 Differential Revision: D6056141 Pulled By: miasantreble fbshipit-source-id: 56df1630f464fd56b07d25d38161f699e0528b7f	2017-11-16 17:57:25 -08:00
Maysam Yabandeh	857adf388f	WritePrepared Txn: Refactor conf params Summary: Summary of changes: - Move seq_per_batch out of Options - Rename concurrent_prepare to two_write_queues - Add allocate_seq_only_for_data_ Closes https://github.com/facebook/rocksdb/pull/3136 Differential Revision: D6304458 Pulled By: maysamyabandeh fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c	2017-11-10 17:28:12 -08:00
Yi Wu	7bfa88037e	Blob DB: fix snapshot handling Summary: Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is 1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish 2. otherwise, it is the latest sequence of the file Closes https://github.com/facebook/rocksdb/pull/3087 Differential Revision: D6182519 Pulled By: yiwu-arbug fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45	2017-11-02 15:58:27 -07:00
Maysam Yabandeh	02693f64fc	WritePrepared Txn: ValidateSnapshot Summary: Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function. Closes https://github.com/facebook/rocksdb/pull/3101 Differential Revision: D6199405 Pulled By: maysamyabandeh fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240	2017-11-01 19:11:09 -07:00
Mikhail Antonov	7fe3b32896	Added support for differential snapshots Summary: The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2). This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages. From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff". This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR. For now, what's done here according to initial discussions: Preserving deletes: - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion. - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum. - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum. Iterator changes: - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum. TableCache changes: - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span. What's left: - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type. Closes https://github.com/facebook/rocksdb/pull/2999 Differential Revision: D6175602 Pulled By: mikhail-antonov fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900	2017-11-01 18:56:43 -07:00
Shaohua Li	33c7d4ccd9	Make writable_file_max_buffer_size dynamic Summary: The DBOptions::writable_file_max_buffer_size can be changed dynamically. Closes https://github.com/facebook/rocksdb/pull/3053 Differential Revision: D6152720 Pulled By: shligit fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681	2017-10-31 13:56:35 -07:00
Andrew Kryczka	b7bc9cc038	fix tracking oldest snapshot for bottom-level compaction Summary: The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case. Closes https://github.com/facebook/rocksdb/pull/3080 Differential Revision: D6185257 Pulled By: ajkr fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd	2017-10-30 00:55:58 -07:00
Yi Wu	5a2a6483dc	Blob DB: Inline small values in base DB Summary: Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values. Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches. See blob_index.h for the new blob index format. There are 4 cases when writing a new key: * small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue) * small value w/ TTL: put (type, expiration, value) to base db. * large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db. * large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db. Closes https://github.com/facebook/rocksdb/pull/3066 Differential Revision: D6142115 Pulled By: yiwu-arbug fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0	2017-10-26 12:30:54 -07:00
Andrew Kryczka	9b18cc2363	single-file bottom-level compaction when snapshot released Summary: When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys. - Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys. - Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases. - Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called. - Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction. Closes https://github.com/facebook/rocksdb/pull/3009 Differential Revision: D6062044 Pulled By: ajkr fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21	2017-10-25 16:30:37 -07:00
Maysam Yabandeh	63822eb761	Enable two write queues for transactions Summary: Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config. Closes https://github.com/facebook/rocksdb/pull/3046 Differential Revision: D6106534 Pulled By: maysamyabandeh fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c	2017-10-23 14:27:04 -07:00
Andrew Kryczka	f8b5bb2fd8	remove unused code Summary: fixup `6a541afcc4`. This code didn't do anything because (1) `bytes_per_sync` is assigned in `EnvOptions`'s constructor; and (2) `OptimizeForCompactionTableWrite`'s return value was ignored, even though its only purpose is to return something. Closes https://github.com/facebook/rocksdb/pull/3055 Differential Revision: D6114132 Pulled By: ajkr fbshipit-source-id: ea4831770930e9cf83518e13eb2e1934d1f5487c	2017-10-20 14:11:52 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Maysam Yabandeh	7e38238981	WritePrepared Txn: Disable GC during recovery Summary: Disables GC during recovery of a WritePrepared txn db to avoid GCing uncommitted key values. Closes https://github.com/facebook/rocksdb/pull/2980 Differential Revision: D6000191 Pulled By: maysamyabandeh fbshipit-source-id: fc4d522c643d24ebf043f811fe4ecd0dd0294675	2017-10-18 09:11:50 -07:00
Yi Wu	eaaef91178	Blob DB: Store blob index as kTypeBlobIndex in base db Summary: Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db. Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it. Closes https://github.com/facebook/rocksdb/pull/3000 Differential Revision: D6050044 Pulled By: yiwu-arbug fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632	2017-10-17 17:28:11 -07:00
Yi Wu	fb4ae4d810	fix DBImpl::NewInternalIterator super-version leak on failure Summary: Close #2955 Closes https://github.com/facebook/rocksdb/pull/2960 Differential Revision: D5962872 Pulled By: yiwu-arbug fbshipit-source-id: a6472d5c015bea3dc476c572ff5a5c90259e6059	2017-10-11 14:57:43 -07:00
Yi Wu	8c392a31d7	WritePrepared Txn: Iterator Summary: On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed. Closes https://github.com/facebook/rocksdb/pull/2981 Differential Revision: D6001471 Pulled By: yiwu-arbug fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe	2017-10-09 17:15:28 -07:00
Adrien Schildknecht	01542400a8	Inform caller when rocksdb is stalling writes Summary: Add a new function in Listener to let the caller know when rocksdb is stalling writes. Closes https://github.com/facebook/rocksdb/pull/2897 Differential Revision: D5860124 Pulled By: schischi fbshipit-source-id: ee791606169aa64f772c86f817cebf02624e05e1	2017-10-05 18:11:43 -07:00
Yi Wu	d1cab2b64e	Add ValueType::kTypeBlobIndex Summary: Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to 1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex. 2. Make rocksdb able to detect if the db contains value written by blob db, if so return error. 3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type). The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob(). Changes on blob db side will be in a separate patch. Closes https://github.com/facebook/rocksdb/pull/2886 Differential Revision: D5838431 Pulled By: yiwu-arbug fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca	2017-10-03 09:11:23 -07:00
Maysam Yabandeh	385049baf2	WritePrepared Txn: Recovery Summary: Recover txns from the WAL. Also added some unit tests. Closes https://github.com/facebook/rocksdb/pull/2901 Differential Revision: D5859596 Pulled By: maysamyabandeh fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0	2017-09-28 16:56:45 -07:00
Quinn Jarrell	6a541afcc4	Make bytes_per_sync and wal_bytes_per_sync mutable Summary: SUMMARY Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed. TEST PLAN ran make check all passed Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value. Closes https://github.com/facebook/rocksdb/pull/2893 Reviewed By: yiwu-arbug Differential Revision: D5845814 Pulled By: TheRushingWookie fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd	2017-09-27 17:49:45 -07:00
Yi Wu	b4596c6174	Fix Get does not return super version on error Summary: This is caught when I was testing #2886. Closes https://github.com/facebook/rocksdb/pull/2907 Differential Revision: D5863153 Pulled By: yiwu-arbug fbshipit-source-id: 8c54759ba1a0dc101f24ab50423e35731300612d	2017-09-19 12:01:09 -07:00
Maysam Yabandeh	60beefd6e0	WritePrepared Txn: Advance seq one per batch Summary: By default the seq number in DB is increased once per written key. WritePrepared txns requires the seq to be increased once per the entire batch so that the seq would be used as the prepare timestamp by which the transaction is identified. Also we need to increase seq for the commit marker since it would give a unique id to the commit timestamp of transactions. Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch. Closes https://github.com/facebook/rocksdb/pull/2885 Differential Revision: D5837843 Pulled By: maysamyabandeh fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c	2017-09-18 14:45:08 -07:00
Amy Xu	5785b1fcb8	Fix naming in InternalKey Summary: - Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic Closes https://github.com/facebook/rocksdb/pull/2868 Differential Revision: D5804152 Pulled By: axxufb fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183	2017-09-12 17:17:42 -07:00
Maysam Yabandeh	f46464d383	write-prepared txn: call IsInSnapshot Summary: This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot. Closes https://github.com/facebook/rocksdb/pull/2850 Differential Revision: D5787375 Pulled By: maysamyabandeh fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c	2017-09-11 09:14:48 -07:00
Siying Dong	0e99323ac2	Fix CLANG Analyze Summary: clang analyze shows warnings after we upgrade the CLANG version. Fix them. Closes https://github.com/facebook/rocksdb/pull/2839 Differential Revision: D5769060 Pulled By: siying fbshipit-source-id: 3f8e4df715590d8984f6564b608fa08cfdfa5f14	2017-09-07 14:28:06 -07:00
Kamalalochana Subbaiah	e612e31740	Updated CRC32 Power Optimization Changes Summary: Support for PowerPC Architecture Detecting AltiVec Support Closes https://github.com/facebook/rocksdb/pull/2716 Differential Revision: D5606836 Pulled By: siying fbshipit-source-id: 720262453b1546e5fdbbc668eff56848164113f3	2017-08-31 14:16:30 -07:00
Andrew Kryczka	c10cf166fa	Dump non-final ZSTD compression type support Summary: Closes https://github.com/facebook/rocksdb/pull/2810 Differential Revision: D5739947 Pulled By: ajkr fbshipit-source-id: 09f99718b6b083c2711dcf17f7b68c305f3fd261	2017-08-30 16:41:24 -07:00
Artem Danilov	8a6708f5f2	Extend property map with compaction stats Summary: This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch. Closes https://github.com/facebook/rocksdb/pull/2794 Differential Revision: D5717676 Pulled By: Tema fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a	2017-08-30 15:26:55 -07:00
Yi Wu	92bfd6c507	Fix DropColumnFamily data race Summary: It should hold db mutex while accessing max_total_in_memory_state_. Closes https://github.com/facebook/rocksdb/pull/2784 Differential Revision: D5696536 Pulled By: yiwu-arbug fbshipit-source-id: 45430634d7fe11909b38e42e5f169f618681c4ee	2017-08-24 14:56:04 -07:00
Andrew Kryczka	ed0a4c93ef	perf_context measure user bytes read Summary: With this PR, we can measure read-amp for queries where perf_context is enabled as follows: ``` SetPerfLevel(kEnableCount); Get(1, "foo"); double read_amp = static_cast<double>(get_perf_context()->block_read_byte / get_perf_context()->get_read_bytes); SetPerfLevel(kDisable); ``` Our internal infra enables perf_context for a sampling of queries. So we'll be able to compute the read-amp for the sample set, which can give us a good estimate of read-amp. Closes https://github.com/facebook/rocksdb/pull/2749 Differential Revision: D5647240 Pulled By: ajkr fbshipit-source-id: ad73550b06990cf040cc4528fa885360f308ec12	2017-08-18 11:43:33 -07:00
Aaron G	7848f0b24c	add VerifyChecksum() to db.h Summary: We need a tool to check any sst file corruption in the db. It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status. Closes https://github.com/facebook/rocksdb/pull/2498 Differential Revision: D5324269 Pulled By: lightmark fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b	2017-08-09 15:58:13 -07:00
Chang Liu	d97a72d63f	Try to repair db with wal_dir option, avoid leak some WAL files Summary: We should search wal_dir in Repairer::FindFiles function, and avoid use LogFileNmae(dbname, number) to get WAL file's name, which will get a wrong WAL filename. as following: ``` [WARN] [/home/liuchang/Workspace/rocksdb/db/repair.cc:310] Log #3: ignoring conversion error: IO error: While opening a file for sequentially reading: /tmp/rocksdbtest-1000/repair_test/000003.log: No such file or directory ``` I have added a new test case to repair_test.cc, which try to repair db with all WAL options. Signed-off-by: Chang Liu <liuchang0812@gmail.com> Closes https://github.com/facebook/rocksdb/pull/2692 Differential Revision: D5575888 Pulled By: ajkr fbshipit-source-id: 5b93e9f85cddc01663ccecd87631fa723ac466a3	2017-08-08 10:47:57 -07:00
Andrew Kryczka	cc01985db0	Introduce bottom-pri thread pool for large universal compactions Summary: When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs. This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions. - Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads. - Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default. - Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic. - Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`). - Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free. Closes https://github.com/facebook/rocksdb/pull/2580 Differential Revision: D5422916 Pulled By: ajkr fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333	2017-08-03 15:43:29 -07:00
Siying Dong	e67b35c076	Add Iterator::Refresh() Summary: Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily. Closes https://github.com/facebook/rocksdb/pull/2621 Differential Revision: D5464500 Pulled By: siying fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9	2017-07-24 10:54:37 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Maysam Yabandeh	499ebb3ab5	Optimize for serial commits in 2PC Summary: Throughput: 46k tps in our sysbench settings (filling the details later) The idea is to have the simplest change that gives us a reasonable boost in 2PC throughput. Major design changes: 1. The WAL file internal buffer is not flushed after each write. Instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_. 2. Use two sequence numbers: last seq, and last seq for write. Last seq is the last visible sequence number for reads. Last seq for write is the next sequence number that should be used to write to WAL/memtable. This allows to have a memtable write be in parallel to WAL writes. 3. BatchGroup is not used for writes. This means that we can have parallel writers which changes a major assumption in the code base. To accommodate for that i) allow only 1 WriteImpl that intends to write to memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes come via group commit phase which is serial anyway, ii) make all the parts in the code base that assumed to be the only writer (via EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are protected via a stat_mutex_. Note: the first commit has the approach figured out but is not clean. Submitting the PR anyway to get the early feedback on the approach. If we are ok with the approach I will go ahead with this updates: 0) Rebase with Yi's pipelining changes 1) Currently batching is disabled by default to make sure that it will be consistent with all unit tests. Will make this optional via a config. 2) A couple of unit tests are disabled. They need to be updated with the serial commit of 2PC taken into account. 3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up. Closes https://github.com/facebook/rocksdb/pull/2345 Differential Revision: D5210732 Pulled By: maysamyabandeh fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4	2017-06-24 14:11:29 -07:00
Siying Dong	6837a17621	Fix Data Race Between CreateColumnFamily() and GetAggregatedIntProperty() Summary: CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed. Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later. Closes https://github.com/facebook/rocksdb/pull/2475 Differential Revision: D5298053 Pulled By: siying fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6	2017-06-22 15:56:47 -07:00
Siying Dong	52a7f38b19	WriteOptions.low_pri which can throttle low pri writes if needed Summary: If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown. Closes https://github.com/facebook/rocksdb/pull/2369 Differential Revision: D5127619 Pulled By: siying fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0	2017-06-05 15:02:35 -07:00
hyunwoo	c7662a44a4	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2376 Differential Revision: D5183630 Pulled By: ajkr fbshipit-source-id: 133cfd0445959e70aa2cd1a12151bf3c0c5c3ac5	2017-06-05 11:27:34 -07:00
Tamir Duberstein	0dc3040d54	db: avoid `#include`ing malloc and jemalloc simultaneously Summary: This fixes a compilation failure on Linux when the system libc is not glibc. jemalloc's configure script incorrectly assumes that glibc is always used on Linux systems, producing glibc-style signatures; when the system libc is e.g. musl, the following error is observed: ``` [ 0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0, from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77: /x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void)' has a different exception specifier size_t malloc_usable_size(void ); ^~~~~~~~~~~~~~~~~~ In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0: /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()' # define je_malloc_usable_size malloc_usable_size ^ /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size' JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size( ^~~~~~~~~~~~~~~~~~~~~ CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed ``` This works around the issue by rearranging the sources such that jemalloc's headers are never in the same scope as the system's malloc header. The jemalloc issue has been reported as well, see: https://github.com/jemalloc/jemalloc/issues/778. cc tschottdorf Closes https://github.com/facebook/rocksdb/pull/2188 Differential Revision: D5163048 Pulled By: siying fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d	2017-05-31 22:43:02 -07:00
Yi Wu	07bdcb91fe	New WriteImpl to pipeline WAL/memtable write Summary: PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allow to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queue is maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join memtable writer queue if needed. The benefit of this approach is that 1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit write thread once WAL write is finish. They don't need to wait for memtable writes in case of group commit. 2. Pending writers only need to wait for previous WAL writer finish to be able to join the write thread, instead of wait also for previous memtable writes. Merging #2056 and #2058 into this PR. Closes https://github.com/facebook/rocksdb/pull/2286 Differential Revision: D5054606 Pulled By: yiwu-arbug fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d	2017-05-19 14:26:42 -07:00
Mikhail Antonov	ba685a472a	Support ingest_behind for IngestExternalFile Summary: First cut for early review; there are few conceptual points to answer and some code structure issues. For conceptual points - - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true \|\| disable_auto_compaction=false), the user is responsible to properly open and close DB with required params - we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise? - Modifying AssignLevelForIngestedFile seems the place we we'd handle that. On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites. Closes https://github.com/facebook/rocksdb/pull/2144 Differential Revision: D4873732 Pulled By: lightmark fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a	2017-05-17 11:42:42 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
Yi Wu	2cd00773c7	Add bulk create/drop column family API Summary: Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time. The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO. Closes https://github.com/facebook/rocksdb/pull/2248 Differential Revision: D5001578 Pulled By: yiwu-arbug fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03	2017-05-07 23:20:46 -07:00
Leonidas Galanis	a45e98a5b5	max_open_files dynamic set, follow up Summary: Followup to make 0x40000 a TableCache constant that indicates infinite capacity Closes https://github.com/facebook/rocksdb/pull/2247 Differential Revision: D5001349 Pulled By: lgalanis fbshipit-source-id: ce7bd2e54b0975bb9f8680fdaa0f8bb0e7ae81a2	2017-05-04 10:42:45 -07:00
Leonidas Galanis	e7ae4a3a02	Max open files mutable Summary: Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache. Closes https://github.com/facebook/rocksdb/pull/2185 Differential Revision: D4979189 Pulled By: yiwu-arbug fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda	2017-05-03 21:13:14 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Aaron Gao	7eddecce12	support bulk loading with universal compaction Summary: Support buck load with universal compaction. More test cases to be added. Closes https://github.com/facebook/rocksdb/pull/2202 Differential Revision: D4935360 Pulled By: lightmark fbshipit-source-id: cc3ca1b6f42faa503207dab1408d6bcf393ee5b5	2017-04-26 13:41:32 -07:00
Siying Dong	c49d704656	Add DB:ResetStats() Summary: Add a function to allow users to reset internal stats without restarting the DB. Closes https://github.com/facebook/rocksdb/pull/2167 Differential Revision: D4907939 Pulled By: siying fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68	2017-04-18 16:56:48 -07:00
Siying Dong	8f47a97512	File level histogram should be printed per CF, not per DB Summary: Currently level histogram is only printed out for DB stats and for default CF. This is confusing. Change to print for every CF instead. Closes https://github.com/facebook/rocksdb/pull/2126 Differential Revision: D4865373 Pulled By: siying fbshipit-source-id: 1c853e0ac66e00120ee931cabc9daf69ccc2d577	2017-04-11 08:42:03 -07:00
Sagar Vemuri	7124268a09	Reduce the number of params needed to construct DBIter Summary: DBIter, and in-turn NewDBIterator and NewArenaWrappedDBIterator, take a bunch of params. They can be reduced by passing in ReadOptions directly instead of passing in every new param separately. It also seems much cleaner as a bunch of the params towards the end seem to be optional. (Recently I introduced max_skippable_internal_keys, which added one more to the already huge count). Idea courtesy IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/2116 Differential Revision: D4857128 Pulled By: sagar0 fbshipit-source-id: 7d239df094b94bd9ea79d145cdf825478ac037a8	2017-04-10 11:14:14 -07:00
Islam AbdelRahman	61730186df	dummy diff Summary: Closes https://github.com/facebook/rocksdb/pull/2114 Differential Revision: D4854860 Pulled By: IslamAbdelRahman fbshipit-source-id: b871c5b9ccc52d20f5ceacdd172dc70b1dbf9110	2017-04-07 17:07:37 -07:00
Tamir Duberstein	107c5f6a60	CMake: more MinGW fixes Summary: siying this is a resubmission of #2081 with the 4th commit fixed. From that commit message: > Note that the previous use of quotes in PLATFORM_{CC,CXX}FLAGS was incorrect and caused GCC to produce the incorrect define: > > #define ROCKSDB_JEMALLOC -DJEMALLOC_NO_DEMANGLE 1 > > This was the cause of the Linux build failure on the previous version of this change. I've tested this locally, and the Linux build succeeds now. Closes https://github.com/facebook/rocksdb/pull/2097 Differential Revision: D4839964 Pulled By: siying fbshipit-source-id: cc51322	2017-04-06 14:09:13 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Siying Dong	ce64b8b719	Divide db/db_impl.cc Summary: db_impl.cc is too large to manage. Divide db_impl.cc into db/db_impl.cc, db/db_impl_compaction_flush.cc, db/db_impl_files.cc, db/db_impl_open.cc and db/db_impl_write.cc. Closes https://github.com/facebook/rocksdb/pull/2095 Differential Revision: D4838188 Pulled By: siying fbshipit-source-id: c5f3059	2017-04-05 17:24:19 -07:00
Siying Dong	43010a929f	Revert "[rocksdb][PR] CMake: more MinGW fixes" fbshipit-source-id: 43b4529	2017-04-04 16:24:26 -07:00
Tamir Duberstein	3450ac8c1b	CMake: more MinGW fixes Summary: See individual commits. yuslepukhin siying Closes https://github.com/facebook/rocksdb/pull/2081 Differential Revision: D4824639 Pulled By: IslamAbdelRahman fbshipit-source-id: 2fc2b00	2017-04-04 15:09:17 -07:00
Yi Wu	9e44531803	Refactor WriteImpl (pipeline write part 1) Summary: Refactor WriteImpl() so when I plug-in the pipeline write code (which is an alternative approach for WriteThread), some of the logic can be reuse. I split out the following methods from WriteImpl(): * PreprocessWrite() * HandleWALFull() (previous MaybeFlushColumnFamilies()) * HandleWriteBufferFull() * WriteToWAL() Also adding a constructor to WriteThread::Writer, and move WriteContext into db_impl.h. No real logic change in this patch. Closes https://github.com/facebook/rocksdb/pull/2042 Differential Revision: D4781014 Pulled By: yiwu-arbug fbshipit-source-id: d45ca18	2017-04-04 10:24:32 -07:00
Siying Dong	6ef8c620d3	Move auto_roll_logger and filename out of db/ Summary: It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency. Closes https://github.com/facebook/rocksdb/pull/2080 Differential Revision: D4821141 Pulled By: siying fbshipit-source-id: ca7d768	2017-04-03 18:39:14 -07:00
Nikhil Benesch	d25e28d584	replace sometimes-undefined uint type with unsigned int Summary: `uint` is nonstandard and not a built-in type on all compilers; replace it with the always-valid `unsigned int`. I assume this went unnoticed because it's inside an `#ifdef ROCKDB_JEMALLOC`. Closes https://github.com/facebook/rocksdb/pull/2075 Differential Revision: D4820427 Pulled By: ajkr fbshipit-source-id: 0876561	2017-04-03 11:39:09 -07:00
Sagar Vemuri	c6d04f2ecf	Option to fail a request as incomplete when skipping too many internal keys Summary: Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to be skipped. Adding an option, max_skippable_internal_keys -- which could be used to set a threshold for the maximum number of keys that can be skipped, will help to address these cases where it is much better to fail a request (as incomplete) than to wait for a considerable time for the request to complete. This feature -- to fail an iterator seek request as incomplete, is disabled by default when max_skippable_internal_keys = 0. It is enabled only when max_skippable_internal_keys > 0. This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084. Closes https://github.com/facebook/rocksdb/pull/2000 Differential Revision: D4753223 Pulled By: sagar0 fbshipit-source-id: 1c973f7	2017-03-30 12:09:21 -07:00
Aaron Gao	3e56c7e0c4	make total_log_size_ atomic Summary: make total_log_size_ atomic to avoid overflow caused by data race. Closes https://github.com/facebook/rocksdb/pull/2019 Differential Revision: D4751391 Pulled By: siying fbshipit-source-id: fac01dd	2017-03-22 11:54:40 -07:00
Siying Dong	8f5bf04468	Flush triggered by DB write buffer size picks the oldest unflushed CF Summary: Previously, when DB write buffer size triggers, we always pick the CF with most data in its memtable to flush. This approach can minimize total flush happens. Change the behavior to always pick the oldest unflushed CF, which makes it the same behavior when max_total_wal_size hits. This approach will minimize size used by max_total_wal_size. Closes https://github.com/facebook/rocksdb/pull/1987 Differential Revision: D4703214 Pulled By: siying fbshipit-source-id: 9ff8b09	2017-03-21 11:09:10 -07:00
Raza Hussain	6908e24b56	dynamic setting of stats_dump_period_sec through SetDBOption() Summary: Resolved the following issue: https://github.com/facebook/rocksdb/issues/1930 Closes https://github.com/facebook/rocksdb/pull/2004 Differential Revision: D4736764 Pulled By: yiwu-arbug fbshipit-source-id: 64fe0b7	2017-03-20 22:54:13 -07:00
Islam AbdelRahman	d52f334cbd	Break stalls when no bg work is happening Summary: Current stall will keep sleeping even if there is no Flush/Compactions to wait for, I changed the logic to break the stall if we are not flushing or compacting db_bench command used ``` # fillrandom # memtable size = 10MB # value size = 1 MB # num = 1000 # use /dev/shm ./db_bench --benchmarks="fillrandom,stats" --value_size=1048576 --write_buffer_size=10485760 --num=1000 --delayed_write_rate=XXXXX --db="/dev/shm/new_stall" \| grep "Cumulative stall" ``` ``` Current results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.031 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00:00:22.314 H:M:S # delayed_write_rate = 100 Kb/sec Cumulative stall: 00:00:42.784 H:M:S # delayed_write_rate = 50 Kb/sec Cumulative stall: 00:01:23.785 H:M:S # delayed_write_rate = 25 Kb/sec Cumulative stall: 00:02:45.702 H:M:S ``` ``` New results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.017 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00 Closes https://github.com/facebook/rocksdb/pull/1884 Differential Revision: D4585439 Pulled By: IslamAbdelRahman fbshipit-source-id: aed2198	2017-03-16 18:24:17 -07:00
Islam AbdelRahman	e19163688b	Add macros to include file name and line number during Logging Summary: current logging ``` 2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25 2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2. 2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK 2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK 2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0. 2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6 2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1 2017/03/14-14:20:31. Closes https://github.com/facebook/rocksdb/pull/1990 Differential Revision: D4708695 Pulled By: IslamAbdelRahman fbshipit-source-id: cb8968f	2017-03-15 19:39:12 -07:00
Maysam Yabandeh	11526252cc	Pinnableslice (2nd attempt) Summary: PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: value 100 byte: 1.8% regular, 1.2% merge values value 1k byte: 11.5% regular, 7.5% merge values value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The d Closes https://github.com/facebook/rocksdb/pull/1756 Differential Revision: D4391738 Pulled By: maysamyabandeh fbshipit-source-id: 6f3edd3	2017-03-13 11:54:10 -07:00
Sagar Vemuri	97edc72d39	Add a memtable-only iterator Summary: This PR is to support a way to iterate over all the keys that are just in memtables. Closes https://github.com/facebook/rocksdb/pull/1953 Differential Revision: D4663500 Pulled By: sagar0 fbshipit-source-id: 144e177	2017-03-07 11:54:10 -08:00
Leonidas Galanis	72202962f9	fix db_sst_test flakiness Summary: db_sst_test had been flaky occasionally in the following way: reached_max_space_on_compaction can in very rare cases be 0. This happens when the limit on maximum allowable space set using SetMaxAllowedSpaceUsage is hit during flush for all test db sizes (1,2,4,8 and 10MB).The fix clears the error returned when the the space limit is reached during flush. This ensures that the compaction call back will always be called. The runtime is increased slightly because the 1MB loop writes more data and hits the limit during multiple flushes until compaction is scheduled. Closes https://github.com/facebook/rocksdb/pull/1861 Differential Revision: D4557396 Pulled By: lgalanis fbshipit-source-id: ff778d1	2017-03-07 11:24:13 -08:00
Reid Horuff	58b12dfe37	Set logs as getting flushed before releasing lock, race condition fix Summary: Relating to #1903: In MaybeFlushColumnFamilies() we want to modify the 'getting_flushed' flag before releasing the db mutex when SwitchMemtable() is called. The following 2 actions need to be atomic in MaybeFlushColumnFamilies() - getting_flushed is false on oldest log - we determine that all CFs can be flushed to successfully release oldest log - we set getting_flushed = true on the oldest log. ------- - getting_flushed is false on oldest log - we determine that all CFs can NOT be flushed to successfully release oldest log - we set unable_to_flush_oldest_log_ = true on the oldest log. #### In the 2pc case: T1 enters function but is unable to flush all CFs to release log T1 sets unable_to_flush_oldest_log_ = true T1 begins flushing all CFs possible T2 enters function but is unable to flush all CFs to release log T2 sees unable_to_flush_oldes_log_ has been set so exits T3 enters function and will be able to flush all CFs to release oldest log T3 sets getting_flushed = true on oldes Closes https://github.com/facebook/rocksdb/pull/1909 Differential Revision: D4646235 Pulled By: reidHoruff fbshipit-source-id: c8d0447	2017-03-06 15:09:11 -08:00
Aaron Gao	6fb9013441	sanitize readahead when direct read enabled Summary: no readahead: readseq : 8.438 micros/op 118510 ops/sec; 13.1 MB/s sanitize to 10MB: readseq : 6.051 micros/op 165248 ops/sec; 18.3 MB/s Closes https://github.com/facebook/rocksdb/pull/1945 Differential Revision: D4645811 Pulled By: lightmark fbshipit-source-id: 5d63770	2017-03-02 17:24:11 -08:00
Aaron Gao	e877afa08b	Remove bulk loading and auto_roll_logger in rocksdb_lite Summary: shrink lite size Closes https://github.com/facebook/rocksdb/pull/1929 Differential Revision: D4622059 Pulled By: siying fbshipit-source-id: 050b796	2017-02-28 11:09:11 -08:00
Peter (Stig) Edwards	2ca2059f66	Get unique_ptr to use delete[] for char[] in DumpMallocStats Summary: Avoid mismatched free() / delete / delete [] in DumpMallocStats Closes https://github.com/facebook/rocksdb/pull/1927 Differential Revision: D4622045 Pulled By: siying fbshipit-source-id: 1131b30	2017-02-27 17:39:12 -08:00
Siying Dong	1ba2804b7f	Remove XFunc tests Summary: Xfunc is hardly used. Remove it to keep the code simple. Closes https://github.com/facebook/rocksdb/pull/1905 Differential Revision: D4603220 Pulled By: siying fbshipit-source-id: 731f96d	2017-02-23 12:09:11 -08:00
Mike Kolupaev	18eeb7b90e	Fix interference between max_total_wal_size and db_write_buffer_size checks Summary: This is a trivial fix for OOMs we've seen a few days ago in logdevice. RocksDB get into the following state: (1) Write throughput is too high for flushes to keep up. Compactions are out of the picture - automatic compactions are disabled, and for manual compactions we don't care that much if they fall behind. We write to many CFs, with only a few L0 sst files in each, so compactions are not needed most of the time. (2) total_log_size_ is consistently greater than GetMaxTotalWalSize(). It doesn't get smaller since flushes are falling ever further behind. (3) Total size of memtables is way above db_write_buffer_size and keeps growing. But the write_buffer_manager_->ShouldFlush() is not checked because (2) prevents it (for no good reason, afaict; this is what this commit fixes). (4) Every call to WriteImpl() hits the MaybeFlushColumnFamilies() path. This keeps flushing the memtables one by one in order of increasing log file number. (5) No write stalling trigger is hit. We rely on max_write_buffer_number Closes https://github.com/facebook/rocksdb/pull/1893 Differential Revision: D4593590 Pulled By: yiwu-arbug fbshipit-source-id: af79c5f	2017-02-21 16:09:10 -08:00
Yi Wu	381fd32247	Remove timeout_hint_us from WriteOptions Summary: The option has been deprecated for two years and has no effect. Removing. Closes https://github.com/facebook/rocksdb/pull/1866 Differential Revision: D4555203 Pulled By: yiwu-arbug fbshipit-source-id: c48f627	2017-02-17 15:24:17 -08:00
Islam AbdelRahman	fce7a6e196	Fail IngestExternalFile when bg_error_ exists Summary: Fail IngestExternalFile() when bg_error_ exists Closes https://github.com/facebook/rocksdb/pull/1881 Differential Revision: D4580621 Pulled By: IslamAbdelRahman fbshipit-source-id: 1194913	2017-02-17 13:39:17 -08:00
Yi Wu	c2247dc1c7	Make DBImpl::has_unpersisted_data_ atomic Summary: Seems to me `has_unpersisted_data_` is read from read thread and write from write thread concurrently without synchronization. Making it an atomic. I update the logic not because seeing any problem with it, but it just feel confusing. Closes https://github.com/facebook/rocksdb/pull/1869 Differential Revision: D4555837 Pulled By: yiwu-arbug fbshipit-source-id: eff2ab8	2017-02-13 18:54:13 -08:00
Sagar Vemuri	eb912a927e	Remove disableDataSync option Summary: Remove disableDataSync, and another similarly named disable_data_sync options. This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods. Closes https://github.com/facebook/rocksdb/pull/1859 Differential Revision: D4541292 Pulled By: sagar0 fbshipit-source-id: 5b3a6ca	2017-02-13 11:09:13 -08:00
Vitaliy Liptchinsky	1aaa898cf1	Adding GetApproximateMemTableStats method Summary: Added method that returns approx num of entries as well as size for memtables. Closes https://github.com/facebook/rocksdb/pull/1841 Differential Revision: D4511990 Pulled By: VitaliyLi fbshipit-source-id: 9a4576e	2017-02-06 14:54:16 -08:00
Siying Dong	036d668b19	Fix wrong result in data race case related to Get() Summary: In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version. Closes https://github.com/facebook/rocksdb/pull/1816 Differential Revision: D4475958 Pulled By: siying fbshipit-source-id: bd9e67a	2017-02-03 11:39:15 -08:00
Islam AbdelRahman	574b543f80	Rename merger.h -> merging_iterator.h Summary: merger.h was always a confusing name for me, simply give the file a better name Closes https://github.com/facebook/rocksdb/pull/1836 Differential Revision: D4505357 Pulled By: IslamAbdelRahman fbshipit-source-id: 07b28d8	2017-02-02 16:54:19 -08:00
Siying Dong	f25f1ec60b	Add test DBTest2.GetRaceFlush which can expose a data race bug Summary: A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now. Closes https://github.com/facebook/rocksdb/pull/1813 Differential Revision: D4472310 Pulled By: siying fbshipit-source-id: 5755ebd	2017-01-26 16:39:14 -08:00
sdong	5dad9d6d28	Avoid logs_ operation out of DB mutex Summary: logs_.back() is called out of DB mutex, which can cause data race. We move the access into the DB mutex protection area. Closes https://github.com/facebook/rocksdb/pull/1774 Reviewed By: AsyncDBConnMarkedDownDBException Differential Revision: D4417472 Pulled By: AsyncDBConnMarkedDownDBException fbshipit-source-id: 2da1f1e	2017-01-25 15:54:13 -08:00
Islam AbdelRahman	a7b13919bf	Fix CompactFiles() bug when used with CompactionFilter using SuperVersion Summary: GetAndRefSuperVersion() should not be called again in the same thread before ReturnAndCleanupSuperVersion() is called. If we have a compaction filter that is using DB::Get, This will happen ``` CompactFiles() { GetAndRefSuperVersion() // -- first call .. CompactionFilter() { GetAndRefSuperVersion() // -- second call ReturnAndCleanupSuperVersion() } .. ReturnAndCleanupSuperVersion() } ``` We solve this issue in the same way Iterator is solving it, but using GetReferencedSuperVersion() This was discovered in https://github.com/facebook/mysql-5.6/issues/427 by alxyang Closes https://github.com/facebook/rocksdb/pull/1803 Differential Revision: D4460155 Pulled By: IslamAbdelRahman fbshipit-source-id: 5e54322	2017-01-25 14:09:13 -08:00
Islam AbdelRahman	03ca2ac8a9	Remove function from DBImpl that are not used anywhere Summary: GetAndRefSuperVersionUnlocked ReturnAndCleanupSuperVersionUnlocked GetColumnFamilyHandleUnlocked Are dead code that are not used any where Closes https://github.com/facebook/rocksdb/pull/1802 Differential Revision: D4459948 Pulled By: IslamAbdelRahman fbshipit-source-id: 30fa89d	2017-01-24 19:24:13 -08:00
Vitaliy Liptchinsky	753ff84a3d	Fix get approx size Summary: Fixing GetApproximateSize bug for the case of computing stats for mem tables only. Closes https://github.com/facebook/rocksdb/pull/1795 Differential Revision: D4445507 Pulled By: IslamAbdelRahman fbshipit-source-id: 3905846	2017-01-20 15:54:12 -08:00
Changli Gao	5ac97314e7	Fix std::out_of_range when DBOptions::keep_log_file_num is zero Summary: We should validate this option, otherwise we may see std::out_of_range thrown at: db/db_impl.cc:1124 1123 for (unsigned int i = 0; i <= end; i++) { 1124 std::string& to_delete = old_info_log_files.at(i); 1125 std::string full_path_to_delete = 1126 (immutable_db_options_.db_log_dir.empty() Closes https://github.com/facebook/rocksdb/pull/1722 Differential Revision: D4379495 Pulled By: yiwu-arbug fbshipit-source-id: e136552	2017-01-20 13:24:12 -08:00
Vitaliy Liptchinsky	e840213d6e	Change DB::GetApproximateSizes for more flexibility needed for MyRocks Summary: Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables. Closes https://github.com/facebook/rocksdb/pull/1787 Differential Revision: D4441111 Pulled By: IslamAbdelRahman fbshipit-source-id: c11f4c3	2017-01-20 09:39:11 -08:00
Yi Wu	9239103cd4	Flush job should release reference current version if sync log failed Summary: Fix the bug when sync log fail, FlushJob::Run() will not be execute and reference to cfd->current() will not be release. Closes https://github.com/facebook/rocksdb/pull/1792 Differential Revision: D4441316 Pulled By: yiwu-arbug fbshipit-source-id: 5523e28	2017-01-19 23:09:15 -08:00
Reid Horuff	5cf176ca15	Fix for 2PC causing WAL to grow too large Summary: Consider the following single column family scenario: prepare in log A commit in log B WAL is too large, flush all CFs to releast log A CFA is on log B so we do not see CFA is depending on log A so no flush is requested To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on. Closes https://github.com/facebook/rocksdb/pull/1768 Differential Revision: D4403265 Pulled By: reidHoruff fbshipit-source-id: ce800ff	2017-01-19 15:39:12 -08:00
Mike Kolupaev	d18dd2c41f	Abort compactions more reliably when closing DB Summary: DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper). There's also some minor code cleanup along the way. Closes https://github.com/facebook/rocksdb/pull/1639 Differential Revision: D4306571 Pulled By: yiwu-arbug fbshipit-source-id: f050890	2017-01-11 15:09:21 -08:00
Maysam Yabandeh	d0ba8ec8f9	Revert "PinnableSlice" Summary: This reverts commit `54d94e9c2c`. The pull request was landed by mistake. Closes https://github.com/facebook/rocksdb/pull/1755 Differential Revision: D4391678 Pulled By: maysamyabandeh fbshipit-source-id: 36d5149	2017-01-08 14:24:12 -08:00
Maysam Yabandeh	54d94e9c2c	PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: 1. value 100 byte: 1.8% regular, 1.2% merge values 2. value 1k byte: 11.5% regular, 7.5% merge values 3. value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled Closes https://github.com/facebook/rocksdb/pull/1732 Differential Revision: D4374613 Pulled By: maysamyabandeh fbshipit-source-id: a077f1a	2017-01-08 13:54:13 -08:00
Siying Dong	438f22bc56	Fix bug of Checkpoint loses recent transactions with 2PC Summary: If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files. Closes https://github.com/facebook/rocksdb/pull/1724 Differential Revision: D4368319 Pulled By: siying fbshipit-source-id: cc2c746	2016-12-28 12:24:16 -08:00
Aaron Gao	972f96b3fb	direct io write support Summary: rocksdb direct io support ``` [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 5.0 Date: Wed Nov 23 13:17:43 2016 CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz CPUCache: 25600 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 4.393 micros/op 227639 ops/sec; 25.2 MB/s [gzh@dev11575.prn2 ~/roc Closes https://github.com/facebook/rocksdb/pull/1564 Differential Revision: D4241093 Pulled By: lightmark fbshipit-source-id: 98c29e3	2016-12-22 13:09:19 -08:00
Islam AbdelRahman	989e644ed8	Remove sst_file_manager option from LITE Summary: Remove sst_file_manager option from LITE Closes https://github.com/facebook/rocksdb/pull/1690 Differential Revision: D4341331 Pulled By: IslamAbdelRahman fbshipit-source-id: 9f9328d	2016-12-21 17:54:21 -08:00
Daniel Black	bfbcec2339	Gcc 7 error expansion to defined Summary: sorry if these gcc-7/clang-4 cleanups are getting tedious. Closes https://github.com/facebook/rocksdb/pull/1658 Differential Revision: D4318792 Pulled By: yiwu-arbug fbshipit-source-id: 8e85891	2016-12-13 18:39:14 -08:00
Islam AbdelRahman	1a146f89c7	break Flush wait for dropped CF Summary: In FlushJob we dont do the Flush if the CF is dropped https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188 but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped. Closes https://github.com/facebook/rocksdb/pull/1664 Differential Revision: D4321032 Pulled By: IslamAbdelRahman fbshipit-source-id: 6e2b25d	2016-12-13 14:09:12 -08:00
Islam AbdelRahman	2ba59b5a1e	Disallow ingesting files into dropped CFs Summary: This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF. Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF Closes https://github.com/facebook/rocksdb/pull/1657 Differential Revision: D4318657 Pulled By: IslamAbdelRahman fbshipit-source-id: ed6ea2b	2016-12-13 00:54:14 -08:00
Islam AbdelRahman	ed8fbdb560	Add EventListener::OnExternalFileIngested() event Summary: Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events Closes https://github.com/facebook/rocksdb/pull/1623 Differential Revision: D4285844 Pulled By: IslamAbdelRahman fbshipit-source-id: 0b95a88	2016-12-06 14:09:17 -08:00
Anton Safonov	9053fe2a5c	Made delete_obsolete_files_period_micros option dynamic Summary: Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions(). Closes https://github.com/facebook/rocksdb/pull/1595 Differential Revision: D4246569 Pulled By: tonek fbshipit-source-id: d23f560	2016-12-05 14:24:16 -08:00
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	2016-11-21 18:09:13 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Andrew Kryczka	3f62215210	Lazily initialize RangeDelAggregator's map and pinning manager Summary: Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction. Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist. Closes https://github.com/facebook/rocksdb/pull/1539 Differential Revision: D4203488 Pulled By: ajkr fbshipit-source-id: 3b36279	2016-11-18 17:09:11 -08:00
Yi Wu	36e4762ce0	Remove Ticker::SEQUENCE_NUMBER Summary: Remove the ticker count because: * Having to reset the ticker count in WriteImpl is ineffiecent; * It doesn't make sense to have it as a ticker count if multiple db instance share a statistics object. Closes https://github.com/facebook/rocksdb/pull/1531 Differential Revision: D4194442 Pulled By: yiwu-arbug fbshipit-source-id: e2110a9	2016-11-16 22:39:09 -08:00
Andrew Kryczka	489d142808	DeleteRange interface Summary: Expose DeleteRange() interface since we think the implementation is functionally correct now. Closes https://github.com/facebook/rocksdb/pull/1503 Differential Revision: D4171921 Pulled By: ajkr fbshipit-source-id: 5e21c98	2016-11-15 15:24:16 -08:00
Artemiy Kolesnikov	91300d01f6	Dynamic max_total_wal_size option Summary: Closes https://github.com/facebook/rocksdb/pull/1509 Differential Revision: D4176426 Pulled By: yiwu-arbug fbshipit-source-id: b57689d	2016-11-14 22:54:17 -08:00
Lijun Tang	adb665e0bf	Allowed delayed_write_rate option to be dynamically set. Summary: Closes https://github.com/facebook/rocksdb/pull/1488 Differential Revision: D4157784 Pulled By: siying fbshipit-source-id: f150081	2016-11-12 15:54:11 -08:00
Maysam Yabandeh	361010d447	Exporting compaction stats in the form of a map Summary: Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present the stats in any format required by the them. Closes https://github.com/facebook/rocksdb/pull/1477 Differential Revision: D4149836 Pulled By: maysamyabandeh fbshipit-source-id: b3df19f	2016-11-11 20:54:14 -08:00
Reid Horuff	1ca5f6d132	Fix 2PC Recovery SeqId Miscount Summary: Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL. Closes https://github.com/facebook/rocksdb/pull/1486 Differential Revision: D4156064 Pulled By: reidHoruff fbshipit-source-id: a6da8d9	2016-11-10 11:09:22 -08:00
Andrew Kryczka	c90fef88b1	fix open failure with empty wal Summary: Closes https://github.com/facebook/rocksdb/pull/1490 Differential Revision: D4158821 Pulled By: IslamAbdelRahman fbshipit-source-id: 59b73f4	2016-11-09 22:24:26 -08:00
Reid Horuff	d133b08f68	Use correct sequence number when creating memtable Summary: copied from: `5ebfd2623a` Opening existing RocksDB attempts recovery from log files, which uses wrong sequence number to create the memtable. This is a regression introduced in change `a400336`. This change includes a test demonstrating the problem, without the fix the test fails with "Operation failed. Try again.: Transaction could not check for conflicts for operation at SequenceNumber 1 as the MemTable only contains changes newer than SequenceNumber 2. Increasing the value of the max_write_buffer_number_to_maintain option could reduce the frequency of this error" This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname and me. Closes https://github.com/facebook/rocksdb/pull/1458 Differential Revision: D4143791 Pulled By: reidHoruff fbshipit-source-id: 5a25033	2016-11-09 12:24:17 -08:00
Islam AbdelRahman	9bd191d2f4	Fix deadlock between (WriterThread/Compaction/IngestExternalFile) Summary: A deadlock is possible if this happen (1) Writer thread is stopped because it's waiting for compaction to finish (2) Compaction is waiting for current IngestExternalFile() calls to finish (3) IngestExternalFile() is waiting to be able to acquire the writer thread (4) WriterThread is held by stopped writes that are waiting for compactions to finish This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread. This patch include a unittest to reproduce the described scenario Closes https://github.com/facebook/rocksdb/pull/1480 Differential Revision: D4151646 Pulled By: IslamAbdelRahman fbshipit-source-id: 09b39db	2016-11-09 10:54:10 -08:00
Andrew Kryczka	9e7cf3469b	DeleteRange user iterator support Summary: Note: reviewed in https://reviews.facebook.net/D65115 - DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times. - DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205) - DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones Closes https://github.com/facebook/rocksdb/pull/1464 Differential Revision: D4131753 Pulled By: ajkr fbshipit-source-id: be86559	2016-11-04 12:09:22 -07:00
Andrew Kryczka	f998c9790f	DeleteRange Get support Summary: During Get()/MultiGet(), build up a RangeDelAggregator with range tombstones as we search through live memtable, immutable memtables, and SST files. This aggregator is then used by memtable.cc's SaveValue() and GetContext::SaveValue() to check whether keys are covered. added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761 Closes https://github.com/facebook/rocksdb/pull/1456 Differential Revision: D4111271 Pulled By: ajkr fbshipit-source-id: 6e388d4	2016-11-03 18:54:20 -07:00
Yi Wu	437942e481	Add avoid_flush_during_shutdown DB option Summary: Add avoid_flush_during_shutdown DB option. Closes https://github.com/facebook/rocksdb/pull/1451 Differential Revision: D4108643 Pulled By: yiwu-arbug fbshipit-source-id: abdaf4d	2016-11-02 15:39:18 -07:00
Andrew Kryczka	40a2e406f8	DeleteRange flush support Summary: Changed BuildTable() (used for flush) to (1) add range tombstones to the aggregator, which is used by CompactionIterator to determine which keys can be removed; and (2) add aggregator's range tombstones to the table that is output for the flush. Closes https://github.com/facebook/rocksdb/pull/1438 Differential Revision: D4100025 Pulled By: ajkr fbshipit-source-id: cb01a70	2016-10-31 20:54:18 -07:00
Siying Dong	da61f348d3	Print compression and Fast CRC support info as Header level Summary: Currently the compression suppport and fast CRC support information is printed as info level. They should be in the same level as options, which is header level. Also add ZSTD to this printing. Closes https://github.com/facebook/rocksdb/pull/1448 Differential Revision: D4106608 Pulled By: yiwu-arbug fbshipit-source-id: cb9a076	2016-10-31 16:09:13 -07:00
Siying Dong	c90c48d3c8	Show More DB Stats in info logs Summary: DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line. Closes https://github.com/facebook/rocksdb/pull/1439 Differential Revision: D4100943 Pulled By: yiwu-arbug fbshipit-source-id: 79f7218	2016-10-29 16:09:18 -07:00
Siying Dong	04751d5345	L0 compression should follow options.compression_per_level if not empty Summary: Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0]. Reviewed and accepted in: https://reviews.facebook.net/D65607 Closes https://github.com/facebook/rocksdb/pull/1435 Differential Revision: D4099368 Pulled By: siying fbshipit-source-id: cfbbdcd	2016-10-28 17:39:20 -07:00
Aaron Gao	9de2f75216	revert Prev() in MergingIterator to use previous code in non-prefix-seek mode Summary: Siying suggested to keep old code for normal mode prev() for safety Test Plan: make check -j64 Reviewers: yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65439	2016-10-24 13:13:01 -07:00
Aaron Gao	59a7c0337b	Change ioptions to store user_comparator, fix bug Summary: change ioptions.comparator to user_comparator instread of internal_comparator. Also change Comparator* to InternalKeyComparator* to make its type explicitly. Test Plan: make all check -j64 Reviewers: andrewkr, sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65121	2016-10-21 11:31:42 -07:00
Islam AbdelRahman	869ae5d786	Support IngestExternalFile (remove AddFile restrictions) Summary: Changes in the diff API changes: - Introduce IngestExternalFile to replace AddFile (I think this make the API more clear) - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file) - Deprecate AddFile() API Logic changes: - If our file overlap with the memtable we will flush the memtable - We will find the first level in the LSM tree that our file key range overlap with the keys in it - We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it - We will assign a global sequence number to our new file - Remove AddFile restrictions by using global sequence numbers Other changes: - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob Test Plan: unit tests (still need to add more) addfile_stress (https://reviews.facebook.net/D65037) Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong Reviewed By: sdong Subscribers: jkedgar, hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65061	2016-10-20 17:05:32 -07:00
Andrew Kryczka	f47054015d	Handle WAL deletion when using avoid_flush_during_recovery Summary: Previously the WAL files that were avoided during recovery would never be considered for deletion. That was because alive_log_files_ was only populated when log files are created. This diff further populates alive_log_files_ with existing log files that aren't flushed during recovery, such that FindObsoleteFiles() can find them later. Depends on D64053. Test Plan: new unit test, verifies it fails before this change and passes after Reviewers: sdong, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64059	2016-10-14 12:59:51 -07:00
Yi Wu	e29d3b67c2	Make max_background_compactions and base_background_compactions dynamic changeable Summary: Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions. I'll add more dynamic changeable options soon. Test Plan: unit test. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64749	2016-10-14 12:25:39 -07:00
Islam AbdelRahman	5691a1d8a4	Fix compaction conflict with running compaction Summary: Issue scenario: (1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2 (2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files) (3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile() Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D64947	2016-10-13 10:49:06 -07:00
Andrew Kryczka	017de666c7	fixup commit Summary: I accidentally left out these changes from my commit of D64053 due to messing up the merge conflict resolution. Test Plan: ./db_wal_test Reviewers: Subscribers: Tasks: Blame Revision: D64053	2016-10-13 08:48:40 -07:00
Andrew Kryczka	1b7af5fb1a	Redo handling of recycled logs in full purge Summary: This reverts commit `9e4aa798c3`, which doesn't handle all cases (see inline comment). I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313. This approach has two benefits: - All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles. - We only need to check whether log file is recycled in one place where we've already determined it's a log file Test Plan: new unit test, verified fails before the original fix, still passes now. Reviewers: IslamAbdelRahman, yiwu, sdong Reviewed By: yiwu, sdong Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64053	2016-10-12 23:13:09 -07:00
Aaron Gao	447f17127c	new Prev() prefix support using SeekForPrev() Summary: 1) The previous solution for Prev() prefix support is not clean. Since I add api SeekForPrev(), now the Prev() can be symmetric to Next(). and we do not need SeekToLast() to be called in Prev() any more. Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward. 2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor. add test cases for the above two cases. There are some tests for the SeekToLast() in Prev(), I will clean them later. Test Plan: make all check Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63933	2016-10-11 13:54:26 -07:00
Reid Horuff	2c1f95291d	Add facility to write only a portion of WriteBatch to WAL Summary: When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement. ``` RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628 RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054 ``` Test Plan: Two unit tests added. Reviewers: sdong, yiwu, IslamAbdelRahman Reviewed By: yiwu Subscribers: hermanlee4, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64599	2016-10-07 11:32:10 -07:00
Islam AbdelRahman	87dfc1d23e	Fix conflict between AddFile() and CompactRange() Summary: Fix the conflict bug between AddFile() and CompactRange() by - Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction - If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it This will solve these 2 scenarios - If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction - If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64449	2016-09-28 15:42:06 -07:00
yiwu-arbug	eb3894cf42	Recompute compaction score on SetOptions (#1346 ) Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it. Test Plan: See unit test. Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64167	2016-09-27 11:17:15 -07:00
Islam AbdelRahman	5c64fb67d2	Fix AddFile() conflict with compaction output [WaitForAddFile()] Summary: Since AddFile unlock/lock the mutex inside LogAndApply() we need to ensure that during this period other compactions cannot run since such compactions are not aware of the file we are ingesting and could create a compaction that overlap wit this file this diff add - WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now - Call `WaitForAddFile()` in 3 locations -- When doing manual Compaction -- When starting automatic Compaction -- When doing CompactFiles() Test Plan: unit test Reviewers: lightmark, yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64383	2016-09-27 00:14:55 -07:00
Yi Wu	9ed928e7a9	Split DBOptions into ImmutableDBOptions and MutableDBOptions Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs. Test Plan: make all check Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64065	2016-09-23 16:34:04 -07:00
yiwu-arbug	4bc8c88e6b	Recover same sequence id from WAL (#1350 ) Summary: Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later. This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id. Test Plan: Added unit test. Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64275	2016-09-23 16:15:14 -07:00
Aaron Gao	715256338a	forbid merge during recovery Summary: Mitigate regression bug of options.max_successive_merges hit during DB Recovery For https://reviews.facebook.net/D62625 Test Plan: make all check Reviewers: horuff, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62655	2016-09-21 11:05:07 -07:00
Yi Wu	e4d3f5d9b8	Fix DBImpl::GetWalPreallocateBlockSize Mac build error Summary: Specify type param with std::min to resolve compile error on Mac. Test Plan: https://travis-ci.org/facebook/rocksdb/builds/161223845 Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64143	2016-09-20 10:17:28 -07:00
sdong	d78a4401b5	DBImpl::GetWalPreallocateBlockSize() should return size_t Summary: WritableFile::SetPreallocationBlockSize() requires parameter as size_t, and options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. WritableFile::SetPreallocationBlockSize() should return size_t to avoid build break if size_t is not uint64_t. Test Plan: Run existing tests. Reviewers: andrewkr, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64137	2016-09-19 16:51:38 -07:00
sdong	b666f85445	Consider more factors when determining preallocation size of WAL files Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller. Test Plan: Add a unit test. Reviewers: andrewkr, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63957	2016-09-19 12:04:35 -07:00
Yi Wu	0a88f38b7e	Remove ColumnFamilyData::options() Summary: One more small refactor before I split DBOptions into mutable and immutable parts. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64047	2016-09-16 15:09:14 -07:00
Andrew Kryczka	06b4785fec	Fix recovery for WALs without data for all CFs Summary: if one or more CFs had no data in the WAL, the log number that's used by FindObsoleteFiles() wasn't updated. We need to treat this case the same as if the data for that WAL had been flushed. Test Plan: new unit test Reviewers: IslamAbdelRahman, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63963	2016-09-15 11:40:48 -07:00

1 2 3 4 5 ...

1194 Commits