rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-30 04:41:49 +00:00

Author	SHA1	Message	Date
Siying Dong	67d7623794	Expose the stalling information through DB::GetProperty() Summary: Add two DB properties: rocksdb.actual_delayed_write_rate and rocksdb.is_write_stooped, for people to know whether current writes are being throttled. Closes https://github.com/facebook/rocksdb/pull/2043 Differential Revision: D4782975 Pulled By: siying fbshipit-source-id: 6b2f5cf	2017-03-29 11:54:20 -07:00
Maysam Yabandeh	e7731d119a	Configure index partition size Summary: Allow the users to specify the target index partition size. With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut. Closes https://github.com/facebook/rocksdb/pull/2041 Differential Revision: D4780216 Pulled By: maysamyabandeh fbshipit-source-id: 95a0831	2017-03-28 12:09:12 -07:00
Siying Dong	91b5feb37b	Fix Windows Build broken by a recent commit Summary: Closes https://github.com/facebook/rocksdb/pull/2032 Differential Revision: D4766260 Pulled By: siying fbshipit-source-id: 415daa4	2017-03-23 18:09:57 -07:00
Warren Falk	41ccae6d26	Add C API functions (and tests) for WriteBatchWithIndex Summary: I've added functions to the C API to support WriteBatchWithIndex as requested in #1833. I've also added unit tests to c_test I've implemented the WriteBatchWithIndex variation of every function available for regular WriteBatch. And added additional functions unique to WriteBatchWithIndex. For now, the following is omitted: 1. The ability to create WriteBatchWithIndex's custom batch-only iterator as I'm not sure what its purpose is. It should be possible to add later if anyone wants it. 2. The ability to create the batch with a fallback comparator, since it appears to be unnecessary. I believe the column family comparator will be used for this, meaning those using a custom comparator can just use the column family variations. Closes https://github.com/facebook/rocksdb/pull/1985 Differential Revision: D4760039 Pulled By: siying fbshipit-source-id: 393227e	2017-03-23 15:54:13 -07:00
Daniel Black	f4fce4751e	Fix clang compile error - [-Werror,-Wunused-lambda-capture] Summary: Errors where: db/version_set.cc:1535:20: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture] [this](const Fsize& f1, const Fsize& f2) -> bool { ^ db/version_set.cc:1541:20: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture] [this](const Fsize& f1, const Fsize& f2) -> bool { ^ db/db_test.cc:2983:27: error: lambda capture 'kNumPutsBeforeWaitForFlush' is not required to be captured for this use [-Werror,-Wunused-lambda-capture] auto gen_l0_kb = [this, kNumPutsBeforeWaitForFlush](int size) { ^ Closes https://github.com/facebook/rocksdb/pull/1972 Differential Revision: D4685991 Pulled By: siying fbshipit-source-id: 9125379	2017-03-22 18:09:10 -07:00
Siying Dong	15950fe3a0	Remove ASSERT_EQ(boolean, ...) Summary: Closes https://github.com/facebook/rocksdb/pull/2024 Differential Revision: D4755420 Pulled By: siying fbshipit-source-id: 7332ab1	2017-03-22 15:54:12 -07:00
Aaron Gao	3e56c7e0c4	make total_log_size_ atomic Summary: make total_log_size_ atomic to avoid overflow caused by data race. Closes https://github.com/facebook/rocksdb/pull/2019 Differential Revision: D4751391 Pulled By: siying fbshipit-source-id: fac01dd	2017-03-22 11:54:40 -07:00
Dmitri Smirnov	be723a8d8c	Optionally construct Post Processing Info map in MemTableInserter Summary: MemTableInserter default constructs Post processing info std::map. However, on Windows with 2015 STL the default constructed map still dynamically allocates one node which shows up on a profiler and we loose ~40% throughput on fillrandom benchmark. Solution: declare a map as std::aligned storage and optionally construct. This addresses https://github.com/facebook/rocksdb/issues/1976 Before: ------------------------------------------------------------------- Initializing RocksDB Options from command-line flags DB path: [k:\data\BulkLoadRandom_10M_fillonly] fillrandom : 2.775 micros/op 360334 ops/sec; 280.4 MB/s Microseconds per write: Count: 10000000 Average: 2.7749 StdDev: 39.92 Min: 1 Median: 2.0826 Max: 26051 Percentiles: P50: 2.08 P75: 2.55 P99: 3.55 P99.9: 9.58 P99.99: 51.5**6 ------------------------------------------------------ After: Initializing RocksDB Options from command-line flags DB path: [k:\data\BulkLoadRandom_10M_fillon Closes https://github.com/facebook/rocksdb/pull/2011 Differential Revision: D4740823 Pulled By: siying fbshipit-source-id: 1daaa2c	2017-03-22 11:24:12 -07:00
Maysam Yabandeh	8b0097b49b	Readers for partition filter Summary: This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests. Closes https://github.com/facebook/rocksdb/pull/1961 Differential Revision: D4672216 Pulled By: maysamyabandeh fbshipit-source-id: 6a2b829	2017-03-22 09:24:15 -07:00
Siying Dong	8f5bf04468	Flush triggered by DB write buffer size picks the oldest unflushed CF Summary: Previously, when DB write buffer size triggers, we always pick the CF with most data in its memtable to flush. This approach can minimize total flush happens. Change the behavior to always pick the oldest unflushed CF, which makes it the same behavior when max_total_wal_size hits. This approach will minimize size used by max_total_wal_size. Closes https://github.com/facebook/rocksdb/pull/1987 Differential Revision: D4703214 Pulled By: siying fbshipit-source-id: 9ff8b09	2017-03-21 11:09:10 -07:00
Raza Hussain	6908e24b56	dynamic setting of stats_dump_period_sec through SetDBOption() Summary: Resolved the following issue: https://github.com/facebook/rocksdb/issues/1930 Closes https://github.com/facebook/rocksdb/pull/2004 Differential Revision: D4736764 Pulled By: yiwu-arbug fbshipit-source-id: 64fe0b7	2017-03-20 22:54:13 -07:00
Islam AbdelRahman	d52f334cbd	Break stalls when no bg work is happening Summary: Current stall will keep sleeping even if there is no Flush/Compactions to wait for, I changed the logic to break the stall if we are not flushing or compacting db_bench command used ``` # fillrandom # memtable size = 10MB # value size = 1 MB # num = 1000 # use /dev/shm ./db_bench --benchmarks="fillrandom,stats" --value_size=1048576 --write_buffer_size=10485760 --num=1000 --delayed_write_rate=XXXXX --db="/dev/shm/new_stall" \| grep "Cumulative stall" ``` ``` Current results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.031 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00:00:22.314 H:M:S # delayed_write_rate = 100 Kb/sec Cumulative stall: 00:00:42.784 H:M:S # delayed_write_rate = 50 Kb/sec Cumulative stall: 00:01:23.785 H:M:S # delayed_write_rate = 25 Kb/sec Cumulative stall: 00:02:45.702 H:M:S ``` ``` New results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.017 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00 Closes https://github.com/facebook/rocksdb/pull/1884 Differential Revision: D4585439 Pulled By: IslamAbdelRahman fbshipit-source-id: aed2198	2017-03-16 18:24:17 -07:00
Islam AbdelRahman	995618a821	Support SstFileManager::SetDeleteRateBytesPerSecond() Summary: Update DeleteScheduler component to support changing delete rate in runtime by introducing SstFileManager::SetDeleteRateBytesPerSecond() Closes https://github.com/facebook/rocksdb/pull/1994 Differential Revision: D4719906 Pulled By: IslamAbdelRahman fbshipit-source-id: e6b8d9e	2017-03-16 12:09:15 -07:00
Islam AbdelRahman	e19163688b	Add macros to include file name and line number during Logging Summary: current logging ``` 2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25 2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2. 2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK 2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK 2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0. 2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6 2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1 2017/03/14-14:20:31. Closes https://github.com/facebook/rocksdb/pull/1990 Differential Revision: D4708695 Pulled By: IslamAbdelRahman fbshipit-source-id: cb8968f	2017-03-15 19:39:12 -07:00
Maysam Yabandeh	11526252cc	Pinnableslice (2nd attempt) Summary: PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: value 100 byte: 1.8% regular, 1.2% merge values value 1k byte: 11.5% regular, 7.5% merge values value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The d Closes https://github.com/facebook/rocksdb/pull/1756 Differential Revision: D4391738 Pulled By: maysamyabandeh fbshipit-source-id: 6f3edd3	2017-03-13 11:54:10 -07:00
Sagar Vemuri	1ffbdfd9a7	Add a new SstFileWriter constructor without explicit comparator Summary: The comparator param in SstFileWriter constructor is redundant as it already exists as a field in options. So the current SstFileWriter constructor should be deprecated in favor of a new one which does not take a comparator. Note that the jni/java apis have not been touched yet. Closes https://github.com/facebook/rocksdb/pull/1978 Differential Revision: D4685629 Pulled By: sagar0 fbshipit-source-id: 372ce96	2017-03-13 11:39:13 -07:00
Reid Horuff	ebd5639b6d	Add ability to search for key prefix in sst_dump tool Summary: Add the flag --prefix to the sst_dump tool This flag is similar to, and exclusive from, the --from flag. --prefix=0x00FF will return all rows prefixed with 0x00FF. The --to flag may also be specified and will work as expected. These changes were used to help in debugging the power cycle corruption issue and theses changes were tested by scanning through a udb. Closes https://github.com/facebook/rocksdb/pull/1984 Differential Revision: D4691814 Pulled By: reidHoruff fbshipit-source-id: 027f261	2017-03-13 10:39:12 -07:00
Maysam Yabandeh	e6725e8c8d	Fix some bugs in MockEnv Summary: Fixing some bugs in MockEnv so it be actually used. Closes https://github.com/facebook/rocksdb/pull/1914 Differential Revision: D4609923 Pulled By: maysamyabandeh fbshipit-source-id: ca25735	2017-03-13 09:54:11 -07:00
Andrew Kryczka	f2817fb7f9	avoid ASSERT_EQ(false, ...); Summary: lately it fails on travis due to a compiler bug (see https://github.com/google/googletest/issues/322#issuecomment-125645145). interestingly it seems to affect occurrences of `ASSERT_EQ(false, ...);` but not `ASSERT_EQ(true, ...);`. Closes https://github.com/facebook/rocksdb/pull/1958 Differential Revision: D4680742 Pulled By: ajkr fbshipit-source-id: 291fe41	2017-03-08 22:24:16 -08:00
Andrew Kryczka	5b11124e39	add max to histogram stats Summary: Domas enlightened me about p100 (i.e., max) stats. Let's add them to our histograms. Closes https://github.com/facebook/rocksdb/pull/1968 Differential Revision: D4678716 Pulled By: ajkr fbshipit-source-id: 65e7118	2017-03-08 22:24:15 -08:00
Andrew Kryczka	18fc1bc0e0	minor changes for rate limiter test flakiness Summary: the 50%+ drained constraint wasn't working consistently in some of our test environments, maybe their resources are too low. relax the constraints a bit. Closes https://github.com/facebook/rocksdb/pull/1970 Differential Revision: D4679419 Pulled By: ajkr fbshipit-source-id: 3789cd8	2017-03-08 17:54:11 -08:00
Aaron Gao	12ba00ea65	Reset DBIter::saved_key_ with proper user key anywhere before pass to DBIter::FindNextUserEntry Summary: fix db_iter bug introduced by [facebook#1413](https://github.com/facebook/rocksdb/pull/1413) Closes https://github.com/facebook/rocksdb/pull/1962 Differential Revision: D4672369 Pulled By: lightmark fbshipit-source-id: 6a22953	2017-03-08 17:24:11 -08:00
Sagar Vemuri	97edc72d39	Add a memtable-only iterator Summary: This PR is to support a way to iterate over all the keys that are just in memtables. Closes https://github.com/facebook/rocksdb/pull/1953 Differential Revision: D4663500 Pulled By: sagar0 fbshipit-source-id: 144e177	2017-03-07 11:54:10 -08:00
Leonidas Galanis	72202962f9	fix db_sst_test flakiness Summary: db_sst_test had been flaky occasionally in the following way: reached_max_space_on_compaction can in very rare cases be 0. This happens when the limit on maximum allowable space set using SetMaxAllowedSpaceUsage is hit during flush for all test db sizes (1,2,4,8 and 10MB).The fix clears the error returned when the the space limit is reached during flush. This ensures that the compaction call back will always be called. The runtime is increased slightly because the 1MB loop writes more data and hits the limit during multiple flushes until compaction is scheduled. Closes https://github.com/facebook/rocksdb/pull/1861 Differential Revision: D4557396 Pulled By: lgalanis fbshipit-source-id: ff778d1	2017-03-07 11:24:13 -08:00
Reid Horuff	58b12dfe37	Set logs as getting flushed before releasing lock, race condition fix Summary: Relating to #1903: In MaybeFlushColumnFamilies() we want to modify the 'getting_flushed' flag before releasing the db mutex when SwitchMemtable() is called. The following 2 actions need to be atomic in MaybeFlushColumnFamilies() - getting_flushed is false on oldest log - we determine that all CFs can be flushed to successfully release oldest log - we set getting_flushed = true on the oldest log. ------- - getting_flushed is false on oldest log - we determine that all CFs can NOT be flushed to successfully release oldest log - we set unable_to_flush_oldest_log_ = true on the oldest log. #### In the 2pc case: T1 enters function but is unable to flush all CFs to release log T1 sets unable_to_flush_oldest_log_ = true T1 begins flushing all CFs possible T2 enters function but is unable to flush all CFs to release log T2 sees unable_to_flush_oldes_log_ has been set so exits T3 enters function and will be able to flush all CFs to release oldest log T3 sets getting_flushed = true on oldes Closes https://github.com/facebook/rocksdb/pull/1909 Differential Revision: D4646235 Pulled By: reidHoruff fbshipit-source-id: c8d0447	2017-03-06 15:09:11 -08:00
Maysam Yabandeh	534581a356	Fix a bug in tests in options operator= Summary: Note: Using the default operator= is an unsafe approach for Options since it destructs shared_ptr in the same order of their creation, in contrast to destructors which destructs them in the opposite order of creation. One particular problme is that the cache destructor might invoke callback functions that use Option members such as statistics. To work around this problem, we manually call destructor of table_facotry which eventually clears the block cache. Closes https://github.com/facebook/rocksdb/pull/1950 Differential Revision: D4655473 Pulled By: maysamyabandeh fbshipit-source-id: 6c4bbff	2017-03-05 18:09:09 -08:00
Andrew Kryczka	4561275c2d	fix rate limiter test flakiness Summary: fix when elapsed time spans non-integral number of intervals since the rate limiter may still be drained during a partial interval. Closes https://github.com/facebook/rocksdb/pull/1948 Differential Revision: D4651304 Pulled By: ajkr fbshipit-source-id: b1f9e70	2017-03-03 11:09:11 -08:00
Andrew Kryczka	7c80a6d7d1	Statistic for how often rate limiter is drained Summary: This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case. The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor. Closes https://github.com/facebook/rocksdb/pull/1946 Differential Revision: D4646489 Pulled By: ajkr fbshipit-source-id: d8e0161	2017-03-02 17:54:15 -08:00
Aaron Gao	6fb9013441	sanitize readahead when direct read enabled Summary: no readahead: readseq : 8.438 micros/op 118510 ops/sec; 13.1 MB/s sanitize to 10MB: readseq : 6.051 micros/op 165248 ops/sec; 18.3 MB/s Closes https://github.com/facebook/rocksdb/pull/1945 Differential Revision: D4645811 Pulled By: lightmark fbshipit-source-id: 5d63770	2017-03-02 17:24:11 -08:00
Islam AbdelRahman	f89b3893c0	Remove skip_table_builder_flush and default it to true Summary: This option is needed to be enabled for Direct IO and I cannot think of a reason where we need to disable it remove it and default it to true Closes https://github.com/facebook/rocksdb/pull/1944 Differential Revision: D4641088 Pulled By: IslamAbdelRahman fbshipit-source-id: d7085b9	2017-03-02 16:54:10 -08:00
Siying Dong	d5b607a43f	Make db_wal_test slightly faster Summary: Avoid to run db_wal_test in all the DB test options, and some small changes. Closes https://github.com/facebook/rocksdb/pull/1921 Differential Revision: D4622054 Pulled By: siying fbshipit-source-id: 890fd64	2017-02-28 17:39:10 -08:00
Siying Dong	ba4c77bd6b	Divide external_sst_file_test Summary: Separate the platform dependent tests from external_sst_file_test. Only those tests need to run on platforms like OSX Closes https://github.com/facebook/rocksdb/pull/1923 Differential Revision: D4622461 Pulled By: siying fbshipit-source-id: d2d6f04	2017-02-28 14:24:11 -08:00
Aaron Gao	e877afa08b	Remove bulk loading and auto_roll_logger in rocksdb_lite Summary: shrink lite size Closes https://github.com/facebook/rocksdb/pull/1929 Differential Revision: D4622059 Pulled By: siying fbshipit-source-id: 050b796	2017-02-28 11:09:11 -08:00
Peter (Stig) Edwards	2ca2059f66	Get unique_ptr to use delete[] for char[] in DumpMallocStats Summary: Avoid mismatched free() / delete / delete [] in DumpMallocStats Closes https://github.com/facebook/rocksdb/pull/1927 Differential Revision: D4622045 Pulled By: siying fbshipit-source-id: 1131b30	2017-02-27 17:39:12 -08:00
Siying Dong	8ad0fcdf99	Separate small subset tests in DBTest Summary: Separate a smal subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux. Closes https://github.com/facebook/rocksdb/pull/1924 Differential Revision: D4616702 Pulled By: siying fbshipit-source-id: 13e6549	2017-02-27 12:24:11 -08:00
Siying Dong	3b8ba703cb	Fix flaky DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 Summary: A previous fix to DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 didn't address the right problem. The problem is L0->L0 compaction is not trivial move in the scenario, not parallel compactions. Fix this. Closes https://github.com/facebook/rocksdb/pull/1911 Differential Revision: D4608955 Pulled By: siying fbshipit-source-id: 7a712cb	2017-02-23 18:39:13 -08:00
Siying Dong	8efb5ffa2a	[rocksdb][PR] Remove option min_partial_merge_operands and verify_checksums_in_comp… Summary: …action The two options, min_partial_merge_operands and verify_checksums_in_compaction, are not seldom used. Remove them to reduce the total number of options. Also remove them from Java and C interface. Closes https://github.com/facebook/rocksdb/pull/1902 Differential Revision: D4601219 Pulled By: siying fbshipit-source-id: aad4cb2	2017-02-23 15:09:12 -08:00
Siying Dong	1ba2804b7f	Remove XFunc tests Summary: Xfunc is hardly used. Remove it to keep the code simple. Closes https://github.com/facebook/rocksdb/pull/1905 Differential Revision: D4603220 Pulled By: siying fbshipit-source-id: 731f96d	2017-02-23 12:09:11 -08:00
Aaron Gao	1ef5f50e84	detect logical sector size Summary: querying logical sector size from the device instead of hardcoding it for linux platform. Closes https://github.com/facebook/rocksdb/pull/1875 Differential Revision: D4591946 Pulled By: ajkr fbshipit-source-id: 4e9805c	2017-02-23 11:25:36 -08:00
Aaron Gao	0824934423	truncate patch Summary: omit the override for the previous commit Closes https://github.com/facebook/rocksdb/pull/1898 Differential Revision: D4598743 Pulled By: lightmark fbshipit-source-id: f98a378	2017-02-22 10:39:11 -08:00
Aaron Gao	286a36db7f	posix writablefile truncate Summary: we occasionally missing this call so the file size will be wrong Closes https://github.com/facebook/rocksdb/pull/1894 Differential Revision: D4598446 Pulled By: lightmark fbshipit-source-id: 42b6ef5	2017-02-22 10:09:14 -08:00
Mike Kolupaev	18eeb7b90e	Fix interference between max_total_wal_size and db_write_buffer_size checks Summary: This is a trivial fix for OOMs we've seen a few days ago in logdevice. RocksDB get into the following state: (1) Write throughput is too high for flushes to keep up. Compactions are out of the picture - automatic compactions are disabled, and for manual compactions we don't care that much if they fall behind. We write to many CFs, with only a few L0 sst files in each, so compactions are not needed most of the time. (2) total_log_size_ is consistently greater than GetMaxTotalWalSize(). It doesn't get smaller since flushes are falling ever further behind. (3) Total size of memtables is way above db_write_buffer_size and keeps growing. But the write_buffer_manager_->ShouldFlush() is not checked because (2) prevents it (for no good reason, afaict; this is what this commit fixes). (4) Every call to WriteImpl() hits the MaybeFlushColumnFamilies() path. This keeps flushing the memtables one by one in order of increasing log file number. (5) No write stalling trigger is hit. We rely on max_write_buffer_number Closes https://github.com/facebook/rocksdb/pull/1893 Differential Revision: D4593590 Pulled By: yiwu-arbug fbshipit-source-id: af79c5f	2017-02-21 16:09:10 -08:00
Aaron Gao	2a0f3d0de1	level compaction expansion Summary: reimplement the compaction expansion on lower level. Considering such a case: input level file: 1[B E] 2[F G] 3[H I] 4 [J M] output level file: 5[A C] 6[D K] 7[L O] If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level. The previous code is messy and wrong. In this diff, I first determine the input range [a, b], and output range [c, d], then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction. Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction. Closes https://github.com/facebook/rocksdb/pull/1760 Differential Revision: D4395564 Pulled By: lightmark fbshipit-source-id: 2dc2c5c	2017-02-21 10:24:17 -08:00
Yi Wu	381fd32247	Remove timeout_hint_us from WriteOptions Summary: The option has been deprecated for two years and has no effect. Removing. Closes https://github.com/facebook/rocksdb/pull/1866 Differential Revision: D4555203 Pulled By: yiwu-arbug fbshipit-source-id: c48f627	2017-02-17 15:24:17 -08:00
Islam AbdelRahman	fce7a6e196	Fail IngestExternalFile when bg_error_ exists Summary: Fail IngestExternalFile() when bg_error_ exists Closes https://github.com/facebook/rocksdb/pull/1881 Differential Revision: D4580621 Pulled By: IslamAbdelRahman fbshipit-source-id: 1194913	2017-02-17 13:39:17 -08:00
Shu Zhang	756c5924e6	Allow adding external v1 sst file with no global seqno support Summary: This is a follow up fix for https://github.com/facebook/rocksdb/pull/1783. After it, we should be able to ingest external v1 sst files with no global seqno field. Closes https://github.com/facebook/rocksdb/pull/1874 Differential Revision: D4576194 Pulled By: IslamAbdelRahman fbshipit-source-id: 5b34a3e	2017-02-16 17:09:12 -08:00
Aaron Gao	db2b4eb50e	avoid direct io in rocksdb_lite Summary: fix lite bugs disable direct io in lite mode Closes https://github.com/facebook/rocksdb/pull/1870 Differential Revision: D4559866 Pulled By: yiwu-arbug fbshipit-source-id: 3761c51	2017-02-16 10:39:13 -08:00
Andrew Kryczka	43e9f01c20	Fix repair_test on ROCKSDB_LITE Summary: RepairDB isn't included in rocksdb lite, so don't test it. Closes https://github.com/facebook/rocksdb/pull/1873 Differential Revision: D4565094 Pulled By: ajkr fbshipit-source-id: 8cc0898	2017-02-15 11:24:12 -08:00
Xiaofei Du	7106a994fe	Use monotonic time points in write_controller.cc and rate_limiter.cc Summary: NowMicros() provides non-monotonic time. When wall clock is synchronized or changed, the non-monotonicity time points will affect write rate controllers. This patch changes write_controller.cc and rate_limiter.cc to use monotonic time points. Closes https://github.com/facebook/rocksdb/pull/1865 Differential Revision: D4561732 Pulled By: siying fbshipit-source-id: 95ece62	2017-02-14 18:24:24 -08:00
Yi Wu	c2247dc1c7	Make DBImpl::has_unpersisted_data_ atomic Summary: Seems to me `has_unpersisted_data_` is read from read thread and write from write thread concurrently without synchronization. Making it an atomic. I update the logic not because seeing any problem with it, but it just feel confusing. Closes https://github.com/facebook/rocksdb/pull/1869 Differential Revision: D4555837 Pulled By: yiwu-arbug fbshipit-source-id: eff2ab8	2017-02-13 18:54:13 -08:00
Sagar Vemuri	eb912a927e	Remove disableDataSync option Summary: Remove disableDataSync, and another similarly named disable_data_sync options. This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods. Closes https://github.com/facebook/rocksdb/pull/1859 Differential Revision: D4541292 Pulled By: sagar0 fbshipit-source-id: 5b3a6ca	2017-02-13 11:09:13 -08:00
Dmitri Smirnov	a5adda0642	Fix repair issues Summary: Record the first parsed sequence number as the minimum so we can find the true minimum otherwise everything is larger than zero. Fix the comparator name comparision. Closes https://github.com/facebook/rocksdb/pull/1858 Differential Revision: D4544365 Pulled By: ajkr fbshipit-source-id: 439cbc2	2017-02-10 10:54:12 -08:00
Andrew Kryczka	b48e4778be	Consolidate file cutting logic in compaction loop Summary: It was really annoying to have two places (top and bottom of compaction loop) where we cut output files. I had bugs in both DeleteRange and dictionary compression due to updating only one of the two. This diff consolidates the file-cutting logic to the bottom of the compaction loop. Keep in mind that my goal with input_status is to be consistent with the past behavior, even though I'm not sure it's ideal. Closes https://github.com/facebook/rocksdb/pull/1832 Differential Revision: D4503038 Pulled By: ajkr fbshipit-source-id: 7da5213	2017-02-08 16:24:17 -08:00
Maysam Yabandeh	c4a37dcb44	Print the missed last layer in cfstats Summary: Printing compaction stats used to operate on two variable: number_levels_: for printing the layer num_levels_to_check: for updating the compaction score After this commit: `361010d447` these two are mixed up and as a result the last layer might not be printed out: https://fb.facebook.com/groups/rocksdb.internal/permalink/1315716625143616/ number_levels_ was used to decide which layers to print: `672300f47f/db/internal_stats.cc (L753)` but after the patch it is based on the return value of DumpCFMapStats `361010d447/db/internal_stats.cc (L929)` which returns num_levels_to_check: `361010d447/db/internal_stats.cc (L917)` Closes https://github.com/facebook/rocksdb/pull/1853 Differential Revision: D4529280 Pulled By: maysamyabandeh fbshipit-source-id: 3fd9448	2017-02-08 10:39:15 -08:00
Maysam Yabandeh	69d5262c81	Two-level Indexes Summary: Partition Index blocks and use a Partition-index as a 2nd level index. The two-level index can be used by setting BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and configuring BlockBasedTableOptions::index_per_partition t15539501 Closes https://github.com/facebook/rocksdb/pull/1814 Differential Revision: D4473535 Pulled By: maysamyabandeh fbshipit-source-id: bffb87e	2017-02-06 16:39:12 -08:00
Dmitri Smirnov	0a4cdde50a	Windows thread Summary: introduce new methods into a public threadpool interface, - allow submission of std::functions as they allow greater flexibility. - add Joining methods to the implementation to join scheduled and submitted jobs with an option to cancel jobs that did not start executing. - Remove ugly `#ifdefs` between pthread and std implementation, make it uniform. - introduce pimpl for a drop in replacement of the implementation - Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread. - Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation. - should be no functionality changes. Closes https://github.com/facebook/rocksdb/pull/1823 Differential Revision: D4492902 Pulled By: siying fbshipit-source-id: c74cb11	2017-02-06 14:54:18 -08:00
Vitaliy Liptchinsky	1aaa898cf1	Adding GetApproximateMemTableStats method Summary: Added method that returns approx num of entries as well as size for memtables. Closes https://github.com/facebook/rocksdb/pull/1841 Differential Revision: D4511990 Pulled By: VitaliyLi fbshipit-source-id: 9a4576e	2017-02-06 14:54:16 -08:00
Siying Dong	036d668b19	Fix wrong result in data race case related to Get() Summary: In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version. Closes https://github.com/facebook/rocksdb/pull/1816 Differential Revision: D4475958 Pulled By: siying fbshipit-source-id: bd9e67a	2017-02-03 11:39:15 -08:00
Islam AbdelRahman	574b543f80	Rename merger.h -> merging_iterator.h Summary: merger.h was always a confusing name for me, simply give the file a better name Closes https://github.com/facebook/rocksdb/pull/1836 Differential Revision: D4505357 Pulled By: IslamAbdelRahman fbshipit-source-id: 07b28d8	2017-02-02 16:54:19 -08:00
oranagra	b96372dead	improving the C wrapper Summary: - rocksdb_property_int (so that we don't have to parse strings) - and rocksdb_set_options (to allow controlling options via strings) - a few other missing options exposed - a documentation comment fix Closes https://github.com/facebook/rocksdb/pull/1793 Differential Revision: D4456569 Pulled By: yiwu-arbug fbshipit-source-id: 9f1fac1	2017-01-27 17:39:16 -08:00
Siying Dong	04c4ec41d1	Change corruption_test to use 4 bits. Summary: In the patch which LRU cache was made use dynamic shard bits, I changed to 2 shard bits to make the test happy. Look like it is occasionally still unhappy. Change it to 4 shard bits. Closes https://github.com/facebook/rocksdb/pull/1815 Differential Revision: D4475849 Pulled By: siying fbshipit-source-id: 575ff00	2017-01-27 11:24:16 -08:00
Siying Dong	2d75cd40d3	NewLRUCache() to pick number of shard bits based on capacity if not given Summary: If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity. Closes https://github.com/facebook/rocksdb/pull/1584 Differential Revision: D4242517 Pulled By: siying fbshipit-source-id: 86b0f18	2017-01-27 06:39:12 -08:00
Siying Dong	f25f1ec60b	Add test DBTest2.GetRaceFlush which can expose a data race bug Summary: A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now. Closes https://github.com/facebook/rocksdb/pull/1813 Differential Revision: D4472310 Pulled By: siying fbshipit-source-id: 5755ebd	2017-01-26 16:39:14 -08:00
sdong	5dad9d6d28	Avoid logs_ operation out of DB mutex Summary: logs_.back() is called out of DB mutex, which can cause data race. We move the access into the DB mutex protection area. Closes https://github.com/facebook/rocksdb/pull/1774 Reviewed By: AsyncDBConnMarkedDownDBException Differential Revision: D4417472 Pulled By: AsyncDBConnMarkedDownDBException fbshipit-source-id: 2da1f1e	2017-01-25 15:54:13 -08:00
Islam AbdelRahman	a7b13919bf	Fix CompactFiles() bug when used with CompactionFilter using SuperVersion Summary: GetAndRefSuperVersion() should not be called again in the same thread before ReturnAndCleanupSuperVersion() is called. If we have a compaction filter that is using DB::Get, This will happen ``` CompactFiles() { GetAndRefSuperVersion() // -- first call .. CompactionFilter() { GetAndRefSuperVersion() // -- second call ReturnAndCleanupSuperVersion() } .. ReturnAndCleanupSuperVersion() } ``` We solve this issue in the same way Iterator is solving it, but using GetReferencedSuperVersion() This was discovered in https://github.com/facebook/mysql-5.6/issues/427 by alxyang Closes https://github.com/facebook/rocksdb/pull/1803 Differential Revision: D4460155 Pulled By: IslamAbdelRahman fbshipit-source-id: 5e54322	2017-01-25 14:09:13 -08:00
Andrew Kryczka	616a1464ea	Fix DeleteRange including sentinels in output files Summary: when writing RangeDelAggregator::AddToBuilder, I forgot that there are sentinel tombstones in the middle of the interval map since gaps between real tombstones are represented with sentinels. blame: #1614 Closes https://github.com/facebook/rocksdb/pull/1804 Differential Revision: D4460426 Pulled By: ajkr fbshipit-source-id: 69444b5	2017-01-25 11:09:12 -08:00
Islam AbdelRahman	03ca2ac8a9	Remove function from DBImpl that are not used anywhere Summary: GetAndRefSuperVersionUnlocked ReturnAndCleanupSuperVersionUnlocked GetColumnFamilyHandleUnlocked Are dead code that are not used any where Closes https://github.com/facebook/rocksdb/pull/1802 Differential Revision: D4459948 Pulled By: IslamAbdelRahman fbshipit-source-id: 30fa89d	2017-01-24 19:24:13 -08:00
Andrew Kryczka	b0029bc7fa	Test merge op covered by range deletion in memtable Summary: It's a test case for #1797. Also got rid of kTypeDeletion in the conditional since we treat it the same as kTypeRangeDeletion. Closes https://github.com/facebook/rocksdb/pull/1800 Differential Revision: D4451300 Pulled By: ajkr fbshipit-source-id: b39dda1	2017-01-24 13:39:11 -08:00
Andrew Kryczka	d438e1ec17	Test range deletion block outlives table reader Summary: This test ensures RangeDelAggregator can still access blocks even if it outlives the table readers that created them (detailed description in comments). I plan to optimize away the extra cache lookup we currently do in BlockBasedTable::NewRangeTombstoneIterator(), as it is ~5% CPU in my random read benchmark in a database with 1k tombstones. This test will help make sure nothing breaks in the process. Closes https://github.com/facebook/rocksdb/pull/1739 Differential Revision: D4375954 Pulled By: ajkr fbshipit-source-id: aef9357	2017-01-24 13:24:14 -08:00
Andrew Kryczka	9da4d542fe	Range deletions unsupported in tailing iterator Summary: change the iterator status to NotSupported as soon as a range tombstone is encountered by a ForwardIterator. Closes https://github.com/facebook/rocksdb/pull/1593 Differential Revision: D4246294 Pulled By: ajkr fbshipit-source-id: aef9f49	2017-01-23 13:39:12 -08:00
Hyeonseok Oh	f2b4939da4	fixed typo Summary: I fixed exisit -> exist Closes https://github.com/facebook/rocksdb/pull/1799 Differential Revision: D4451466 Pulled By: yiwu-arbug fbshipit-source-id: b447c3a	2017-01-23 12:54:13 -08:00
yinqiwen	973f1b78fd	memtable: delete merge value for range deleteion Summary: Closes https://github.com/facebook/rocksdb/pull/1797 Differential Revision: D4448004 Pulled By: ajkr fbshipit-source-id: 3ffc27c	2017-01-23 12:24:14 -08:00
Vitaliy Liptchinsky	753ff84a3d	Fix get approx size Summary: Fixing GetApproximateSize bug for the case of computing stats for mem tables only. Closes https://github.com/facebook/rocksdb/pull/1795 Differential Revision: D4445507 Pulled By: IslamAbdelRahman fbshipit-source-id: 3905846	2017-01-20 15:54:12 -08:00
Jay Lee	537da370da	c: allow set savepoint to writebatch Summary: Allow set SavePoint to WriteBatch in C ABI. Closes https://github.com/facebook/rocksdb/pull/1698 Differential Revision: D4378556 Pulled By: yiwu-arbug fbshipit-source-id: afca746	2017-01-20 13:24:13 -08:00
Changli Gao	5ac97314e7	Fix std::out_of_range when DBOptions::keep_log_file_num is zero Summary: We should validate this option, otherwise we may see std::out_of_range thrown at: db/db_impl.cc:1124 1123 for (unsigned int i = 0; i <= end; i++) { 1124 std::string& to_delete = old_info_log_files.at(i); 1125 std::string full_path_to_delete = 1126 (immutable_db_options_.db_log_dir.empty() Closes https://github.com/facebook/rocksdb/pull/1722 Differential Revision: D4379495 Pulled By: yiwu-arbug fbshipit-source-id: e136552	2017-01-20 13:24:12 -08:00
Shu Zhang	3c0852d1da	Make ingest external file backward compatible Summary: Closes https://github.com/facebook/rocksdb/pull/1783 Differential Revision: D4443463 Pulled By: IslamAbdelRahman fbshipit-source-id: 39d21d6	2017-01-20 12:09:19 -08:00
Siying Dong	0e8dfd6062	Fix OptimizeForPointLookup() Summary: If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloomo filter instead. Closes https://github.com/facebook/rocksdb/pull/1791 Differential Revision: D4442836 Pulled By: siying fbshipit-source-id: bf6c9cd	2017-01-20 10:54:12 -08:00
Vitaliy Liptchinsky	e840213d6e	Change DB::GetApproximateSizes for more flexibility needed for MyRocks Summary: Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables. Closes https://github.com/facebook/rocksdb/pull/1787 Differential Revision: D4441111 Pulled By: IslamAbdelRahman fbshipit-source-id: c11f4c3	2017-01-20 09:39:11 -08:00
Yi Wu	9239103cd4	Flush job should release reference current version if sync log failed Summary: Fix the bug when sync log fail, FlushJob::Run() will not be execute and reference to cfd->current() will not be release. Closes https://github.com/facebook/rocksdb/pull/1792 Differential Revision: D4441316 Pulled By: yiwu-arbug fbshipit-source-id: 5523e28	2017-01-19 23:09:15 -08:00
Islam AbdelRahman	da54d36a96	Disable IngestExternalFile in ReadOnly mode Summary: Disable IngestExternalFile() in read only mode Closes https://github.com/facebook/rocksdb/pull/1781 Differential Revision: D4439179 Pulled By: IslamAbdelRahman fbshipit-source-id: b7e46e7	2017-01-19 15:54:19 -08:00
Reid Horuff	5cf176ca15	Fix for 2PC causing WAL to grow too large Summary: Consider the following single column family scenario: prepare in log A commit in log B WAL is too large, flush all CFs to releast log A CFA is on log B so we do not see CFA is depending on log A so no flush is requested To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on. Closes https://github.com/facebook/rocksdb/pull/1768 Differential Revision: D4403265 Pulled By: reidHoruff fbshipit-source-id: ce800ff	2017-01-19 15:39:12 -08:00
Andrew Kryczka	f9d18e22d2	Fix DeleteRange file boundary correctness issue with max_compaction_bytes Summary: Cockroachdb exposed this bug in #1778. The bug happens when a compaction's output files are ended due to exceeding max_compaction_bytes. In that case we weren't taking into account the next file's start key when deciding how far to extend the current file's max_key. This caused the non-overlapping key-range invariant to be violated. Note this was correctly handled for the usual case of cutting compaction output, which is file size exceeding max_output_file_size. I am not sure why these are two separate code paths, but we can consider refactoring it to prevent such errors in the future. Closes https://github.com/facebook/rocksdb/pull/1784 Differential Revision: D4430235 Pulled By: ajkr fbshipit-source-id: 80af748	2017-01-18 11:54:22 -08:00
Islam AbdelRahman	3ce091fd73	Add KEEP_DB env var option Summary: When debugging tests, it's useful to preserve the DB to investigate it and check the logs This will allow us to set KEEP_DB=1 to preserve the DB Closes https://github.com/facebook/rocksdb/pull/1759 Differential Revision: D4393826 Pulled By: IslamAbdelRahman fbshipit-source-id: 1bff689	2017-01-17 13:54:20 -08:00
Siying Dong	77b4806625	Fix 2PC with concurrent memtable insert Summary: If concurrent memtable insert is enabled, and one prepare command and a normal command are grouped into a commit group, the sequence ID will be calculated incorrectly. Closes https://github.com/facebook/rocksdb/pull/1730 Differential Revision: D4371081 Pulled By: siying fbshipit-source-id: cd40c6d	2017-01-17 11:24:28 -08:00
Mike Kolupaev	d18dd2c41f	Abort compactions more reliably when closing DB Summary: DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper). There's also some minor code cleanup along the way. Closes https://github.com/facebook/rocksdb/pull/1639 Differential Revision: D4306571 Pulled By: yiwu-arbug fbshipit-source-id: f050890	2017-01-11 15:09:21 -08:00
Changli Gao	9f246298e2	Performance: Iterate vector by reference Summary: Closes https://github.com/facebook/rocksdb/pull/1763 Differential Revision: D4398796 Pulled By: yiwu-arbug fbshipit-source-id: b82636d	2017-01-11 10:54:37 -08:00
Dmitri Smirnov	3c233ca4ea	Fix Windows environment issues Summary: Enable directIO on WritableFileImpl::Append with offset being current length of the file. Enable UniqueID tests on Windows, disable others but leeting them to compile. Unique tests are valuable to detect failures on different filesystems and upcoming ReFS. Clear output in WinEnv Getchildren.This is different from previous strategy, do not touch output on failure. Make sure DBTest.OpenWhenOpen works with windows error message Closes https://github.com/facebook/rocksdb/pull/1746 Differential Revision: D4385681 Pulled By: IslamAbdelRahman fbshipit-source-id: c07b702	2017-01-09 15:54:12 -08:00
Maysam Yabandeh	d0ba8ec8f9	Revert "PinnableSlice" Summary: This reverts commit `54d94e9c2c`. The pull request was landed by mistake. Closes https://github.com/facebook/rocksdb/pull/1755 Differential Revision: D4391678 Pulled By: maysamyabandeh fbshipit-source-id: 36d5149	2017-01-08 14:24:12 -08:00
Maysam Yabandeh	54d94e9c2c	PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: 1. value 100 byte: 1.8% regular, 1.2% merge values 2. value 1k byte: 11.5% regular, 7.5% merge values 3. value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled Closes https://github.com/facebook/rocksdb/pull/1732 Differential Revision: D4374613 Pulled By: maysamyabandeh fbshipit-source-id: a077f1a	2017-01-08 13:54:13 -08:00
Andrew Kryczka	b104b87814	Maintain position in range deletions map Summary: When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key. - When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key. - Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map. - When the iterator seeks, we use kBinarySearch for repositioning - After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used. - Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete(). Closes https://github.com/facebook/rocksdb/pull/1701 Differential Revision: D4350318 Pulled By: ajkr fbshipit-source-id: 5129b76	2017-01-05 10:39:12 -08:00
siddontang	653ac1f9c6	C API: support total_order_mode Summary: Closes https://github.com/facebook/rocksdb/pull/1687 Differential Revision: D4349210 Pulled By: IslamAbdelRahman fbshipit-source-id: 32d0fbd	2017-01-03 18:39:14 -08:00
Adam Retter	85ac1a320a	Fix rocksdb::Status::getState Summary: This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState. Closes https://github.com/facebook/rocksdb/issues/1688 Closes https://github.com/facebook/rocksdb/pull/1714 Differential Revision: D4364181 Pulled By: yiwu-arbug fbshipit-source-id: 8e073b4	2017-01-03 18:39:14 -08:00
Islam AbdelRahman	76711b6e77	Make ExternalSSTFileTest::CompactionDeadlock more deterministic Summary: It's not always true that `ASSERT_EQ(running_threads.load(), 2);` Closes https://github.com/facebook/rocksdb/pull/1736 Differential Revision: D4374091 Pulled By: IslamAbdelRahman fbshipit-source-id: 4f70bbd	2017-01-03 18:09:20 -08:00
Islam AbdelRahman	c963460dbc	Fix tests under GCC_481 Summary: This fix the issue with tests failing under GCC 481, I am not sure what is the exact reason Closes https://github.com/facebook/rocksdb/pull/1735 Differential Revision: D4374094 Pulled By: IslamAbdelRahman fbshipit-source-id: b3625bc	2017-01-03 17:54:12 -08:00
Vincent Lee	e425ec1162	utilities/backupable: backup should limit the copy size of wal. Summary: Since the backup work as snapshot, we should only copy the bytes of the wal while we get the alive files. Closes https://github.com/facebook/rocksdb/pull/1733 Differential Revision: D4373457 Pulled By: ajkr fbshipit-source-id: 389318f	2016-12-31 10:54:20 -08:00
Maysam Yabandeh	0712d541d1	Delegate Cleanables Summary: Cleanable objects will perform the registered cleanups when they are destructed. We however rather to delay this cleaning like when we are gathering the merge operands. Current approach is to create the Cleanable object on heap (instead of on stack) and delay deleting it. By allowing Cleanables to delegate their cleanups to another cleanable object we can delay the cleaning without however the need to craete the cleanable object on heap and keeping it around. This patch applies this technique for the cleanups of BlockIter and shows improved performance for some in-memory benchmarks: +1.8% for merge worklaod, +6.4% for non-merge workload when the merge operator is specified. https://our.intern.facebook.com/intern/tasks?t=15168163 Non-merge benchmark: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none Reading random with no merge operator specified: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="read Closes https://github.com/facebook/rocksdb/pull/1711 Differential Revision: D4361163 Pulled By: maysamyabandeh fbshipit-source-id: 9801e07	2016-12-29 15:54:19 -08:00
Islam AbdelRahman	d58ef52ba6	Allow SstFileWriter to Fadvise the file away from page cache Summary: Add `fadvise_trigger` option to `SstFileWriter` If fadvise_trigger is passed with a non-zero value, SstFileWriter will invalidate the os page cache every `fadvise_trigger` bytes for the sst file Closes https://github.com/facebook/rocksdb/pull/1731 Differential Revision: D4371246 Pulled By: IslamAbdelRahman fbshipit-source-id: 91caff1	2016-12-29 15:09:19 -08:00
Siying Dong	17a4b75cc3	Always fsync the file after file copying Summary: File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large. Closes https://github.com/facebook/rocksdb/pull/1728 Differential Revision: D4371083 Pulled By: siying fbshipit-source-id: 579e14c	2016-12-28 19:09:16 -08:00
leipeng	a738af8f84	db/pinned_iterators_manager.h: bugfix Summary: std::unique(beg, end) returns an iterator of unique_end, data behind unique_end should not be accessed. Closes https://github.com/facebook/rocksdb/pull/1726 Differential Revision: D4371076 Pulled By: IslamAbdelRahman fbshipit-source-id: 5564450	2016-12-28 18:54:57 -08:00
Siying Dong	438f22bc56	Fix bug of Checkpoint loses recent transactions with 2PC Summary: If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files. Closes https://github.com/facebook/rocksdb/pull/1724 Differential Revision: D4368319 Pulled By: siying fbshipit-source-id: cc2c746	2016-12-28 12:24:16 -08:00
Aaron Gao	972f96b3fb	direct io write support Summary: rocksdb direct io support ``` [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 5.0 Date: Wed Nov 23 13:17:43 2016 CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz CPUCache: 25600 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 4.393 micros/op 227639 ops/sec; 25.2 MB/s [gzh@dev11575.prn2 ~/roc Closes https://github.com/facebook/rocksdb/pull/1564 Differential Revision: D4241093 Pulled By: lightmark fbshipit-source-id: 98c29e3	2016-12-22 13:09:19 -08:00
Islam AbdelRahman	989e644ed8	Remove sst_file_manager option from LITE Summary: Remove sst_file_manager option from LITE Closes https://github.com/facebook/rocksdb/pull/1690 Differential Revision: D4341331 Pulled By: IslamAbdelRahman fbshipit-source-id: 9f9328d	2016-12-21 17:54:21 -08:00
Islam AbdelRahman	1beef6569a	Fix c_test Summary: addfile phase in c_test could fail because in previous steps we did a DeleteRange. Fix the test by simply moving the addfile phase before DeleteRange Closes https://github.com/facebook/rocksdb/pull/1672 Differential Revision: D4328896 Pulled By: IslamAbdelRahman fbshipit-source-id: 1d946df	2016-12-21 17:39:14 -08:00
Andrew Kryczka	50e305de98	Collapse range deletions Summary: Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach. Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map. Closes https://github.com/facebook/rocksdb/pull/1614 Differential Revision: D4270397 Pulled By: ajkr fbshipit-source-id: 93092c7	2016-12-19 16:54:12 -08:00
Yi Wu	5d1457dbbf	Dump persistent cache options Summary: Dump persistent cache options Closes https://github.com/facebook/rocksdb/pull/1679 Differential Revision: D4337019 Pulled By: yiwu-arbug fbshipit-source-id: 3812f8a	2016-12-19 14:09:12 -08:00
Daniel Black	342370f1d3	Simplify MemTable::Update Summary: As suggested by testn in #1650 The Add is at the end of the function. Having a fallthough will result in it being added twice. Closes https://github.com/facebook/rocksdb/pull/1676 Differential Revision: D4331906 Pulled By: yiwu-arbug fbshipit-source-id: 895c4a0	2016-12-17 00:09:13 -08:00
Ding Ma	1a136c1f13	Expose file size Summary: add a new function to SstFileWriter that will tell the user how big is there file right now. Closes https://github.com/facebook/rocksdb/pull/1686 Differential Revision: D4338868 Pulled By: mdyuki1016 fbshipit-source-id: c1ee16a	2016-12-16 18:39:12 -08:00
Andrew Kryczka	fbff4628a9	Reduce compaction iterator status checks Summary: seems it's expensive to check status since the underlying merge iterator checks status of all its children. so only do it when it's really necessary to get the status before invoking Next(), i.e., when we're advancing to get the first key in the next file. Closes https://github.com/facebook/rocksdb/pull/1691 Differential Revision: D4343446 Pulled By: siying fbshipit-source-id: 70ab315	2016-12-16 17:39:09 -08:00
Daniel Black	816c1e30ca	gcc-7 requires include <functional> for std::function Summary: Fixes compile error: In file included from ./util/statistics.h:17:0, from ./util/stop_watch.h:8, from ./util/perf_step_timer.h:9, from ./util/iostats_context_imp.h:8, from ./util/posix_logger.h:27, from ./port/util_logger.h:18, from ./db/auto_roll_logger.h:15, from db/auto_roll_logger.cc:6: ./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type typedef std::function<void(void, void)> FoldFunc; Closes https://github.com/facebook/rocksdb/pull/1656 Differential Revision: D4318702 Pulled By: yiwu-arbug fbshipit-source-id: 8c5d17a	2016-12-16 11:24:18 -08:00
Yi Wu	c270735861	Iterator should be in corrupted status if merge operator return false Summary: Iterator should be in corrupted status if merge operator return false. Also add test to make sure if max_successive_merges is hit during write, data will not be lost. Closes https://github.com/facebook/rocksdb/pull/1665 Differential Revision: D4322695 Pulled By: yiwu-arbug fbshipit-source-id: b327b05	2016-12-16 11:09:16 -08:00
siddontang	8f5d24ae68	C API: support get usage and pinned_usage for cache Summary: Closes https://github.com/facebook/rocksdb/pull/1671 Differential Revision: D4327453 Pulled By: yiwu-arbug fbshipit-source-id: bcdbc65	2016-12-15 17:24:17 -08:00
Daniel Black	cfc34d7c4e	Missing break in case in DBTestBase::CurrentOptions Summary: Found by gcc-7 compile error. This appeared to be a fault as these options seems too different. Closes https://github.com/facebook/rocksdb/pull/1667 Differential Revision: D4324174 Pulled By: yiwu-arbug fbshipit-source-id: 0f65383	2016-12-13 18:39:14 -08:00
Daniel Black	bfbcec2339	Gcc 7 error expansion to defined Summary: sorry if these gcc-7/clang-4 cleanups are getting tedious. Closes https://github.com/facebook/rocksdb/pull/1658 Differential Revision: D4318792 Pulled By: yiwu-arbug fbshipit-source-id: 8e85891	2016-12-13 18:39:14 -08:00
Daniel Black	67adc937b6	intentional fallthough (prevents gcc-7/clang-4 error) Summary: db/memtable.cc: In member function 'void rocksdb::MemTable::Update(rocksdb::SequenceNumber, const rocksdb::Slice&, const rocksdb::Slice&)': db/memtable.cc:736:11: error: this statement may fall through [-Werror=implicit-fallthrough=] } ^ db/memtable.cc:738:9: note: here default: ^~~~~~~ cc1plus: all warnings being treated as errors closes #1650 Closes https://github.com/facebook/rocksdb/pull/1655 Differential Revision: D4318696 Pulled By: yiwu-arbug fbshipit-source-id: 1a8981c	2016-12-13 14:39:17 -08:00
Islam AbdelRahman	1a146f89c7	break Flush wait for dropped CF Summary: In FlushJob we dont do the Flush if the CF is dropped https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188 but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped. Closes https://github.com/facebook/rocksdb/pull/1664 Differential Revision: D4321032 Pulled By: IslamAbdelRahman fbshipit-source-id: 6e2b25d	2016-12-13 14:09:12 -08:00
Yi Wu	36d42e65d0	Disable test to unblock travis build Summary: The two tests keep failing in travis. Disable them and will fix later. Closes https://github.com/facebook/rocksdb/pull/1648 Differential Revision: D4316389 Pulled By: yiwu-arbug fbshipit-source-id: 0a370e7	2016-12-13 11:54:14 -08:00
siddontang	b57dd9262a	C API: support writebatch delete range Summary: Seem that writebatch delete range can work now, so I add C API for later use. Btw, can we use this feature in production now? Closes https://github.com/facebook/rocksdb/pull/1647 Differential Revision: D4314534 Pulled By: ajkr fbshipit-source-id: e835165	2016-12-13 11:24:18 -08:00
Islam AbdelRahman	2ba59b5a1e	Disallow ingesting files into dropped CFs Summary: This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF. Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF Closes https://github.com/facebook/rocksdb/pull/1657 Differential Revision: D4318657 Pulled By: IslamAbdelRahman fbshipit-source-id: ed6ea2b	2016-12-13 00:54:14 -08:00
Jonathan Lee	2cabdb8f44	Increase buffer size Summary: When compiling with GCC>=7.0.0, "db/internal_stats.cc" fails to compile as the data being written to the buffer potentially exceeds its size. This fix simply doubles the size of the buffer, thus accommodating the max possible data size. Closes https://github.com/facebook/rocksdb/pull/1635 Differential Revision: D4302162 Pulled By: yiwu-arbug fbshipit-source-id: c76ad59	2016-12-09 11:54:22 -08:00
Jonathan Lee	4a17b47bb5	Remove unnecessary header include Summary: Remove "util/testharness.h" from list of includes for "db/db_filesnapshot.cc", as it wasn't being used and thus caused an extraneous dependency on gtest. Closes https://github.com/facebook/rocksdb/pull/1634 Differential Revision: D4302146 Pulled By: yiwu-arbug fbshipit-source-id: e900c0b	2016-12-09 11:54:21 -08:00
Mike Kolupaev	8c2b921fdf	Fixed a crash in debug build in flush_job.cc Summary: It was doing `&range_del_iters[0]` on an empty vector. Even though the resulting pointer is never dereferenced, it's still bad for two reasons: * the practical reason: it crashes with `std::out_of_range` exception in our debug build, * the "C++ standard lawyer" reason: it's undefined behavior because, in `std::vector` implementation, it probably "dereferences" a null pointer, which is invalid even though it doesn't actually read the pointed memory, just converts a pointer into a reference (and then flush_job.cc converts it back to pointer); nullptr references are undefined behavior. Closes https://github.com/facebook/rocksdb/pull/1612 Differential Revision: D4265625 Pulled By: al13n321 fbshipit-source-id: db26fb9	2016-12-09 10:39:12 -08:00
Islam AbdelRahman	20ce081fae	Fix issue where IngestExternalFile insert blocks in block cache with g_seqno=0 Summary: When we Ingest an external file we open it to read some metadata and first/last key during doing that we insert blocks into the block cache with global_seqno = 0 If we move the file (did not copy it) into the DB, we will use these blocks with the wrong seqno in the read path Closes https://github.com/facebook/rocksdb/pull/1627 Differential Revision: D4293332 Pulled By: yiwu-arbug fbshipit-source-id: 3ce5523	2016-12-08 13:39:18 -08:00
zhangjinpeng1987	45c7ce1377	CompactRangeOptions C API Summary: Add C API for CompactRangeOptions. Closes https://github.com/facebook/rocksdb/pull/1596 Differential Revision: D4252339 Pulled By: yiwu-arbug fbshipit-source-id: f768f93	2016-12-07 17:54:14 -08:00
Andrew Kryczka	b821984d31	DeleteRange read path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1592 Differential Revision: D4246260 Pulled By: ajkr fbshipit-source-id: ce03fa2	2016-12-07 12:54:17 -08:00
Artemiy Kolesnikov	2f4fc539c6	Compaction::IsTrivialMove relaxing Summary: IsTrivialMove returns true if no input file overlaps with output_level+1 with more than max_compaction_bytes_ bytes. Closes https://github.com/facebook/rocksdb/pull/1619 Differential Revision: D4278338 Pulled By: yiwu-arbug fbshipit-source-id: 994c001	2016-12-07 11:54:11 -08:00
Islam AbdelRahman	ed8fbdb560	Add EventListener::OnExternalFileIngested() event Summary: Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events Closes https://github.com/facebook/rocksdb/pull/1623 Differential Revision: D4285844 Pulled By: IslamAbdelRahman fbshipit-source-id: 0b95a88	2016-12-06 14:09:17 -08:00
Mike Kolupaev	beb36d9c1e	Fixed CompactionFilter::Decision::kRemoveAndSkipUntil Summary: Embarassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure: db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't. Closes https://github.com/facebook/rocksdb/pull/1611 Differential Revision: D4265622 Pulled By: al13n321 fbshipit-source-id: 747b10f	2016-12-05 15:24:11 -08:00
Islam AbdelRahman	67f37cf198	Allow user to specify a CF for SST files generated by SstFileWriter Summary: Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF. This allow us to persist the CF id in the generated file Closes https://github.com/facebook/rocksdb/pull/1615 Differential Revision: D4270422 Pulled By: IslamAbdelRahman fbshipit-source-id: 7fb954e	2016-12-05 14:24:16 -08:00
Anton Safonov	9053fe2a5c	Made delete_obsolete_files_period_micros option dynamic Summary: Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions(). Closes https://github.com/facebook/rocksdb/pull/1595 Differential Revision: D4246569 Pulled By: tonek fbshipit-source-id: d23f560	2016-12-05 14:24:16 -08:00
Islam AbdelRahman	edde954e7b	fix clang build Summary: override is missing for FilterV2 Closes https://github.com/facebook/rocksdb/pull/1606 Differential Revision: D4263832 Pulled By: IslamAbdelRahman fbshipit-source-id: d8b337a	2016-12-01 18:39:10 -08:00
Islam AbdelRahman	e39d080871	Fix travis (compile for clang < 3.9) Summary: Travis fail because it uses clang 3.6 which don't recognize `__attribute__((__no_sanitize__("undefined")))` Closes https://github.com/facebook/rocksdb/pull/1601 Differential Revision: D4257175 Pulled By: IslamAbdelRahman fbshipit-source-id: fb4d1ab	2016-12-01 10:09:22 -08:00
fangchenliaohui	b77007df8b	Bug: paralle_group status updated in WriteThread::CompleteParallelWorker Summary: Multi-write thread may update the status of the parallel_group in WriteThread::CompleteParallelWorker if the status of Writer is not ok! When copy write status to the paralle_group, the write thread just hold the mutex of the the writer processed by itself. it is useless. The thread should held the the leader of the parallel_group instead. Closes https://github.com/facebook/rocksdb/pull/1598 Differential Revision: D4252335 Pulled By: siying fbshipit-source-id: 3864cf7	2016-12-01 09:54:11 -08:00
Mike Kolupaev	247d0979aa	Support for range skips in compaction filter Summary: This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API. Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :) The diff is not as big as it may seem, about half of the lines are a test. Closes https://github.com/facebook/rocksdb/pull/1599 Differential Revision: D4252092 Pulled By: al13n321 fbshipit-source-id: 41e1e48	2016-12-01 07:09:15 -08:00
Panagiotis Ktistakis	96fcefbf1d	c api: expose option for dynamic level size target Summary: Closes https://github.com/facebook/rocksdb/pull/1587 Differential Revision: D4245923 Pulled By: yiwu-arbug fbshipit-source-id: 6ee7291	2016-11-30 11:24:14 -08:00
zhangjinpeng1987	00197cff39	Add C API to set base_backgroud_compactions Summary: Add C API to set base_backgroud_compactions Closes https://github.com/facebook/rocksdb/pull/1571 Differential Revision: D4245709 Pulled By: yiwu-arbug fbshipit-source-id: 792c6b8	2016-11-30 11:09:13 -08:00
Andrew Kryczka	5b219eccb5	deleterange end-to-end test improvements for lite/robustness Summary: Closes https://github.com/facebook/rocksdb/pull/1591 Differential Revision: D4246019 Pulled By: ajkr fbshipit-source-id: 0c4aa37	2016-11-29 12:24:13 -08:00
Andrew Kryczka	e333528991	DeleteRange write path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1578 Differential Revision: D4241171 Pulled By: ajkr fbshipit-source-id: ce5fd83	2016-11-29 11:09:22 -08:00
Siying Dong	7784980fcd	Fix mis-reporting of compaction read bytes to the base level Summary: In dynamic leveled compaction, when calculating read bytes, output level bytes may be wronglyl calculated as input level inputs. Fix it. Closes https://github.com/facebook/rocksdb/pull/1475 Differential Revision: D4148412 Pulled By: siying fbshipit-source-id: f2f475a	2016-11-29 11:09:22 -08:00
Islam AbdelRahman	3c6b49ed66	Fix implicit conversion between int64_t to int Summary: Make conversion explicit, implicit conversion breaks the build Closes https://github.com/facebook/rocksdb/pull/1589 Differential Revision: D4245158 Pulled By: IslamAbdelRahman fbshipit-source-id: aaec00d	2016-11-29 10:54:15 -08:00
Siying Dong	b3b875657f	Remove unused assignment in db/db_iter.cc Summary: "make analyze" complains the assignment is not useful. Remove it. Closes https://github.com/facebook/rocksdb/pull/1581 Differential Revision: D4241697 Pulled By: siying fbshipit-source-id: 178f67a	2016-11-29 09:09:14 -08:00
Andrew Kryczka	4f6e89b1d0	Fix range deletion covering key in same SST file Summary: AddTombstones() needs to be before t->Get(), oops :'( Closes https://github.com/facebook/rocksdb/pull/1576 Differential Revision: D4241041 Pulled By: ajkr fbshipit-source-id: 781ceea	2016-11-28 22:54:13 -08:00
Islam AbdelRahman	a2bf265a39	Avoid intentional overflow in GetL0ThresholdSpeedupCompaction Summary: `99c052a34f` fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking if int become -ve. UBSAN will complain about that since this is still an overflow, we can fix the issue by simply using int64_t Closes https://github.com/facebook/rocksdb/pull/1582 Differential Revision: D4241525 Pulled By: IslamAbdelRahman fbshipit-source-id: b3ae21f	2016-11-28 18:39:13 -08:00
Islam AbdelRahman	52fd1ff2c2	disable UBSAN for functions with intentional -ve shift / overflow Summary: disable UBSAN for functions with intentional left shift on -ve number / overflow These functions are rocksdb:: Hash FixedLengthColBufEncoder::Append FaultInjectionTest:: Key Closes https://github.com/facebook/rocksdb/pull/1577 Differential Revision: D4240801 Pulled By: IslamAbdelRahman fbshipit-source-id: 3e1caf6	2016-11-28 17:54:12 -08:00
Islam AbdelRahman	1886c435b9	Fix CompactionJob::Install division by zero Summary: Fix CompactionJob::Install division by zero Closes https://github.com/facebook/rocksdb/pull/1580 Differential Revision: D4240794 Pulled By: IslamAbdelRahman fbshipit-source-id: 7286721	2016-11-28 16:54:16 -08:00
Islam AbdelRahman	13e66a8f51	Fix compaction_job.cc division by zero Summary: Fix division by zero in compaction_job.cc Closes https://github.com/facebook/rocksdb/pull/1575 Differential Revision: D4240818 Pulled By: IslamAbdelRahman fbshipit-source-id: a8bc757	2016-11-28 16:39:13 -08:00
Andrew Kryczka	01eabf7375	Fix double-counted deletion stat Summary: Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted. Closes https://github.com/facebook/rocksdb/pull/1574 Differential Revision: D4241181 Pulled By: ajkr fbshipit-source-id: c9aaa15	2016-11-28 15:54:12 -08:00
Andrew Kryczka	7ffb10fc1a	DeleteRange compaction statistics Summary: - "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them - "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them - s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction - Move the above class into a separate .h file to avoid circular dependency. Closes https://github.com/facebook/rocksdb/pull/1520 Differential Revision: D4187179 Pulled By: ajkr fbshipit-source-id: 10c2103	2016-11-28 11:54:12 -08:00
Mike Kolupaev	236d4c67e9	Less linear search in DBIter::Seek() when keys are overwritten a lot Summary: In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack: ``` rocksdb::MergingIterator::Next rocksdb::DBIter::FindNextUserEntryInternal rocksdb::DBIter::Seek ``` I think what's happening is: 1. we create a snapshot iterator, 2. we do lots of Put()s for the same key x; this creates lots of entries in memtable, 3. we seek the iterator to a key slightly smaller than x, 4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers. CC IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/1413 Differential Revision: D4083879 Pulled By: IslamAbdelRahman fbshipit-source-id: a83ddae	2016-11-28 10:24:11 -08:00
Siying Dong	cd7c4143d7	Improve Write Stalling System Summary: Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before. Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system. Closes https://github.com/facebook/rocksdb/pull/1562 Differential Revision: D4218519 Pulled By: siying fbshipit-source-id: 95e4088	2016-11-23 09:24:15 -08:00
Yi Wu	dfb6fe6755	Unified InlineSkipList::Insert algorithm with hinting Summary: This PR is based on nbronson's diff with small modifications to wire it up with existing interface. Comparing to previous version, this approach works better for inserting keys in decreasing order or updating the same key, and impose less restriction to the prefix extractor. ---- Summary from original diff ---- This diff introduces a single InlineSkipList::Insert that unifies the existing sequential insert optimization (prev_), concurrent insertion, and insertion using externally-managed insertion point hints. There's a deep symmetry between insertion hints (cursors) and the concurrent algorithm. In both cases we have partial information from the recent past that is likely but not certain to be accurate. This diff introduces the struct InlineSkipList::Splice, which encodes predecessor and successor information in the same form that was previously only used within a single call to InsertConcurrently. Splice holds information about an insertion point that can be used to levera Closes https://github.com/facebook/rocksdb/pull/1561 Differential Revision: D4217283 Pulled By: yiwu-arbug fbshipit-source-id: 33ee437	2016-11-22 14:09:13 -08:00

1 2 3 4 5 ...

2787 commits