rocksdb

Commit Graph

Author	SHA1	Message	Date
Peter Dillinger	a680a7ea37	Un-revert #7049 , revert #7022 (#7071 ) Summary: Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071 Reviewed By: akankshamahajan15 Differential Revision: D22362857 Pulled By: pdillinger fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0	2020-07-02 13:30:41 -07:00
Peter Dillinger	52d59e0c93	Revert "Whole DBTest to skip fsync (#7049 )" (#7070 ) Summary: This reverts commit `4f1534bdb0`. This commit caused failures and deadlocks in MultiThreadedDBTest.MultiThreaded/69 and others. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7070 Reviewed By: riversand963 Differential Revision: D22358778 Pulled By: pdillinger fbshipit-source-id: faf8f2cb469a7063a113921c8e9c64a9f7610dac	2020-07-02 10:22:43 -07:00
sdong	4f1534bdb0	Whole DBTest to skip fsync (#7049 ) Summary: After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049 Test Plan: Run all existing files. Reviewed By: pdillinger Differential Revision: D22301700 fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565	2020-07-01 19:37:56 -07:00
Akanksha Mahajan	5edfe3a3d8	Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022 ) Summary: When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block. Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022 Test Plan: 1. make check -j64 2. Added one unit test case Reviewed By: ajkr Differential Revision: D22197734 Pulled By: akankshamahajan15 fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e	2020-07-01 14:58:08 -07:00
sdong	d64cf0e4ee	Move away from direct TmpDir() call in some tests (#7030 ) Summary: Some tests directly uses TmpDir() as temporary directory without adding any randomize factor. This would cause failures when tests run in parallel. Fix it by moving some of them to test::PerThreadDBPath() Pull Request resolved: https://github.com/facebook/rocksdb/pull/7030 Test Plan: Watch existing tests pass Reviewed By: zhichao-cao Differential Revision: D22224710 fbshipit-source-id: 28c9932fede0a4a64670e5b5fdb08f4fb5dccdd0	2020-06-25 12:09:57 -07:00
sdong	d6b7b7712f	Fix a bug that causes iterator to return wrong result in a rare data race (#6973 ) Summary: The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong. Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973 Test Plan: Will run stress tests for a while to make sure there is no general regression. Reviewed By: ajkr Differential Revision: D22029348 fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1	2020-06-18 10:16:38 -07:00
sdong	223b57eeb8	Fix the bug that compressed cache is disabled in read-only DBs (#6990 ) Summary: Compressed block cache is disabled in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990 Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache. Reviewed By: ltamasi Differential Revision: D22072755 fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1	2020-06-17 14:30:54 -07:00
Zhen Li	9c24a5cb4d	Fix persistent cache on windows (#6932 ) Summary: Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932 Test Plan: 1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed. 2. manually run db_bench.exe with "read_cache_path" and verified. Reviewed By: riversand963 Differential Revision: D21911608 Pulled By: cheng-chang fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2	2020-06-13 13:28:31 -07:00
Yanqin Jin	717749f4c0	Fail point-in-time WAL recovery upon IOError reading WAL (#6963 ) Summary: If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying WAL once hitting an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last sync'ed offset. Ideally, RocksDB point-in-time recovery should permit recovery if the error occurs after last synced offset while fail recovery if error occurs before the last synced offset. However, RocksDB does not track the synced offset of WALs. Consequently, RocksDB does not know whether an error occurs before or after the last synced offset. An error can be one of the following. - WAL record checksum mismatch. This can result from both corruption of synced data and dropping of unsynced data during shutdown. We cannot be sure which one. In order not to defeat the original motivation to permit the latter case, we keep the original behavior of point-in-time WAL recovery. - IOError. This means the WAL can be bad, an indicator of whole file becoming unavailable, not to mention synced part of the WAL. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery. Test plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963 Reviewed By: ajkr Differential Revision: D22011083 Pulled By: riversand963 fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536	2020-06-11 18:42:10 -07:00
Levi Tamasi	722ebba834	Turn DBTest2.CompressionFailures into a parameterized test (#6968 ) Summary: `DBTest2.CompressionFailures` currently tests many configurations sequentially using nested loops, which often leads to timeouts in our test system. The patch turns it into a parameterized test instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6968 Test Plan: `make check` Reviewed By: siying Differential Revision: D22006954 Pulled By: ltamasi fbshipit-source-id: f71f2f7108086b7651ecfce3d79a7fab24620b2c	2020-06-11 16:35:39 -07:00
Zitan Chen	23e446a157	Disable OpenForReadOnly tests in the LITE mode (#6947 ) Summary: Disable two OpenForReadOnly tests in the LITE mode Pull Request resolved: https://github.com/facebook/rocksdb/pull/6947 Test Plan: passed db_test2 Reviewed By: cheng-chang Differential Revision: D21914345 Pulled By: gg814 fbshipit-source-id: 58e81baf5d8cf8adcedaef3966aa3a427bbdf7c2	2020-06-05 17:28:24 -07:00
Zitan Chen	02df00d97b	API change: DB::OpenForReadOnly will not write to the file system unless create_if_missing is true (#6900 ) Summary: DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true. This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly. Two tests for this updated behavior of DB::OpenForReadOnly are also added. Other minor changes: 1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly; 2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing]; 3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900 Test Plan: passed make check; also manually tested a few ldb subcommands Reviewed By: pdillinger Differential Revision: D21822188 Pulled By: gg814 fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e	2020-06-03 18:57:49 -07:00
sdong	afa3518839	Revert "Update googletest from 1.8.1 to 1.10.0 (#6808 )" (#6923 ) Summary: This reverts commit `8d87e9cea1`. Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923 Reviewed By: pdillinger Differential Revision: D21864799 fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc	2020-06-03 15:55:03 -07:00
Adam Retter	8d87e9cea1	Update googletest from 1.8.1 to 1.10.0 (#6808 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808 Reviewed By: anand1976 Differential Revision: D21483984 Pulled By: pdillinger fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598	2020-06-01 20:33:42 -07:00
Yanqin Jin	b11a8b1b9a	Fix valgrind error by init memory region (#6842 ) Summary: As title. After allocating a memory buffer, initialize its content to 0s. This fixes valgrind issue introduced in https://github.com/facebook/rocksdb/issues/6709. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6842 Test Plan: ``` $make valgrind_test $valgrind --tool=memcheck --track-origins=yes ./db_test2 --gtest_filter=DBTest2.CompressionFailures ``` Reviewed By: pdillinger Differential Revision: D21551848 Pulled By: riversand963 fbshipit-source-id: e87a6f413e3f3d92d8e23d8ecc4cf93479c6674c	2020-05-14 18:50:03 -07:00
Ziyue Yang	c384c08a4f	Add tests for compression failure in BlockBasedTableBuilder (#6709 ) Summary: Currently there is no check for whether BlockBasedTableBuilder will expose compression error status if compression fails during the table building. This commit adds fake faulting compressors and a unit test to test such cases. This check finds 5 bugs, and this commit also fixes them: 1. Not handling compression failure well in BlockBasedTableBuilder::BGWorkWriteRawBlock. 2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD. 3. Wrongly passing the same reference of block contents to BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression. 4. Wrongly setting block_rep->first_key_in_next_block to nullptr in BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data blocks. 5. Not maintaining variables for compression ratio estimation and first_block in BlockBasedTableBuilder::EnterUnbuffered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709 Reviewed By: ajkr Differential Revision: D21236254 fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46	2020-05-12 09:27:35 -07:00
sdong	680c416348	Avoid Swallowing Some File Consistency Checking Bugs (#6793 ) Summary: We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump. More places are not fixed and need follow-up. Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793 Test Plan: Add a unit test to cover the reopen case. Reviewed By: riversand963 Differential Revision: D21366525 fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded	2020-05-04 14:18:11 -07:00
Ziyue Yang	03a781a90c	Add pipelined & parallel compression optimization (#6262 ) Summary: This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262 Reviewed By: ajkr Differential Revision: D20651306 fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb	2020-04-01 16:40:18 -07:00
Cheng Chang	afb97094ae	Skip high levels with no key falling in the range in CompactRange (#6482 ) Summary: In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped. This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482 Test Plan: A new test ManualCompactionTest::SkipLevel is added. Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index. Also updated db_compaction_test and db_test to use correct range for full compaction. Differential Revision: D20251149 Pulled By: cheng-chang fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b	2020-03-04 20:15:25 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
sdong	800d24ddc5	Fix DBTest2.ChangePrefixExtractor LITE build (#6356 ) Summary: DBTest2.ChangePrefixExtractor fails in LITE build because LITE build doesn't support adaptive build. Fix it by removing the stats check but only check correctness. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6356 Test Plan: Run the test with both of LITE and non-LITE build. Differential Revision: D19669537 fbshipit-source-id: 6d7dd6c8a79f18e80ca1636864b9c71922030d8e	2020-01-31 15:44:14 -08:00
sdong	ec496347bc	Add a unit test for prefix extractor changes (#6323 ) Summary: Add a unit test for prefix extractor change, including a check that fails due to a bug. Also comment out the partitioned filter case which will fail the test too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6323 Test Plan: Run the test and it passes (and fails if the SeekForPrev() part is uncommented) Differential Revision: D19509744 fbshipit-source-id: 678202ca97b5503e9de73b54b90de9e5ba822b72	2020-01-31 11:02:03 -08:00
sdong	71874c5aaf	Fix LITE build with DBTest2.AutoPrefixMode1 (#6346 ) Summary: DBTest2.AutoPrefixMode1 doesn't pass because auto prefix mode is not supported there. Fix it by disabling the test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6346 Test Plan: Run DBTest2.AutoPrefixMode1 in lite mode Differential Revision: D19627486 fbshipit-source-id: fbde75260aeecb7e6fc406e09c19a71a95aa5f08	2020-01-29 16:43:42 -08:00
sdong	8f2bee6747	Add ReadOptions.auto_prefix_mode (#6314 ) Summary: Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314 Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run. Differential Revision: D19458717 fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce	2020-01-28 14:44:05 -08:00
sdong	d87cffaea4	Fix another bug caused by recent hash index fix (#6305 ) Summary: Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305 Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0 Differential Revision: D19438925 fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217	2020-01-17 01:41:04 -08:00
sdong	f8b5ef85ec	Fix a bug caused by recent fix of Prefix Hash (#6302 ) Summary: Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implict reseek to return wrong results sometimes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302 Test Plan: Add a unit test that would fail. Not fix. crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure. Differential Revision: D19424572 fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566	2020-01-16 10:47:20 -08:00
sdong	d2b4d42d4b	Fix kHashSearch bug with SeekForPrev (#6297 ) Summary: When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target. Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297 Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix. Differential Revision: D19404695 fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3	2020-01-15 14:28:39 -08:00
sdong	76c117b24b	Fix LITE test build broken by recent commit (#6295 ) Summary: A recent commit adds a unit test that uses a function not available in LITE build. Fix it by avoiding the call Pull Request resolved: https://github.com/facebook/rocksdb/pull/6295 Test Plan: Run the test in LITE build and see it passes. Differential Revision: D19395678 fbshipit-source-id: 37b42835bae02511630d80f7cafb1179401bc033	2020-01-14 13:17:04 -08:00
sdong	894c6d21af	Bug when multiple files at one level contains the same smallest key (#6285 ) Summary: The fractional cascading index is not correctly generated when two files at the same level contains the same smallest or largest user key. The result would be that it would hit an assertion in debug mode and lower level files might be skipped. This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to further checked. The fix is to fix the fractional cascading index. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285 Test Plan: Add a unit test which would cause the assertion which would be fixed. Differential Revision: D19358426 fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e	2020-01-13 16:27:42 -08:00
Maysam Yabandeh	eff5e076f5	unordered_write incompatible with max_successive_merges (#6284 ) Summary: unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284 Differential Revision: D19356115 Pulled By: maysamyabandeh fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6	2020-01-10 16:53:19 -08:00
Yanqin Jin	a8b1085ae2	Fix test in LITE mode (#6267 ) Summary: Currently, the recently-added test DBTest2.SwitchMemtableRaceWithNewManifest fails in LITE mode since SetOptions() returns "Not supported". I do not want to put `#ifndef ROCKSDB_LITE` because it reduces test coverage. Instead, just trigger compaction on a different column family. The bg compaction thread calling LogAndApply() may race with thread calling SwitchMemtable(). Test Plan (dev server): make check OPT=-DROCKSDB_LITE make check or run DBTest2.SwitchMemtableRaceWithNewManifest 100 times. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6267 Differential Revision: D19301309 Pulled By: riversand963 fbshipit-source-id: 88cedcca2f985968ed3bb234d324ffa2aa04ca50	2020-01-07 13:47:03 -08:00
Yanqin Jin	1aaa145877	Fix a data race for cfd->log_number_ (#6249 ) Summary: A thread calling LogAndApply may release db mutex when calling WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can call SwitchMemtable() and writes to cfd->log_number_. Solution is to cache the cfd->log_number_ before releasing mutex in LogAndApply. Test Plan (on devserver): ``` $COMPILE_WITH_TSAN=1 make db_stress $./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0 --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35 ``` Then repeat the following multiple times, e.g. 100 after compiling with tsan. ``` $./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249 Differential Revision: D19235077 Pulled By: riversand963 fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173	2020-01-06 20:09:51 -08:00
Maysam Yabandeh	48a678b7c9	Prevent an incompatible combination of options (#6254 ) Summary: allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254 Differential Revision: D19265819 Pulled By: maysamyabandeh fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179	2020-01-02 16:15:06 -08:00
解轶伦	39fcaf8246	delete superversions in BackgroundCallPurge (#6146 ) Summary: I found that CleanupSuperVersion() may block Get() for 30ms+ （per MemTable is 256MB）. Then I found "delete sv" in ~SuperVersion() takes the time. The backtrace looks like this DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() -> DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion() I think it's better to delete in a background thread, please review it。 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146 Differential Revision: D18972066 fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c	2019-12-17 13:22:57 -08:00
sdong	bb23bfe63c	Fix a regression bug on total order seek with prefix enabled and range delete (#6028 ) Summary: Recent change https://github.com/facebook/rocksdb/pull/5861 mistakely use "prefix_extractor_ != nullptr" as the condition to determine whehter prefix bloom filter isused. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakely applied to total order seek, and introduces a bug in following corner case: Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added. Fix the bug by setting prefix_extractor to be null if total order seek is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028 Test Plan: Add a unit test which fails without the fix. Differential Revision: D18479063 fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3	2019-11-13 10:11:34 -08:00
Yanqin Jin	2309fd63bf	Update column families' log number altogether after flushing during recovery (#5856 ) Summary: A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it. The bug can surface in the following way. 1. Database has multiple column families. 2. Between one DB restart, the last log file is corrupted in the middle (not the tail) 3. During restart, DB crashes between flushing between two column families. Then DB will fail to be opened again with error "SST file is ahead of WALs". Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing to all these version edits succeed does RocksDB (atomically) points the CURRENT file to the new MANIFEST. Test plan (on devserver): ``` $make all && make check ``` Specifically ``` $make db_test2 $./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF ``` Also checked for compatibility as follows. Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory. Then checkout 5.4, build ldb, and dump the MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856 Differential Revision: D17620818 Pulled By: riversand963 fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039	2019-10-24 18:29:30 -07:00
sdong	a0cd920026	LevelIterator to avoid gap after prefix bloom filters out a file (#5861 ) Summary: Right now, when LevelIterator::Seek() is called, when a file is filtered out by prefix bloom filter, the position is put to the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidate the whole iterator if the prefix doesn't match. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861 Test Plan: Add a new unit test to validate the behavior; run all exsiting tests; run crash_test Differential Revision: D17918213 fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac	2019-10-21 11:40:57 -07:00
Andrew Kryczka	b00761eea6	Fix block cache ID uniqueness for Windows builds (#5844 ) Summary: Since we do not evict a file's blocks from block cache before that file is deleted, we require a file's cache ID prefix is both unique and non-reusable. However, the Windows functionality we were relying on only guaranteed uniqueness. That meant a newly created file could be assigned the same cache ID prefix as a deleted file. If the newly created file had block offsets matching the deleted file, full cache keys could be exactly the same, resulting in obsolete data blocks returned from cache when trying to read from the new file. We noticed this when running on FAT32 where compaction was writing out of order keys due to reading obsolete blocks from its input files. The functionality is documented as behaving the same on NTFS, although I wasn't able to repro it there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844 Test Plan: we had a reliable repro of out-of-order keys on FAT32 that was fixed by this change Differential Revision: D17752442 fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00	2019-10-11 18:19:31 -07:00
sdong	76e951dbb1	Add a unit test to reproduce a corruption bug (#5851 ) Summary: This is a bug occaionally shows up in crash test, and this unit test is to reproduce it. The bug is following: 1. Database has multiple CFs. 2. Between one DB restart, the last log file is corrupted in the middle (not the tail) 3. During restart, DB crashes between flushes between two CFs. The DB will fail to be opened again with error "SST file is ahead of WALs" Pull Request resolved: https://github.com/facebook/rocksdb/pull/5851 Test Plan: Run the test itself. Differential Revision: D17614721 fbshipit-source-id: 1b0abce49b203a76a039e38e76bc940429975f20	2019-09-26 16:18:42 -07:00
sdong	c06b54d0c6	Apply formatter on recent 45 commits. (#5827 ) Summary: Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827 Test Plan: Run all existing tests. Differential Revision: D17483727 fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f	2019-09-19 12:34:17 -07:00
sdong	43a340bebe	Merging iterator to disble reseek optimization in prefix seek (#5815 ) Summary: We are seeing a bug of wrong results with merging iterator's reseek avoidence feature and prefix extractor. Disable this optimization for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5815 Test Plan: Validated the same MyRocks case was fixed; run all existing tests. Differential Revision: D17430776 fbshipit-source-id: aef664277ba0ab8a2e68331ff0db6ae682535371	2019-09-17 17:10:29 -07:00
andrew	622683000c	Allow users to stop manual compactions (#3971 ) Summary: Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971 Test Plan: add tests at more spots. Differential Revision: D17369043 fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c	2019-09-16 21:01:47 -07:00
Maysam Yabandeh	638d239507	Charge block cache for cache internal usage (#5797 ) Summary: For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor. This PR builds up on https://github.com/facebook/rocksdb/issues/4258 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797 Test Plan: - Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata. - The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata. Differential Revision: D17396833 Pulled By: maysamyabandeh fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f	2019-09-16 15:26:21 -07:00
Yanqin Jin	5d9a67e718	Support loading custom objects in unit tests (#5676 ) Summary: Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc. This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning. By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop. Test plan (on devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 all $make check ``` All unit tests must pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676 Differential Revision: D16679157 Pulled By: riversand963 fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d	2019-08-09 15:12:08 -07:00
Vijay Nadimpalli	d150e01474	New API to get all merge operands for a Key (#5604 ) Summary: This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases: 1. Update subset of columns and read subset of columns - Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU. 2. Updating very few attributes in a value which is a JSON-like document - Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge. ---------------------------------------------------------------------------------------------------- API : Status GetMergeOperands( const ReadOptions& options, ColumnFamilyHandle* column_family, const Slice& key, PinnableSlice* merge_operands, GetMergeOperandsOptions* get_merge_operands_options, int* number_of_operands) Example usage : int size = 100; int number_of_operands = 0; std::vector<PinnableSlice> values(size); GetMergeOperandsOptions merge_operands_info; db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands); Description : Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion. merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604 Test Plan: Added unit test and perf test in db_bench that can be run using the command: ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist Differential Revision: D16657366 Pulled By: vjnadimpalli fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf	2019-08-06 14:26:44 -07:00
sdong	66b5613d0c	row_cache to share entry for recent snapshots (#5600 ) Summary: Right now, users cannot take advantage of row cache, unless no snapshot is used, or Get() is repeated for the same snapshots. This limits the usage of row cache. This change eliminate this restriction in some cases. If the snapshot used is newer than the largest sequence number in the file, and write callback function is not registered, the same row cache key is used as no snapshot is given. We still need the callback function restriction for now because the callback function may filter out different keys for different snapshots even if the snapshots are new. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5600 Test Plan: Add a unit test. Differential Revision: D16386616 fbshipit-source-id: 6b7d214bd215d191b03ccf55926ad4b703ec2e53	2019-07-22 18:56:19 -07:00
Sagar Vemuri	1b59a490ef	Fix flaky DBTest2.PresetCompressionDict test (#5378 ) Summary: Fix flaky DBTest2.PresetCompressionDict test. This PR fixes two issues with the test: 1. Replaces `GetSstFiles` with `TotalSize`, which is based on `DB::GetColumnFamilyMetaData` so that only the size of the live SST files is taken into consideration when computing the total size of all sst files. Earlier, with `GetSstFiles`, even obsolete files were getting picked up. 1. In ZSTD compression, it is sometimes possible that using a trained dictionary is not better than using an untrained one. Using a trained dictionary performs well in 99% of the cases, but still in the remaining ~1% of the cases (out of 10000 runs) using an untrained dictionary gets better compression results. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5378 Differential Revision: D15559100 Pulled By: sagar0 fbshipit-source-id: c35adbf13871f520a2cec48f8bad9ff27ff7a0b4	2019-05-30 16:11:27 -07:00
Siying Dong	4e0f2aadb0	DB::Close() to fail when there are unreleased snapshots (#5272 ) Summary: Sometimes, users might make mistake of not releasing snapshots before closing the DB. This is undocumented use of RocksDB and the behavior is unknown. We return DB::Close() to provide a way to check it for the users. Aborted() will be returned to users when they call DB::Close(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272 Differential Revision: D15159713 Pulled By: siying fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd	2019-05-01 10:17:30 -07:00
Zhongyi Xie	baa5302447	Avoid double-compacting data in bottom level in manual compactions (#5138 ) Summary: Depending on the config, manual compaction (leveled compaction style) does following compactions: L0->L1 L1->L2 ... Ln-1 -> Ln Ln -> Ln The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only. Resolves issue https://github.com/facebook/rocksdb/issues/4995 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138 Differential Revision: D14940106 Pulled By: miasantreble fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5	2019-04-16 23:32:20 -07:00
Siying Dong	beb44ec3eb	WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175 ) Summary: Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget. Change it to 256KB accordingly. There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175 Differential Revision: D14954289 Pulled By: siying fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94	2019-04-16 12:03:07 -07:00
Siying Dong	85b2bde3dd	Still implement StatisticsImpl::measureTime() (#5181 ) Summary: Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() is not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented as forwarded to StatisticsImpl, and causes assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether we need to remove it after several releases. Also, add a test to make sure that a Statistics implementation using the old interface still works. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181 Differential Revision: D14907089 Pulled By: siying fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a	2019-04-12 11:00:35 -07:00
Siying Dong	ed9f5e21aa	Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165 ) Summary: Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory. Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165 Differential Revision: D14880709 Pulled By: siying fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d	2019-04-11 10:45:36 -07:00
Zhichao Cao	ebb9b2ed16	Fix the potential DB crash caused by call EndTrace before StartTrace (#5130 ) Summary: Although user should first call StartTrace to begin the RocksDB tracing function and call EndTrace to stop the tracing process, user can accidentally call EndTrace first. It will cause segment fault and crash the DB instance. The issue is fixed by checking the pointer first. Test case added in db_test2. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5130 Differential Revision: D14691420 Pulled By: zhichao-cao fbshipit-source-id: 3be13d2f944bc453728ef8eef67b68d7ad0939c8	2019-04-03 13:26:34 -07:00
Maysam Yabandeh	14b3f683a1	WriteUnPrepared: less virtual in iterator callback (#5049 ) Summary: WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads. The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call. Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots. The following benchmark shows +0.26% higher throughput in seekrandom benchmark. Benchmark: ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 seekrandom [AVG 10 runs] : 20355 ops/sec; 225.2 MB/sec seekrandom [MEDIAN 10 runs] : 20425 ops/sec; 225.9 MB/sec ./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 seekrandom [AVG 10 runs] : 20409 ops/sec; 225.8 MB/sec seekrandom [MEDIAN 10 runs] : 20487 ops/sec; 226.6 MB/sec Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049 Differential Revision: D14366459 Pulled By: maysamyabandeh fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef	2019-04-02 14:47:16 -07:00
Shi Feng	01e6badbb6	Introduce CPU timers for iterator seek and next (#5076 ) Summary: Introduce CPU timers for iterator seek and next operations. Seek counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the caveat that SeekToLast timer doesn't include some post processing time if upper bound is defined. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076 Differential Revision: D14525218 Pulled By: fredfsh fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9	2019-03-26 16:32:13 -07:00
Wenjie Yang	36c2a7cfb1	Add an option to filter traces (#5082 ) Summary: Add an option to filter out READ or WRITE operations while tracing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082 Differential Revision: D14515083 Pulled By: mrmiywj fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610	2019-03-19 14:36:51 -07:00
Maysam Yabandeh	a661c0d208	WritePrepared: optimize read path by avoiding virtual (#5018 ) Summary: The read path includes a callback function, ReadCallback, which would eventually calls IsInSnapshot to figure if a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function call to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of majority of the cases. The patch moves those to a non-virtual function in the the parent class, ReadCallback, to lower the virtual callback cost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018 Differential Revision: D14226562 Pulled By: maysamyabandeh fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44	2019-02-26 16:56:19 -08:00
Siying Dong	93f7e7a450	Temporarily Disable DBTest2.PresetCompressionDict (#5003 ) Summary: DBTest2.PresetCompressionDict is flaky. Temparily disable it for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5003 Differential Revision: D14139505 Pulled By: siying fbshipit-source-id: ebf1872d364b76b2cb021b489ea2f17ee997116a	2019-02-19 14:44:12 -08:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Andrew Kryczka	62f70f6d14	Reduce scope of compression dictionary to single SST (#4952 ) Summary: Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio. So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include: - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called. - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up. - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952 Differential Revision: D13967980 Pulled By: ajkr fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f	2019-02-11 19:47:32 -08:00
tang-jianfeng	08809f5e6c	Implement trace sampling (#4963 ) Summary: Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963 Differential Revision: D14011190 Pulled By: tang-jianfeng fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a	2019-02-08 18:08:18 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	2019-01-18 19:12:57 -08:00
Maysam Yabandeh	d56ac22b44	Remove duplicates from SnapshotList::GetAll (#4860 ) Summary: The vector returned by SnapshotList::GetAll could have duplicate entries if two separate snapshots have the same sequence number. However, when this vector is used in compaction the duplicate entires are of no use and could be safely ignored. Moreover not having duplicate entires simplifies reasoning in the compaction_iterator.cc code. For example when searching for the previous_snap we currently use the snapshot before the current one but the way the code uses that it expects it to be also less than the current snapshot, which would be simpler to read if there is no duplicate entry in the snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4860 Differential Revision: D13615502 Pulled By: maysamyabandeh fbshipit-source-id: d45bf01213ead5f39db811f951802da6fcc3332b	2019-01-09 16:25:42 -08:00
Siying Dong	8641e9adf7	Non-initial file preloading should always prefetch index and filter (#4852 ) Summary: https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1. It doesn't preload index and filter in non-initial file loading case. This is a little bit too complicated to understand. We observed in one MyRocks use case where the filter is expected to be preloaded but is not. To simplify the use case, we simply always prefetch the index and filter. They anyway is expected to be loaded in the file verification phase anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852 Differential Revision: D13595402 Pulled By: siying fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31	2019-01-08 12:47:34 -08:00
Siying Dong	f0dda35d7d	Preload some files even if options.max_open_files (#3340 ) Summary: Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340 Differential Revision: D6686945 Pulled By: siying fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f	2018-12-28 18:02:28 -08:00
Siying Dong	da1c64b6e7	Introduce a CPU time counter in perf_context (#4741 ) Summary: Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future. Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform. Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741 Differential Revision: D13287798 Pulled By: siying fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f	2018-12-20 12:03:44 -08:00
Zhichao Cao	7125e24619	Add the max trace file size limitation option to Tracing (#4610 ) Summary: If user do not end the trace manually, the tracing will continue which can potential use up all the storage space and cause problem. In this PR, the max trace file size is added to the TraceOptions and user can set the value if they need or the default is 64GB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610 Differential Revision: D12893400 Pulled By: zhichao-cao fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953	2018-11-27 14:27:05 -08:00
Siying Dong	13579e8c5a	WriteBufferManger doens't cost to cache if no limit is set (#4695 ) Summary: WriteBufferManger is not invoked when allocating memory for memtable if the limit is not set even if a cache is passed. It is inconsistent from the comment syas. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695 Differential Revision: D13112722 Pulled By: siying fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4	2018-11-18 16:55:43 -08:00
Siying Dong	abb1a8fc23	Add a unit test to assert number of preads (#4657 ) Summary: We used to have a bug, which caused every block to be read twice, and none of our tests caught it. Add a very simply unit test to make sure that when reading a data block, we only issue one pread against the SST file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4657 Differential Revision: D13005260 Pulled By: siying fbshipit-source-id: 03167b554ad2451192b1707415536d7d05e9026c	2018-11-13 12:52:19 -08:00
Sagar Vemuri	2993cd2002	Fix RocksDB Lite build (#4675 ) Summary: Our internal CI test caught RocksDB Lite build failures. The failures are due to a new test introduced in #4665 using `SSTFileWriter` and `IngestExternalFile`, but these is not exposed under lite mode. Fixed by #ifdef'ing out the test. ``` db/db_test2.cc: In member function ‘virtual void rocksdb::DBTest2_TestCompactFiles_Test::TestBody()’: db/db_test2.cc:2907:3: error: ‘SstFileWriter’ is not a member of ‘rocksdb’ rocksdb::SstFileWriter sst_file_writer{rocksdb::EnvOptions(), options}; ^ In file included from ./util/testharness.h:15:0, from ./table/mock_table.h:23, from ./db/db_test_util.h:44, from db/db_test2.cc:13: db/db_test2.cc:2912:13: error: ‘sst_file_writer’ was not declared in this scope ASSERT_OK(sst_file_writer.Open(external_file1)); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4675 Differential Revision: D13035984 Pulled By: sagar0 fbshipit-source-id: c1ceac550dfac1a85eeea436693dc7dd467519a6	2018-11-12 19:01:37 -08:00
DorianZheng	0f88160f67	Fix `CompactFiles` bug (#4665 ) Summary: `CompactFiles` gets `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names` The timeline of execution flow is as follow: Let's say that level N has two file [1,2] and [5,6] ``` timeline user_thread1 user_thread2 t0 \| CompactFiles([1, 2], [5, 6]) begin t1 \| GetReferencedSuperVersion() t2 \| IngestExternalFile([3,4]) to level N begin t3 \| CompactFiles resume V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665 Differential Revision: D13030674 Pulled By: ajkr fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51	2018-11-12 14:32:18 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Peter Pei	09814f2cfc	support OnCompactionBegin (#4431 ) Summary: fix #4288 Add `OnCompactionBegin` support to `rocksdb::EventListener`. Currently, we only have these three callbacks: - OnFlushBegin - OnFlushCompleted - OnCompactionCompleted As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`. This PR is a try to implement the support of `OnCompactionBegin`. Hope it is useful to you. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431 Differential Revision: D10055515 Pulled By: yiwu-arbug fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334	2018-10-10 17:32:27 -07:00
DorianZheng	27090ae8f6	Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391 ) Summary: - Fix DBImpl API race condition The timeline of execution flow is as follow: ``` timeline user_thread1 user_thread2 t1 \| cfh = GetColumnFamilyHandleUnlocked(0) t2 \| id1 = cfh->GetID() t3 \| GetColumnFamilyHandleUnlocked(1) t4 \| id2 = cfh->GetID() V ``` The original implementation return a pointer to a stateful variable, so that the return `ColumnFamilyHandle` will be changed when another thread calls `GetColumnFamilyHandleUnlocked` with different `column family id` - Expose ColumnFamily ID to compaction event listener - Fix the return status of `DBImpl::GetLatestSequenceForKey` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391 Differential Revision: D10221243 Pulled By: yiwu-arbug fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993	2018-10-08 14:24:16 -07:00
Sagar Vemuri	3db584059c	Remove sync point from Block destructor (#4370 ) Summary: AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const ==1798517==ABORTING ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370 Differential Revision: D9844146 Pulled By: sagar0 fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1	2018-09-15 00:12:57 -07:00
Siying Dong	f3d91a0b57	Add a unit test to verify iterators release data blocks after using them (#4170 ) Summary: Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170 Differential Revision: D8962513 Pulled By: siying fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407	2018-08-13 17:43:14 -07:00
Zhichao Cao	6d75319d95	Add tracing function of Seek() and SeekForPrev() to trace_replay (#4228 ) Summary: In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file. Replay of Iterator is not supported in the current implementation. Tested with trace_analyzer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228 Differential Revision: D9201381 Pulled By: zhichao-cao fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab	2018-08-10 17:57:40 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Maysam Yabandeh	0a5b5d88b2	Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070 ) Summary: Lite does not support readonly DBs. Closes https://github.com/facebook/rocksdb/pull/4070 Differential Revision: D8677858 Pulled By: maysamyabandeh fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa	2018-06-28 08:42:17 -07:00
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	2018-06-27 17:13:34 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Andrew Kryczka	fea2b1dfb2	Copy Get() result when file reads use mmap Summary: For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region. For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`). This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied. This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime. Closes https://github.com/facebook/rocksdb/pull/3881 Differential Revision: D8076288 Pulled By: ajkr fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08	2018-06-01 16:57:58 -07:00
Andrew Kryczka	072ae671a7	Apply use_direct_io_for_flush_and_compaction to writes only Summary: Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used. This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously. Closes https://github.com/facebook/rocksdb/pull/3829 Differential Revision: D7915443 Pulled By: ajkr fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279	2018-05-09 19:42:58 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Phani Shekhar Mantripragada	446b32cfc3	Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d	2018-04-05 19:58:20 -07:00
Dmitri Smirnov	c364eb42b5	Windows cumulative patch Summary: This patch addressed several issues. Portability including db_test std::thread -> port::Thread Cc: @ and %z to ROCKSDB portable macro. Cc: maysamyabandeh Implement Env::AreFilesSame Make the implementation of file unique number more robust Get rid of C-runtime and go directly to Windows API when dealing with file primitives. Implement GetSectorSize() and aling unbuffered read on the value if available. Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976 Fix test running script issue where $status var was of incorrect scope so the failures were swallowed and not reported. DestroyDB() creates a logger and opens a LOG file in the directory being cleaned up. This holds a lock on the folder and the cleanup is prevented. This fails one of the checkpoin tests. We observe the same in production. We close the log file in this change. Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test attempts to open a directory with NewRandomAccessFile which does not work on Windows. Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug Closes https://github.com/facebook/rocksdb/pull/3552 Differential Revision: D7156304 Pulled By: siying fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b	2018-03-06 11:57:43 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Andrew Kryczka	ab5ab36ac2	fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose file ID support check Summary: Updated the test case to handle tmpfs mounted at directories different from "/dev/shm/". Closes https://github.com/facebook/rocksdb/pull/3440 Differential Revision: D6848213 Pulled By: ajkr fbshipit-source-id: 465e9dbf0921d0930161f732db6b3766bb030589	2018-01-30 16:50:42 -08:00
Andrew Kryczka	46e599fc6b	fix live WALs purged while file deletions disabled Summary: When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken. We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled. Closes https://github.com/facebook/rocksdb/pull/3341 Differential Revision: D6681131 Pulled By: ajkr fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634	2018-01-17 17:42:04 -08:00
Andrew Kryczka	93f69cb93a	use bottommost compression when base level is bottommost Summary: The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen: - full compaction generates files with `bottommost_compression` type - now base level is bottommost level since all files are in the same level - any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level I changed the code to make bottommost compression apply to base level. Closes https://github.com/facebook/rocksdb/pull/3141 Differential Revision: D6264614 Pulled By: ajkr fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82	2017-11-09 17:42:00 -08:00
Andrew Kryczka	24ad430600	pass key/value samples through zstd compression dictionary generator Summary: Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before. Note this is the first step of #2987, extracted into a separate PR per reviewer request. Closes https://github.com/facebook/rocksdb/pull/3057 Differential Revision: D6116891 Pulled By: ajkr fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497	2017-11-02 22:56:36 -07:00
Yi Wu	8e63cad078	fix lite build Summary: * make `checksum_type_string_map` available for lite * comment out `FilesPerLevel` in lite mode. * travis and legocastle lite build also build `all` target and run tests Closes https://github.com/facebook/rocksdb/pull/3015 Differential Revision: D6069822 Pulled By: yiwu-arbug fbshipit-source-id: 9fe92ac220e711e9e6ed4e921bd25ef4314796a0	2017-10-17 08:57:09 -07:00
Maysam Yabandeh	f46464d383	write-prepared txn: call IsInSnapshot Summary: This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot. Closes https://github.com/facebook/rocksdb/pull/2850 Differential Revision: D5787375 Pulled By: maysamyabandeh fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c	2017-09-11 09:14:48 -07:00
Artem Danilov	8a6708f5f2	Extend property map with compaction stats Summary: This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch. Closes https://github.com/facebook/rocksdb/pull/2794 Differential Revision: D5717676 Pulled By: Tema fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a	2017-08-30 15:26:55 -07:00
Yi Wu	3c840d1a6d	Allow DB reopen with reduced options.num_levels Summary: Allow user to reduce number of levels in LSM by issue a full CompactRange() and put the result in a lower level, and then reopen DB with reduced options.num_levels. Previous this will fail on reopen on when recovery replaying the previous MANIFEST and found a historical file was on a higher level than the new options.num_levels. The workaround was after CompactRange(), reopen the DB with old num_levels, which will create a new MANIFEST, and then reopen the DB again with new num_levels. This patch relax the check of levels during recovery. It allows DB to open if there was a historical file on level > options.num_levels, but was also deleted. Closes https://github.com/facebook/rocksdb/pull/2740 Differential Revision: D5629354 Pulled By: yiwu-arbug fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d	2017-08-24 16:10:54 -07:00
Siying Dong	666a005f9b	Support prefetch last 512KB with direct I/O in block based file reader Summary: Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too. Closes https://github.com/facebook/rocksdb/pull/2708 Differential Revision: D5593091 Pulled By: siying fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7	2017-08-11 12:16:45 -07:00
Siying Dong	e7697b8ce8	Fix LITE unit tests Summary: Closes https://github.com/facebook/rocksdb/pull/2649 Differential Revision: D5505778 Pulled By: siying fbshipit-source-id: 7e935603ede3d958ea087ed6b8cfc4121e8797bc	2017-07-26 21:11:47 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Ewout Prangsma	51778612c9	Encryption at rest support Summary: This PR adds support for encrypting data stored by RocksDB when written to disk. It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files. The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done. Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!). The Counter operation mode uses an initial counter & random initialization vector (IV). Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize). The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there. To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests. Typically you would run it like this: ``` ENCRYPTED_ENV=1 make check_some ``` There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings. Closes https://github.com/facebook/rocksdb/pull/2424 Differential Revision: D5322178 Pulled By: sdwilsh fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f	2017-06-26 16:56:24 -07:00
Andrew Kryczka	c217e0b9c7	Call RateLimiter for compaction reads Summary: Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed. Closes https://github.com/facebook/rocksdb/pull/2433 Differential Revision: D5216946 Pulled By: ajkr fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e	2017-06-13 14:56:46 -07:00
Siying Dong	52a7f38b19	WriteOptions.low_pri which can throttle low pri writes if needed Summary: If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown. Closes https://github.com/facebook/rocksdb/pull/2369 Differential Revision: D5127619 Pulled By: siying fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0	2017-06-05 15:02:35 -07:00
Siying Dong	95b0e89b5d	Improve write buffer manager (and allow the size to be tracked in block cache) Summary: Improve write buffer manager in several ways: 1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower. 2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit. 3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable. Closes https://github.com/facebook/rocksdb/pull/2350 Differential Revision: D5110648 Pulled By: siying fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017	2017-06-02 14:26:56 -07:00
Andrew Kryczka	bb01c1880c	Introduce max_background_jobs mutable option Summary: - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure. - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions. - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled. Closes https://github.com/facebook/rocksdb/pull/2205 Differential Revision: D4937461 Pulled By: ajkr fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9	2017-05-24 11:29:08 -07:00
Aaron Gao	492fc49a86	fix readampbitmap tests Summary: fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose. test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check Closes https://github.com/facebook/rocksdb/pull/2271 Differential Revision: D5038133 Pulled By: lightmark fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8	2017-05-10 12:29:23 -07:00
Andrew Kryczka	be421b0b16	portable sched_getcpu calls Summary: - added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available. - refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred Closes https://github.com/facebook/rocksdb/pull/2272 Differential Revision: D5038093 Pulled By: ajkr fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7	2017-05-10 12:29:23 -07:00
Aaron Gao	259a00eaca	unbiase readamp bitmap Summary: Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a) and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31. An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block. It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i32, (i+1)32). This diff makes each bit represent a single byte: i32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte. But there is one exception: the last bit will always set with the old way.* (*) - assuming read_amp_bytes_per_bit = 32. Closes https://github.com/facebook/rocksdb/pull/2259 Differential Revision: D5035652 Pulled By: lightmark fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356	2017-05-10 01:49:52 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Tomas Kolda	04d58970cb	AIX and Solaris Sparc Support Summary: Replacement of #2147 The change was squashed due to a lot of conflicts. Closes https://github.com/facebook/rocksdb/pull/2194 Differential Revision: D4929799 Pulled By: siying fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581	2017-04-21 20:48:04 -07:00
Aaron Gao	44fa8ece9b	change use_direct_writes to use_direct_io_for_flush_and_compaction Summary: Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get(). Closes https://github.com/facebook/rocksdb/pull/2117 Differential Revision: D4860912 Pulled By: lightmark fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19	2017-04-13 16:12:04 -07:00
Sagar Vemuri	6a6723ee1e	Move MergeOperatorPinning tests to be with other merge operator tests Summary: Moved MergeOperatorPinning tests from db_test2.cc to db_merge_operator_test.cc. [This is the same code as PR #2104 , which has already been reviewed, but I am creating a new PR as I cannot import from #2104 onto phabricator anymore even after rebasing. I'll close and discard #2104.] Closes https://github.com/facebook/rocksdb/pull/2125 Differential Revision: D4863312 Pulled By: sagar0 fbshipit-source-id: 0f71a7690aa09c1d03ee85ce2bc1d2d89e4f4399	2017-04-11 16:15:06 -07:00
Maysam Yabandeh	8b0097b49b	Readers for partition filter Summary: This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests. Closes https://github.com/facebook/rocksdb/pull/1961 Differential Revision: D4672216 Pulled By: maysamyabandeh fbshipit-source-id: 6a2b829	2017-03-22 09:24:15 -07:00
Siying Dong	8f5bf04468	Flush triggered by DB write buffer size picks the oldest unflushed CF Summary: Previously, when DB write buffer size triggers, we always pick the CF with most data in its memtable to flush. This approach can minimize total flush happens. Change the behavior to always pick the oldest unflushed CF, which makes it the same behavior when max_total_wal_size hits. This approach will minimize size used by max_total_wal_size. Closes https://github.com/facebook/rocksdb/pull/1987 Differential Revision: D4703214 Pulled By: siying fbshipit-source-id: 9ff8b09	2017-03-21 11:09:10 -07:00
Sagar Vemuri	97edc72d39	Add a memtable-only iterator Summary: This PR is to support a way to iterate over all the keys that are just in memtables. Closes https://github.com/facebook/rocksdb/pull/1953 Differential Revision: D4663500 Pulled By: sagar0 fbshipit-source-id: 144e177	2017-03-07 11:54:10 -08:00
Aaron Gao	286a36db7f	posix writablefile truncate Summary: we occasionally missing this call so the file size will be wrong Closes https://github.com/facebook/rocksdb/pull/1894 Differential Revision: D4598446 Pulled By: lightmark fbshipit-source-id: 42b6ef5	2017-02-22 10:09:14 -08:00
Dmitri Smirnov	0a4cdde50a	Windows thread Summary: introduce new methods into a public threadpool interface, - allow submission of std::functions as they allow greater flexibility. - add Joining methods to the implementation to join scheduled and submitted jobs with an option to cancel jobs that did not start executing. - Remove ugly `#ifdefs` between pthread and std implementation, make it uniform. - introduce pimpl for a drop in replacement of the implementation - Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread. - Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation. - should be no functionality changes. Closes https://github.com/facebook/rocksdb/pull/1823 Differential Revision: D4492902 Pulled By: siying fbshipit-source-id: c74cb11	2017-02-06 14:54:18 -08:00
Siying Dong	036d668b19	Fix wrong result in data race case related to Get() Summary: In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version. Closes https://github.com/facebook/rocksdb/pull/1816 Differential Revision: D4475958 Pulled By: siying fbshipit-source-id: bd9e67a	2017-02-03 11:39:15 -08:00
Siying Dong	f25f1ec60b	Add test DBTest2.GetRaceFlush which can expose a data race bug Summary: A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now. Closes https://github.com/facebook/rocksdb/pull/1813 Differential Revision: D4472310 Pulled By: siying fbshipit-source-id: 5755ebd	2017-01-26 16:39:14 -08:00
Siying Dong	0e8dfd6062	Fix OptimizeForPointLookup() Summary: If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloomo filter instead. Closes https://github.com/facebook/rocksdb/pull/1791 Differential Revision: D4442836 Pulled By: siying fbshipit-source-id: bf6c9cd	2017-01-20 10:54:12 -08:00
Yi Wu	5d1457dbbf	Dump persistent cache options Summary: Dump persistent cache options Closes https://github.com/facebook/rocksdb/pull/1679 Differential Revision: D4337019 Pulled By: yiwu-arbug fbshipit-source-id: 3812f8a	2016-12-19 14:09:12 -08:00
Karthikeyan Radhakrishnan	4118e13330	Persistent Cache: Expose stats to user via public API Summary: Exposing persistent cache stats (counters) to the user via public API. Closes https://github.com/facebook/rocksdb/pull/1485 Differential Revision: D4155274 Pulled By: siying fbshipit-source-id: 30a9f50	2016-11-21 17:39:13 -08:00
Siying Dong	972e3ff295	Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default Summary: Closes https://github.com/facebook/rocksdb/pull/1496 Differential Revision: D4168080 Pulled By: siying fbshipit-source-id: 056ae62	2016-11-16 09:39:09 -08:00
Maysam Yabandeh	361010d447	Exporting compaction stats in the form of a map Summary: Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present the stats in any format required by the them. Closes https://github.com/facebook/rocksdb/pull/1477 Differential Revision: D4149836 Pulled By: maysamyabandeh fbshipit-source-id: b3df19f	2016-11-11 20:54:14 -08:00
Islam AbdelRahman	5691a1d8a4	Fix compaction conflict with running compaction Summary: Issue scenario: (1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2 (2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files) (3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile() Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D64947	2016-10-13 10:49:06 -07:00
Aaron Gao	715256338a	forbid merge during recovery Summary: Mitigate regression bug of options.max_successive_merges hit during DB Recovery For https://reviews.facebook.net/D62625 Test Plan: make all check Reviewers: horuff, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62655	2016-09-21 11:05:07 -07:00
sdong	b666f85445	Consider more factors when determining preallocation size of WAL files Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller. Test Plan: Add a unit test. Reviewers: andrewkr, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63957	2016-09-19 12:04:35 -07:00
sdong	607628d349	Support ZSTD with finalized format Summary: ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility. Still keep ZSTDNotFinal for compatibility reason. Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63141	2016-09-06 12:22:16 -07:00
sdong	32149059f9	Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes. Test Plan: Add two new unit tests. Run all existing tests, including jtest. Reviewers: yhchiang, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59829	2016-09-01 14:33:24 -07:00
Islam AbdelRahman	b49b92cf28	Introduce Read amplification bitmap (read amp statistics) Summary: Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read the bitmap will contain (block_size / read_amp_bytes_per_bit) bits. We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification Test Plan: added new tests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: yiwu, leveldb, march, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58707	2016-08-26 18:55:58 -07:00
sdong	dade61ac26	Mitigate regression bug of options.max_successive_merges hit during DB Recovery Summary: After `1b8a2e8fdd`, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition. This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed. Test Plan: Add a new test and run it. Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff Reviewed By: kradhakrishnan Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62625	2016-08-25 17:30:34 -07:00
Islam AbdelRahman	d11c09d9e2	Eliminate memcpy from ForwardIterator Summary: This diff update ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators This diff is stacked on D61305 Test Plan: existing tests (updated them to test tailing iterator) new test Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60009	2016-08-11 19:10:16 -07:00
omegaga	e70020e4f6	Only cache level 0 indexes and filter when opening table reader Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1). Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?) Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59913	2016-07-20 11:23:31 -07:00
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice> This diff is stacked on top of D56493 and D56511 In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	2016-07-20 09:49:03 -07:00
omegaga	cd4178a015	Add a new feature to enforce a sync point only active on a thread Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point. Test Plan: Write a unit test to validate. Reviewers: sdong, IslamAbdelRahman, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60375	2016-07-07 11:29:14 -07:00
sdong	32df9733d1	Add options.write_buffer_manager: control total memtable size across DB instances Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances. Test Plan: Add a new unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59925	2016-07-05 18:11:25 -07:00
Islam AbdelRahman	c70a9335de	Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles() Summary: NotifyOnCompactionCompleted can unlock the mutex. That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles(). Test Plan: added unittest existing unittest Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58065	2016-05-18 14:56:30 -07:00
krad	a08c8c851a	Added PersistentCache abstraction Summary: Added a new abstraction to cache page to RocksDB designed for the read cache use. RocksDB current block cache is more of an object cache. For the persistent read cache project, what we need is a page cache equivalent. This changes adds a cache abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache uncompressed pages or raw pages (content as in filesystem). The user can choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode. Blame Rev: Test Plan: Run unit tests Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55707	2016-05-15 22:17:18 -07:00
Islam AbdelRahman	d86f9b9c3f	Fix lite build Summary: Fix lite build Test Plan: run under lite Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57945	2016-05-09 16:08:30 -07:00
Islam AbdelRahman	4b31723433	Add bottommost_compression option Summary: Add a new option that can be used to set a specific compression algorithm for bottommost level. This option will only affect levels larger than base level. I have also updated CompactionJobInfo to include the compression algorithm used in compaction Test Plan: added new unittest existing unittests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: lightmark, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D57669	2016-05-09 15:57:19 -07:00
Andrew Kryczka	a06faa6327	Skip PresetCompressionDict test for lite Summary: This test relies on "rocksdb.num-files-at-levelN" property that isn't implemented in rocksdb lite. So we will compile it only for non-lite builds. Test Plan: $ make -j40 check 'OPT=-g -DROCKSDB_LITE' Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57387	2016-04-28 15:11:28 -07:00
Andrew Kryczka	0f428c5619	Fix compression dictionary clang osx error Summary: There was one narrowing conversion in D52287 that only showed up with clang on osx. Test Plan: $ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check Reviewers: sdong, lightmark, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57357	2016-04-28 10:42:10 -07:00
Andrew Kryczka	843d2e3137	Shared dictionary compression using reference block Summary: This adds a new metablock containing a shared dictionary that is used to compress all data blocks in the SST file. The size of the shared dictionary is configurable in CompressionOptions and defaults to 0. It's currently only used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of the compression type if the user chooses a nonzero dictionary size. During compaction, computes the dictionary by randomly sampling the first output file in each subcompaction. It pre-computes the intervals to sample by assuming the output file will have the maximum allowable length. In case the file is smaller, some of the pre-computed sampling intervals can be beyond end-of-file, in which case we skip over those samples and the dictionary will be a bit smaller. After the dictionary is generated using the first file in a subcompaction, it is loaded into the compression library before writing each block in each subsequent file of that subcompaction. On the read path, gets the dictionary from the metablock, if it exists. Then, loads that dictionary into the compression library before reading each block. Test Plan: new unit test Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52287	2016-04-27 17:36:03 -07:00
Yi Wu	792762c42c	Split db_test.cc Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase. Test Plan: make check Reviewers: sdong, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong Differential Revision: https://reviews.facebook.net/D56715	2016-04-18 09:42:50 -07:00
Victor Tyutyunov	e5c614e1df	Fixing snapshot 0 assertion Summary: Solution is not to change db sequence number to start from 1 because 0 value is used in multiple other places. Fix covers only compact_iterator::findEarliestVisibleSnapshot with updated logic to support snapshot's numbering starting from 0. Test Plan: run: make all check it should pass all tests Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: lgalanis, mgalushka, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56601	2016-04-16 01:47:15 -07:00

1 2 3 4 5 ...

260 Commits