rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-30 04:41:49 +00:00

Author	SHA1	Message	Date
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	2021-03-15 04:34:11 -07:00
Levi Tamasi	a46f080cce	Break down the amount of data written during flushes/compactions per file type (#8013 ) Summary: The patch breaks down the "bytes written" (as well as the "number of output files") compaction statistics into two, so the values are logged separately for table files and blob files in the info log, and are shown in separate columns (`Write(GB)` for table files, `Wblob(GB)` for blob files) when the compaction statistics are dumped. This will also come in handy for fixing the write amplification statistics, which currently do not consider the amount of data read from blob files during compaction. (This will be fixed by an upcoming patch.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26742156 Pulled By: ltamasi fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5	2021-03-02 09:48:00 -08:00
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	2021-01-25 22:09:11 -08:00
Yanqin Jin	e062a719cc	Fix assertion failure in bg flush (#7362 ) Summary: https://github.com/facebook/rocksdb/issues/7340 reports and reproduces an assertion failure caused by a combination of the following: - atomic flush is disabled. - a column family can appear multiple times in the flush queue at the same time. This behavior was introduced in release 5.17. Consequently, it is possible that two flushes race with each other. One bg flush thread flushes all memtables. The other thread calls `FlushMemTableToOutputFile()` afterwards, and hits the assertion error below. ``` assert(cfd->imm()->NumNotFlushed() != 0); assert(cfd->imm()->IsFlushPending()); ``` Fix this by reverting the behavior. In non-atomic-flush case, a column family can appear in the flush queue at most once at the same time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7362 Test Plan: make check Also run stress test successfully for 10 times. ``` make crash_test ``` Reviewed By: ajkr Differential Revision: D25172996 Pulled By: riversand963 fbshipit-source-id: f1559b6366cc609e961e3fc83fae548f1fad08ce	2020-12-02 09:31:14 -08:00
Yanqin Jin	76ef894f9f	Add full_history_ts_low_ to FlushJob (#7655 ) Summary: https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according to a lower bound (user-defined) timestamp `full_history_ts_low_`. This PR adds a data member `full_history_ts_low_` of type `std::string` to `FlushJob`, and `full_history_ts_low_` does not change during flush. `FlushJob` will pass a pointer to this data member to the `CompactionIterator` used during flush. Also refactored flush_job_test.cc to re-use some existing code, which is actually the majority of this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7655 Test Plan: make check Reviewed By: ltamasi Differential Revision: D24933340 Pulled By: riversand963 fbshipit-source-id: 2e584bfd0cf6e5c295ab1af264e68e9d6a12fca3	2020-11-12 18:44:34 -08:00
Levi Tamasi	a7a04b6898	Integrate BlobFileBuilder into the compaction process (#7573 ) Summary: Similarly to how https://github.com/facebook/rocksdb/issues/7345 integrated blob file writing into the flush process, the patch adds support for writing blob files to the compaction logic. Namely, if `enable_blob_files` is set, large values encountered during compaction are extracted to blob files and replaced with blob indexes. The resulting blob files are then logged to the MANIFEST as part of the compaction job's `VersionEdit` and added to the `Version` alongside any table files written by the compaction. Any errors during blob file building fail the compaction job. There will be a separate follow-up patch to perform blob garbage collection during compactions. In addition, the patch continues to chip away at the mess around computing various compaction related statistics by eliminating some code duplication and by making the `num_output_files` and `bytes_written` stats more consistent for flushes, compactions, and recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7573 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D24404696 Pulled By: ltamasi fbshipit-source-id: 21216af3a172ad3ce8f85d11cd30923784ae426c	2020-10-26 13:51:55 -07:00
Zhichao Cao	d71cfe04e4	Add flush_job_test to the list of ASSERT_STATUS_CHECKED tests (#7445 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7445 Test Plan: pass ASSERT_STATUS_CHECKED=1 make -j48 flush_job_test Reviewed By: akankshamahajan15 Differential Revision: D23969372 Pulled By: zhichao-cao fbshipit-source-id: 498ff45ef84e07ec27a8f35d0874d3371412afe9	2020-09-28 14:59:02 -07:00
Levi Tamasi	b0e7834100	Integrate blob file writing with the flush logic (#7345 ) Summary: The patch adds support for writing blob files during flush by integrating `BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and `CompactionIterator`. If `enable_blob_files` is set, large values are extracted to blob files and replaced with references. The resulting blob files are then logged to the MANIFEST as part of the flush job's `VersionEdit` and added to the `Version`, similarly to table files. Errors related to writing blob files fail the flush, and any blob files written by such jobs are immediately deleted (again, similarly to how SST files are handled). In addition, the patch extends the logging and statistics around flushes to account for the presence of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is used for calculating write amplification, now considers the blob files as well). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345 Test Plan: Tested using `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D23506369 Pulled By: ltamasi fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159	2020-09-14 21:11:43 -07:00
mrambacher	7d472accdc	Bring the Configurable options together (#5753 ) Summary: This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753 Reviewed By: ajkr Differential Revision: D23385030 Pulled By: zhichao-cao fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a	2020-09-14 17:01:01 -07:00
Akanksha Mahajan	b175eceb09	Store FSWritableFilePtr object in WritableFileWriter (#7193 ) Summary: Replace FSWritableFile pointer with FSWritableFilePtr object in WritableFileWriter. This new object wraps FSWritableFile pointer. Objective: If tracing is enabled, FSWritableFile Ptr returns FSWritableFileTracingWrapper pointer that includes all necessary information in IORecord and calls underlying FileSystem and invokes IOTracer to dump that record in a binary file. If tracing is disabled then, underlying FileSystem pointer is returned directly. FSWritableFilePtr wrapper class is added to bypass the FSWritableFileWrapper when tracing is disabled. Test Plan: make check -j64 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193 Reviewed By: anand1976 Differential Revision: D23355915 Pulled By: akankshamahajan15 fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf	2020-09-08 10:56:08 -07:00
Peter Dillinger	6ac1d25fd0	Fix+clean up handling of mock sleeps (#7101 ) Summary: We have a number of tests hanging on MacOS and windows due to mishandling of code for mock sleeps. In addition, the code was in terrible shape because the same variable (addon_time_) would sometimes refer to microseconds and sometimes to seconds. One test even assumed it was nanoseconds but was written to pass anyway. This has been cleaned up so that DB tests generally use a SpecialEnv function to mock sleep, for either some number of microseconds or seconds depending on the function called. But to call one of these, the test must first call SetMockSleep (precondition enforced with assertion), which also turns sleeps in RocksDB into mock sleeps. To also removes accounting for actual clock time, call SetTimeElapseOnlySleepOnReopen, which implies SetMockSleep (on DB re-open). This latter setting only works by applying on DB re-open, otherwise havoc can ensue if Env goes back in time with DB open. More specifics: Removed some unused test classes, and updated comments on the general problem. Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead of mock time. For this we have the only modification to production code, inserting a sync point callback in flush_job.cc, which is not a change to production behavior. Removed unnecessary resetting of mock times to 0 in many tests. RocksDB deals in relative time. Any behaviors relying on absolute date/time are likely a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one clearly injecting a specific absolute time for actual testing convenience.) Just in case I misunderstood some test, I put this note in each replacement: // NOTE: Presumed unnecessary and removed: resetting mock time in env Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and FilterCompactionTimeTest in db_test.cc stats_history_test and blob_db_test are each their own beast, rather deeply dependent on MockTimeEnv. Each gets its own variant of a work-around for TimedWait in a mock time environment. (Reduces redundancy and inconsistency in stats_history_test.) Intended follow-up: Remove TimedWait from the public API of InstrumentedCondVar, and only make that accessible through Env by passing in an InstrumentedCondVar and a deadline. Then the Env implementations mocking time can fix this problem without using sync points. (Test infrastructure using sync points interferes with individual tests' control over sync points.) With that change, we can simplify/consolidate the scattered work-arounds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101 Test Plan: make check on Linux and MacOS Reviewed By: zhichao-cao Differential Revision: D23032815 Pulled By: pdillinger fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b	2020-08-11 12:41:30 -07:00
Zitan Chen	94d04529de	Store DB identity and DB session ID in SST files (#6983 ) Summary: `db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file. The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`. In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. A table property test is added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983 Test Plan: make check and some manual tests. Reviewed By: zhichao-cao Differential Revision: D22048826 Pulled By: gg814 fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9	2020-06-17 10:57:40 -07:00
sdong	80979f81c7	Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615 ) Summary: These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615 Test Plan: Add a unit test to make sure that SetOptions() can change the options. Reviewed By: riversand963 Differential Revision: D20755951 fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff	2020-03-31 12:11:42 -07:00
Yanqin Jin	18cf0de640	Use flush time for the props.creation_time for FIFO compaction (#6612 ) Summary: For FIFO compaction, we use flush time instead of oldest key time as the creation time. This is to prevent FIFO compaction dropping files whose oldest key time is older than TTL but which has newer keys than TTL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612 Test Plan: make check Reviewed By: siying Differential Revision: D20748217 Pulled By: riversand963 fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663	2020-03-30 18:59:17 -07:00
Zhichao Cao	4246888101	Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487 ) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0	2020-03-27 16:04:43 -07:00
Zhichao Cao	8d73137ae8	Replace Directory with FSDirectory in DB (#6468 ) Summary: In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1	2020-03-02 16:16:26 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
Sagar Vemuri	4f6c86226c	Use the same oldest ancestor time in table properties and manifest Summary: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions passed 96 / 100 times. ``` With the fix: all runs (tried 100, 1000, 10000) succeed. ``` $ TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=1000 [1000/1000] DBCompactionTest.LevelTtlCascadingCompactions (1895 ms) ``` Test Plan: Build: ``` COMPILE_WITH_TSAN=1 make db_compaction_test -j100 ``` Without the fix: a few runs out of 100 fail: ``` $ TEST_TMPDIR=/dev/shm KEEP_DB=1 ~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.LevelTtlCascadingCompactions --repeat=100 ... ... Note: Google Test filter = DBCompactionTest.LevelTtlCascadingCompactions [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBCompactionTest [ RUN ] DBCompactionTest.LevelTtlCascadingCompactions db/db_compaction_test.cc:3687: Failure Expected equality of these values: oldest_time Which is: 1580155869 level_to_files[6][0].oldest_ancester_time Which is: 1580155870 DB is still at /dev/shm//db_compaction_test_6337001442947696266 [ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions (1432 ms) [----------] 1 test from DBCompactionTest (1432 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (1433 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] DBCompactionTest.LevelTtlCascadingCompactions 1 FAILED TEST [80/100] DBCompactionTest.LevelTtlCascadingCompactions returned/aborted with exit code 1 (1489 ms) [100/100] DBCompactionTest.LevelTtlCascadingCompactions (1522 ms) FAILED TESTS (4/100): 1419 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/90) 1434 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/84) 1457 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/82) 1489 ms: ./db_compaction_test DBCompactionTest.LevelTtlCascadingCompactions (try https://github.com/facebook/rocksdb/issues/74) Differential Revision: D19587040 Pulled By: sagar0 fbshipit-source-id: 11191ae9940837643bff47ebe18b299b4be3d950	2020-01-27 19:58:53 -08:00
Maysam Yabandeh	28e5a9a9fb	Increase max_log_size in FlushJob to 1024 bytes (#6258 ) Summary: When measure_io_stats_ is enabled, the volume of logging is beyond the default limit of 512 size. The patch allows the EventLoggerStream to change the limit, and also sets it to 1024 for FlushJob. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6258 Differential Revision: D19279269 Pulled By: maysamyabandeh fbshipit-source-id: 3fb5d468dad488f289ac99d713378177eb7504d6	2020-01-06 10:16:52 -08:00
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	2019-12-13 14:48:41 -08:00
sdong	aa1857e2df	Support options.max_open_files = -1 with periodic_compaction_seconds (#6090 ) Summary: options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090 Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2. Differential Revision: D18702268 fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb	2019-11-26 21:39:56 -08:00
sdong	d8c28e692a	Support options.ttl with options.max_open_files = -1 (#6060 ) Summary: Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead. Note that, this change will break forward compatibility for release 5.1 and older. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060 Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0. Differential Revision: D18631623 fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830	2019-11-22 21:23:00 -08:00
Levi Tamasi	a44670e71b	Use aggregate initialization for FlushJobInfo/CompactionJobInfo (#5997 ) Summary: FlushJobInfo and CompactionJobInfo are aggregates; we should use the aggregate initialization syntax to ensure members (specifically those of built-in types) are value-initialized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5997 Test Plan: make check Differential Revision: D18273398 Pulled By: ltamasi fbshipit-source-id: 35b1a63ad9ca01605d288329858af72fffd7f392	2019-11-01 11:46:19 -07:00
Levi Tamasi	f7e7b34ebe	Propagate SST and blob file numbers through the EventListener interface (#5962 ) Summary: This patch adds a number of new information elements to the FlushJobInfo and CompactionJobInfo structures that are passed to EventListeners via the OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks. Namely, for flushes, the file numbers of the new SST and the oldest blob file it references are propagated. For compactions, the new pieces of information are the file number, level, and the oldest blob file referenced by each compaction input and output file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962 Test Plan: Extended the EventListener unit tests with logic that checks that these information elements are correctly propagated from the corresponding FileMetaData. Differential Revision: D18095568 Pulled By: ltamasi fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56	2019-10-24 14:44:15 -07:00
Yi Wu	1f9d7c0f54	Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908 ) Summary: When there are concurrent flush job on the same CF, `OnFlushCompleted` can be called before the flush result being install to LSM. Fixing the issue by passing `FlushJobInfo` through `MemTable`, and the thread who commit the flush result can fetch the `FlushJobInfo` and fire `OnFlushCompleted` on behave of the thread actually writing the SST. Fix https://github.com/facebook/rocksdb/issues/5892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5908 Test Plan: Add new test. The test will fail without the fix. Differential Revision: D17916144 Pulled By: riversand963 fbshipit-source-id: e18df67d9533b5baee52ae3605026cdeb05cbe10	2019-10-16 10:40:23 -07:00
Levi Tamasi	5f025ea832	BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903 ) Summary: This is groundwork for adding garbage collection support to BlobDB. The patch adds logic that keeps track of the oldest blob file referred to by each SST file. The oldest blob file is identified during flush/ compaction (similarly to how the range of keys covered by the SST is identified), and persisted in the manifest as a custom field of the new file edit record. Blob indexes with TTL are ignored for the purposes of identifying the oldest blob file (since such blob files are cleaned up by the TTL logic in BlobDB). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903 Test Plan: Added new unit tests; also ran db_bench in BlobDB mode, inspected the manifest using ldb, and confirmed (by scanning the SST files using sst_dump) that the value of the oldest blob file number field matches the contents of the file for each SST. Differential Revision: D17859997 Pulled By: ltamasi fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90	2019-10-14 15:21:01 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	2019-05-31 17:23:59 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo ve them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Siying Dong	545d206040	Move some file related files outside util/ (#5375 ) Summary: util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375 Differential Revision: D15550935 Pulled By: siying fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2	2019-05-29 20:47:06 -07:00
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
Sagar Vemuri	d3d20dcdca	Periodic Compactions (#5166 ) Summary: Introducing Periodic Compactions. This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold. - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF. - This works across all levels. - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used). - Compaction filters, if any, are invoked as usual. - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS). This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166 Differential Revision: D14884441 Pulled By: sagar0 fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47	2019-04-10 19:31:18 -07:00
Zhongyi Xie	a291f3a1e5	Collect compaction stats by priority and dump to info LOG (#5050 ) Summary: In order to better understand compaction done by different priority thread pool, we now collect compaction stats by priority and also print them to info LOG through stats dump. ``` Compaction Stats [default] Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Low 0/0 0.00 KB 0.0 16.8 11.3 5.5 5.6 0.1 0.0 0.0 406.4 136.1 42.24 34.96 45 0.939 13M 8865K High 0/0 0.00 KB 0.0 0.0 0.0 0.0 11.4 11.4 0.0 0.0 0.0 76.2 153.00 35.74 12185 0.013 0 0 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050 Differential Revision: D14408583 Pulled By: miasantreble fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608	2019-03-19 17:28:19 -07:00
Shobhit Dayal	b45b1cde3e	Feature for sampling and reporting compressibility (#4842 ) Summary: This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms: 1. lz4 or snappy for fast compression 2. zstd or zlib for slow but higher compression. The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType. The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842 Differential Revision: D13629011 Pulled By: shobhitdayal fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa	2019-03-18 12:15:34 -07:00
Siying Dong	5e298f865b	Add two more StatsLevel (#5027 ) Summary: Statistics cost too much CPU for some use cases. Add two stats levels so that people can choose to skip two types of expensive stats, timers and histograms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027 Differential Revision: D14252765 Pulled By: siying fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592	2019-02-28 10:27:59 -08:00
Siying Dong	26a33ee5bd	flush_job logs data size too (#4979 ) Summary: Right now when a flush is triggered, the memory consumption is logged but data size is not. It's useful to log both when we debug unexpected small flushed file size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4979 Differential Revision: D14071979 Pulled By: siying fbshipit-source-id: 0cd60449c5205eb00e0fbc299084418f609904ed	2019-02-15 16:33:19 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
Abhishek Madan	abf931afa6	Add compaction logic to RangeDelAggregatorV2 (#4758 ) Summary: RangeDelAggregatorV2 now supports ShouldDelete calls on snapshot stripes and creation of range tombstone compaction iterators. RangeDelAggregator is no longer used on any non-test code path, and will be removed in a future commit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758 Differential Revision: D13439254 Pulled By: abhimadan fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401	2018-12-17 13:20:51 -08:00
Abhishek Madan	8fe1e06ca0	Clean up FragmentedRangeTombstoneList (#4692 ) Summary: Removed `one_time_use` flag, which removed the need for some tests, and changed all `NewRangeTombstoneIterator` methods to return `FragmentedRangeTombstoneIterators`. These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones` and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692 Differential Revision: D13106570 Pulled By: abhimadan fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845	2018-11-28 15:29:02 -08:00
Yanqin Jin	e633983cf1	Add support to flush multiple CFs atomically (#4262 ) Summary: Leverage existing `FlushJob` to implement atomic flush of multiple column families. This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262 Differential Revision: D9283109 Pulled By: riversand963 fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed	2018-10-15 20:01:17 -07:00
Maysam Yabandeh	21b51dfec4	Add inline comments to flush job (#4464 ) Summary: It also renames InstallMemtableFlushResults to MaybeInstallMemtableFlushResults to clarify its contract. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4464 Differential Revision: D10224918 Pulled By: maysamyabandeh fbshipit-source-id: 04e3f2d8542002cb9f8010cb436f5152751b3cbe	2018-10-05 15:41:17 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accomodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e the database is put in read-only mode and goes back to read-write only fater certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurance of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occured c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in write-only mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recov ery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
Yanqin Jin	8959063c9c	Store the return value of Fsync for check Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4361 Differential Revision: D9803723 Pulled By: riversand963 fbshipit-source-id: 5a0d4cd3e57fd195571dcd5822895ee00547fa6a	2018-09-14 13:29:56 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
Anand Ananthabhotla	52d4c9b7f6	Allow DB resume after background errors (#3997 ) Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd	2018-06-28 12:34:40 -07:00
Siying Dong	d59549298f	Skip deleted WALs during recovery Summary: This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction) This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2. Closes https://github.com/facebook/rocksdb/pull/3765 Differential Revision: D7747618 Pulled By: siying fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729	2018-05-03 15:43:09 -07:00
Yanqin Jin	d42bd041c5	Improve visibility into the reasons for compaction. Summary: Add `compaction_reason` as part of event log for event `compaction started`. Add counters for each `CompactionReason`. Closes https://github.com/facebook/rocksdb/pull/3679 Differential Revision: D7550348 Pulled By: riversand963 fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d	2018-04-11 10:58:44 -07:00
Zhongyi Xie	1cbc96d236	FlushReason improvement Summary: Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full" > $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 $ grep 'flush_reason' /dev/shm/dbbench/LOG ... 2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"} With the fix: > 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"} Closes https://github.com/facebook/rocksdb/pull/3627 Differential Revision: D7328772 Pulled By: miasantreble fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8	2018-03-22 18:42:18 -07:00
Zhongyi Xie	945f618ba5	log flush reason for better debugging experience Summary: It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it, logs were full, write buffer was full, etc. This PR logs Flush reason whenever a flush is scheduled. Closes https://github.com/facebook/rocksdb/pull/3401 Differential Revision: D6788142 Pulled By: miasantreble fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511	2018-02-09 12:12:43 -08:00
Anand Ananthabhotla	fccc12f386	Add a histogram stat for memtable flush Summary: Add a new histogram stat called rocksdb.db.flush.micros for memtable flush Closes https://github.com/facebook/rocksdb/pull/3269 Differential Revision: D6559496 Pulled By: anand1976 fbshipit-source-id: f5c771ba2568630458751795e8c37a493ff9b14d	2017-12-15 18:57:00 -08:00
Siying Dong	def6a00740	Print out compression type of new SST files in logging Summary: Closes https://github.com/facebook/rocksdb/pull/3264 Differential Revision: D6552768 Pulled By: siying fbshipit-source-id: 6303110aff22f341d5cff41f8d2d4f138a53652d	2017-12-14 10:27:43 -08:00
Sagar Vemuri	bbef8c3884	Log GetCurrentTime failures during Flush and Compaction Summary: `GetCurrentTime()` is used to populate `creation_time` table property during flushes and compactions. It is safe to ignore `GetCurrentTime()` failures here but they should be logged. (Note that `creation_time` property was introduced as part of TTL-based FIFO compaction in #2480.) Tes Plan: `make check` Closes https://github.com/facebook/rocksdb/pull/3231 Differential Revision: D6501935 Pulled By: sagar0 fbshipit-source-id: 376adcf4ab801d3a43ec4453894b9a10909c8eb6	2017-12-06 20:56:53 -08:00
Shaohua Li	eefd75a228	Stream Summary: Add a simple policy for NVMe write time life hint Closes https://github.com/facebook/rocksdb/pull/3095 Differential Revision: D6298030 Pulled By: shligit fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d	2017-11-10 09:26:24 -08:00
Yi Wu	84a04af9a9	TableProperty::oldest_key_time defaults to 0 Summary: We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0. Also revert db_sst_test back to before #2842. Closes https://github.com/facebook/rocksdb/pull/3079 Differential Revision: D6165702 Pulled By: yiwu-arbug fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0	2017-10-27 15:00:05 -07:00
Prashant D	47166baeac	Fix coverity uninitialized fields warnings Pulled By: ajkr Differential Revision: D6170448 fbshipit-source-id: 5fd6d1608fc0df27c94d9f5059315ce7f79b8f5c	2017-10-26 21:11:50 -07:00
Yi Wu	66a2c44ef4	Add DB::Properties::kEstimateOldestKeyTime Summary: With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property. My plan is to override the property with a more accurate value with blob db, where we actually have timestamp. Closes https://github.com/facebook/rocksdb/pull/2842 Differential Revision: D5770600 Pulled By: yiwu-arbug fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a	2017-10-23 15:27:27 -07:00
Yi Wu	d1b74b0c82	WritePrepared Txn: Compaction/Flush Summary: Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator: * check if key has been committed. If not, output uncommitted keys AS-IS. * use SnapshotChecker to check if key is visible to a snapshot when in need. * do not output key with seq = 0 if the key is not committed. Closes https://github.com/facebook/rocksdb/pull/2926 Differential Revision: D5902907 Pulled By: yiwu-arbug fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320	2017-10-06 10:41:53 -07:00
Quinn Jarrell	6a541afcc4	Make bytes_per_sync and wal_bytes_per_sync mutable Summary: SUMMARY Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed. TEST PLAN ran make check all passed Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value. Closes https://github.com/facebook/rocksdb/pull/2893 Reviewed By: yiwu-arbug Differential Revision: D5845814 Pulled By: TheRushingWookie fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd	2017-09-27 17:49:45 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Andrew Kryczka	33042573db	Fix GetCurrentTime() initialization for valgrind Summary: Valgrind had false positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to only overwrite it in success case. Closes https://github.com/facebook/rocksdb/pull/2526 Differential Revision: D5358689 Pulled By: ajkr fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42	2017-07-05 12:12:00 -07:00
Sagar Vemuri	1cd45cd1b3	FIFO Compaction with TTL Summary: Introducing FIFO compactions with TTL. FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size. To address that request: - Added a new TTL option to FIFO compaction options. - Updated FIFO compaction score to take TTL into consideration. - Added a new table property, creation_time, to keep track of when the SST file is created. - Creation_time is set as below: - On Flush: Set to the time of flush. - On Compaction: Set to the max creation_time of all the files involved in the compaction. - On Repair and Recovery: Set to the time of repair/recovery. - Old files created prior to this code change will have a creation_time of 0. - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day. - FIFO compaction will fall back to the prior way of deleting files based on size if: - the creation_time of all files involved in compaction is 0. - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted. This feature is not supported if max_open_files != -1 or with table formats other than Block-based. Test Plan: Added tests. Benchmark results: Base: FIFO with max size: 100MB :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found) ``` With TTL (a low one for testing) :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20 readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found) ``` Example Log lines: ``` 2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion 2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files ... 2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK 2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40} ``` SST Files remaining in the dbbench dir, after db_bench execution completed: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst ``` Closes https://github.com/facebook/rocksdb/pull/2480 Differential Revision: D5305116 Pulled By: sagar0 fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174	2017-06-27 17:11:48 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Aaron Gao	44fa8ece9b	change use_direct_writes to use_direct_io_for_flush_and_compaction Summary: Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get(). Closes https://github.com/facebook/rocksdb/pull/2117 Differential Revision: D4860912 Pulled By: lightmark fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19	2017-04-13 16:12:04 -07:00
Yi Wu	df6f5a3772	Move memtable related files into memtable directory Summary: Move memtable related files into memtable directory. Closes https://github.com/facebook/rocksdb/pull/2087 Differential Revision: D4829242 Pulled By: yiwu-arbug fbshipit-source-id: ca70ab6	2017-04-06 14:09:13 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Siying Dong	6ef8c620d3	Move auto_roll_logger and filename out of db/ Summary: It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency. Closes https://github.com/facebook/rocksdb/pull/2080 Differential Revision: D4821141 Pulled By: siying fbshipit-source-id: ca7d768	2017-04-03 18:39:14 -07:00
Islam AbdelRahman	e19163688b	Add macros to include file name and line number during Logging Summary: current logging ``` 2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25 2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2. 2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK 2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK 2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0. 2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6 2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1 2017/03/14-14:20:31. Closes https://github.com/facebook/rocksdb/pull/1990 Differential Revision: D4708695 Pulled By: IslamAbdelRahman fbshipit-source-id: cb8968f	2017-03-15 19:39:12 -07:00
Sagar Vemuri	eb912a927e	Remove disableDataSync option Summary: Remove disableDataSync, and another similarly named disable_data_sync options. This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods. Closes https://github.com/facebook/rocksdb/pull/1859 Differential Revision: D4541292 Pulled By: sagar0 fbshipit-source-id: 5b3a6ca	2017-02-13 11:09:13 -08:00
Islam AbdelRahman	574b543f80	Rename merger.h -> merging_iterator.h Summary: merger.h was always a confusing name for me, simply give the file a better name Closes https://github.com/facebook/rocksdb/pull/1836 Differential Revision: D4505357 Pulled By: IslamAbdelRahman fbshipit-source-id: 07b28d8	2017-02-02 16:54:19 -08:00
Yi Wu	9239103cd4	Flush job should release reference current version if sync log failed Summary: Fix the bug when sync log fail, FlushJob::Run() will not be execute and reference to cfd->current() will not be release. Closes https://github.com/facebook/rocksdb/pull/1792 Differential Revision: D4441316 Pulled By: yiwu-arbug fbshipit-source-id: 5523e28	2017-01-19 23:09:15 -08:00
Mike Kolupaev	8c2b921fdf	Fixed a crash in debug build in flush_job.cc Summary: It was doing `&range_del_iters[0]` on an empty vector. Even though the resulting pointer is never dereferenced, it's still bad for two reasons: * the practical reason: it crashes with `std::out_of_range` exception in our debug build, * the "C++ standard lawyer" reason: it's undefined behavior because, in `std::vector` implementation, it probably "dereferences" a null pointer, which is invalid even though it doesn't actually read the pointed memory, just converts a pointer into a reference (and then flush_job.cc converts it back to pointer); nullptr references are undefined behavior. Closes https://github.com/facebook/rocksdb/pull/1612 Differential Revision: D4265625 Pulled By: al13n321 fbshipit-source-id: db26fb9	2016-12-09 10:39:12 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Andrew Kryczka	40a2e406f8	DeleteRange flush support Summary: Changed BuildTable() (used for flush) to (1) add range tombstones to the aggregator, which is used by CompactionIterator to determine which keys can be removed; and (2) add aggregator's range tombstones to the table that is output for the flush. Closes https://github.com/facebook/rocksdb/pull/1438 Differential Revision: D4100025 Pulled By: ajkr fbshipit-source-id: cb01a70	2016-10-31 20:54:18 -07:00
Yi Wu	9ed928e7a9	Split DBOptions into ImmutableDBOptions and MutableDBOptions Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs. Test Plan: make all check Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64065	2016-09-23 16:34:04 -07:00
Yi Wu	0a88f38b7e	Remove ColumnFamilyData::options() Summary: One more small refactor before I split DBOptions into mutable and immutable parts. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64047	2016-09-16 15:09:14 -07:00
sdong	d5a51d4de3	Need to make sure log file synced before flushing memtable of one column family Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost. Test Plan: Add a new unit test which will fail without the diff. Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu Reviewed By: yiwu Subscribers: yiwu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60915	2016-07-21 16:29:06 -07:00
Yi Wu	296545a2c7	Fix clang analyzer errors Summary: Fixing erros reported by clang static analyzer. * Removing some unused variables. * Adding assertions to fix false positives reported by clang analyzer. * Adding `__clang_analyzer__` macro to suppress false positive warnings. Test Plan: USE_CLANG=1 OPT=-g make analyze -j64 Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60549	2016-07-08 17:50:51 -07:00
Mike Kolupaev	936973d145	Small tweaks to logging to track the number of immutable memtables Summary: We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds: - logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables, - logging of number of remaining immutable memtables after a flush. Test Plan: ran tests Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58833	2016-06-01 11:11:33 -07:00
Aaron Gao	43afd72bee	[rocksdb] make more options dynamic Summary: make more ColumnFamilyOptions dynamic: - compression - soft_pending_compaction_bytes_limit - hard_pending_compaction_bytes_limit - min_partial_merge_operands - report_bg_io_stats - paranoid_file_checks Test Plan: Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. All passed. Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57519	2016-05-17 13:11:56 -07:00
Yi Wu	a92049e3e7	Added EventListener::OnTableFileCreationStarted() callback Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status. Test Plan: unit test. Reviewers: dhruba, yhchiang, ott, sdong Reviewed By: sdong Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D56337	2016-04-29 11:35:00 -07:00
Yueh-Hsuan Chiang	24110ce90c	Correct Statistics FLUSH_WRITE_BYTES Summary: In https://reviews.facebook.net/D56271, we fixed an issue where we consider flush as compaction. However, that makes us mistakenly count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.) This patch removes the one incremented in db_impl. Test Plan: db_test Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57111	2016-04-25 12:01:01 -07:00
sdong	4b6833aec1	Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too. Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it. Test Plan: Try db_bench and see the stats are printed out. Reviewers: yhchiang Reviewed By: yhchiang Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56769	2016-04-15 10:22:18 -07:00
Yueh-Hsuan Chiang	ae21d71e94	Fixed a bug in RocksDB Statistics where flush is considered as compaction Summary: Fixed a bug in RocksDB Statistics where flush is considered as compaction Test Plan: unit test Reviewers: sdong, IslamAbdelRahman, rven, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56271	2016-04-11 19:59:25 -07:00
Andrew Kryczka	2391ef7214	Embed column family name in SST file Summary: Added the column family name to the properties block. This property is omitted only if the property is unavailable, such as when RepairDB() writes SST files. In a next diff, I will change RepairDB to use this new property for deciding to which column family an existing SST file belongs. If this property is missing, it will add it to the "unknown" column family (same as its existing behavior). Test Plan: New unit test: $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55605	2016-04-06 23:10:32 -07:00
Marton Trencseni	9b51987521	Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. Test Plan: 'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK. I didn't run the Java tests, I don't have Java set up on my devserver. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56133	2016-04-01 10:42:39 -07:00
sdong	b1fafcaca6	Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature." This reverts commit `522de4f59e`. It has bug of index block cleaning up.	2016-03-21 11:50:42 -07:00
Marton Trencseni	522de4f59e	Adding pin_l0_filter_and_index_blocks_in_cache feature. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction. Test Plan: Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false). DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32 Mac: OK. Linux: with D55287 patched in it's OK. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54801	2016-03-17 22:40:01 +00:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Andrew Kryczka	acd7d58695	[directory includes cleanup] Remove util->db dependency for ThreadStatusUtil Summary: We can avoid the dependency by forward-declaring ColumnFamilyData and then treating it as a black box. That means callers of ThreadStatusUtil need to explicitly provide more options, even if they can be derived from the ColumnFamilyData, since ThreadStatusUtil doesn't include the definition. This is part of a series of diffs to eliminate circular dependencies between directories (e.g., db/* files depending on util/* files and vice-versa). Test Plan: $ ./db_test --gtest_filter=DBTest.GetThreadStatus $ make -j32 commit-prereq Reviewers: sdong, yhchiang, IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53361	2016-01-26 10:49:16 -08:00
Venkatesh Radhakrishnan	b7ecf3d214	Fix intermittent hang in ColumnFamilyTest.FlushAndDropRaceCondition Summary: ColumnFamilyTest.FlushAndDropRaceCondition sometimes hangs because the sync point, "FlushJob::InstallResults", sleeps holding the DB mutex. Fixing it by releasing the mutex before sleeping. Test Plan: seq 1000 \|parallel --gnu --eta 't=/dev/shm/rdb-{}; rm -rf $t; mkdir $t && export TEST_TMPDIR=$t; ./column_family_test -gtest_filter=FlushAndDropRaceCondition > $t/log-{}' Reviewers: IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53349	2016-01-26 09:12:20 -08:00
agiardullo	3bfd3d39a3	Use SST files for Transaction conflict detection Summary: Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings. With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged). Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files. Test Plan: unit tests, db bench Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb, yoshinorim Differential Revision: https://reviews.facebook.net/D50475	2015-12-11 12:34:11 -08:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Alexey Maykov	3d07b815f6	Passing table properties to compaction callback Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online. Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats. Reviewers: igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48267	2015-10-09 18:10:55 -07:00
sdong	776bd8d5eb	Pass column family ID to table property collector Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently. Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct. Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48411	2015-10-09 14:36:51 -07:00
Yueh-Hsuan Chiang	63e0f86797	Fixed a bug which causes rocksdb.flush.write.bytes stat is always zero Summary: Fixed a bug which causes rocksdb.flush.write.bytes stat is always zero Test Plan: augment existing db_test Reviewers: sdong, anthony, IslamAbdelRahman, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47595	2015-09-25 13:34:49 -07:00
Igor Canadi	a7e80379b0	LogAndApply() should fail if the column family has been dropped Summary: This patch finally fixes the ColumnFamilyTest.ReadDroppedColumnFamily test. The test has been failing very sporadically and it was hard to repro. However, I managed to write a new tests that reproes the failure deterministically. Here's what happens: 1. We start the flush for the column family 2. We check if the column family was dropped here: `a3fc49bfdd/db/flush_job.cc (L149)` 3. This check goes through, ends up in InstallMemtableFlushResults() and it goes into LogAndApply() 4. At about this time, we start dropping the column family. Dropping the column family process gets to LogAndApply() at about the same time as LogAndApply() from flush process 5. Drop column family goes through LogAndApply() first, marking the column family as dropped. 6. Flush process gets woken up and gets a chance to write to the MANIFEST. However, this is where it gets stuck: `a3fc49bfdd/db/version_set.cc (L1975)` 7. We see that the column family was dropped, so there is no need to write to the MANIFEST. We return OK. 8. Flush gets OK back from LogAndApply() and it deletes the memtable, thinking that the data is now safely persisted to sst file. The fix is pretty simple. Instead of OK, we return ShutdownInProgress. This is not really true, but we have been using this status code to also mean "this operation was canceled because the column family has been dropped". The fix is only one LOC. All other code is related to tests. I added a new test that reproes the failure. I also moved SleepingBackgroundTask to util/testutil.h (because I needed it in column_family_test for my new test). There's plenty of other places where we reimplement SleepingBackgroundTask, but I'll address that in a separate commit. Test Plan: 1. new test 2. make check 3. Make sure the ColumnFamilyTest.ReadDroppedColumnFamily doesn't fail on Travis: https://travis-ci.org/facebook/rocksdb/jobs/79952386 Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46773	2015-09-15 11:28:44 -07:00

1 2 3 4

182 commits