mirror of https://github.com/facebook/rocksdb.git
167 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Levi Tamasi | 51d7e6a49e |
Clean up WriteBatchWithIndexInternal a bit (#11930)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11930 The patch cleans up and refactors the logic in/around `WriteBatchWithIndexInternal` a bit as groundwork for further changes. Specifically, the class is turned back into a stateless collection of static helpers (which is the way it was before PR 6851). Note that there were two apparent reasons for introducing this instance state in PR 6851: a) encapsulating `MergeContext` and b) resolving objects like `Logger` and `Statistics` based on a variety of handles. However, neither reason seems justified at this point. Regarding a), the `MultiGetFromBatchAndDB` logic passes in its own `MergeContext` objects via a second set of methods that do not use the member `MergeContext`. As for b), `Logger` and friends are only needed for Merge, which is only supported if a column family handle is provided; in turn, the column family handle enables us to resolve all the necessary objects without the need for any other handles like `DB` or `DBOptions`. In addition to the above, the patch changes the type of `BaseDeltaIterator::merge_result_` to `std::string` from `PinnableSlice` (since no pinning is ever done) and makes some other small code quality improvements. Reviewed By: jaykorean Differential Revision: D50038302 fbshipit-source-id: 5f34abe2e808bdaea0f3a8033b5764ebd446b85d |
|
Yu Zhang | 39a4ff2cab |
Track full_history_ts_low per SuperVersion (#11784)
Summary: As discussed in https://github.com/facebook/rocksdb/issues/11730 , this PR tracks the effective `full_history_ts_low` per SuperVersion and update existing sanity checks for `ReadOptions.timestamp >= full_history_ts_low` to use this per SuperVersion `full_history_ts_low` instead. This also means the check is moved to happen after acquiring SuperVersion. There are two motivations for this: 1) Each time `full_history_ts_low` really come into effect to collapse history, a new SuperVersion is always installed, because it would involve either a Flush or Compaction, both of which change the LSM tree shape. We can take advantage of this to ensure that as long as this sanity check is passed, even if `full_history_ts_low` can be concurrently increased and collapse some history above the requested `ReadOptions.timestamp`, a read request won’t have visibility to that part of history through this SuperVersion that it already acquired. 2) the existing sanity check uses `ColumnFamilyData::GetFullHistoryTsLow` without locking the db mutex, which is the mutex all `IncreaseFullHistoryTsLow` operation is using when mutating this field. So there is a race condition. This also solve the race condition on the read path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11784 Test Plan: `make all check` // Checks success scenario really provide the read consistency attribute as mentioned above. `./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckPassReadIsConsistent*` // Checks failure scenario cleans up SuperVersion properly. `./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*` `./db_secondary_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*` `./db_readonly_with_timestamp_test --gtest_filter=*FullHistoryTsLowSanitchCheckFail*` Reviewed By: ltamasi Differential Revision: D48894795 Pulled By: jowlyzhang fbshipit-source-id: 1f801fe8e1bc8e63ca76c03cbdbd0974e5ff5bf6 |
|
Yu Zhang | 9c2ebcc2c3 |
Log user_defined_timestamps_persisted flag in event logger (#11683)
Summary: As titled, and also removed an undefined and unused member function in for ColumnFamilyData Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683 Reviewed By: ajkr Differential Revision: D48156290 Pulled By: jowlyzhang fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7 |
|
Yu Zhang | 4ea7b796b7 |
Respect cutoff timestamp during flush (#11599)
Summary: Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval. This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`. For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`. Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it. Tet plan: ``` ./column_family_test --gtest_filter="*RetainUDT*" ./memtable_list_test --gtest_filter="*WithTimestamp*" ./flush_job_test --gtest_filter="*WithTimestamp*" ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599 Reviewed By: ajkr Differential Revision: D47561586 Pulled By: jowlyzhang fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89 |
|
Yu Zhang | 4dafa5b220 |
switch to use RocksDB UnorderedMap (#11507)
Summary: Switch from std::unordered_map to RocksDB UnorderedMap for all the places that logging user-defined timestamp size in WAL used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11507 Test Plan: ``` make all check ``` Reviewed By: ltamasi Differential Revision: D46448975 Pulled By: jowlyzhang fbshipit-source-id: bdb4d56a723b697a33daaf0f856a61d49a367a99 |
|
Yu Zhang | 56ca9e3106 |
Logging timestamp size record in WAL and use it during recovery (#11471)
Summary: Start logging the timestamp size record in WAL and use the record during recovery. Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471 Test Plan: ``` make all check ./db_secondary_test ./db_wal_test --gtest_filter="*WithTimestamp*" ./repair_test --gtest_filter="*WithTimestamp*" ``` Reviewed By: ltamasi Differential Revision: D46236769 Pulled By: jowlyzhang fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0 |
|
Hui Xiao | cb58477185 |
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300)
Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b |
|
sdong | 4720ba4391 |
Remove RocksDB LITE (#11147)
Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2 |
|
Hui Xiao | 86fa2592be |
Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job owns flush_reason instead of CFD (#11111)
Summary: **Context:** Concurrent flushes on the same CF can set on `ColumnFamilyData::flush_reason` before each other flush finishes. An symptom is one CF has different flush_reason with others though all of them are in an atomic flush `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed. ` **Summary:** Suggested by ltamasi, we now refactor and let FlushRequest/Job to own flush_reason as there is no good way to define `ColumnFamilyData::flush_reason` in face of concurrent flushes on the same CF (which wasn't the case a long time ago when `ColumnFamilyData::flush_reason ` first introduced`) **Tets:** - new unit test - make check - aggressive crash test rehearsal Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111 Reviewed By: ajkr Differential Revision: D42644600 Pulled By: hx235 fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf |
|
Hui Xiao | 98d5db5c2e |
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run
|
|
Andrew Kryczka | 5cf6ab6f31 |
Ran clang-format on db/ directory (#10910)
Summary: Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910 Reviewed By: riversand963 Differential Revision: D40880683 Pulled By: ajkr fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174 |
|
Jay Zhuang | a3acf2ef87 |
Add seqno to time mapping (#10338)
Summary: Which will be used for tiered storage to preclude hot data from compacting to the cold tier (the last level). Internally, adding seqno to time mapping. A periodic_task is scheduled to record the current_seqno -> current_time in certain cadence. When memtable flush, the mapping informaiton is stored in sstable property. During compaction, the mapping information are merged and get the approximate time of sequence number, which is used to determine if a key is recently inserted or not and preclude it from the last level if it's recently inserted (within the `preclude_last_level_data_seconds`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338 Test Plan: CI Reviewed By: siying Differential Revision: D37810187 Pulled By: jay-zhuang fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f |
|
Baptiste Lemaire | 5879053fd0 |
Dynamically changeable `MemPurge` option (#10011)
Summary: **Summary** Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled. **Motivation** RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible. Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement. **Content of this PR** This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes. **Benchmarking** I will add numbers to prove that there is no performance impact within the next 12 hours. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011 Reviewed By: pdillinger Differential Revision: D36462357 Pulled By: bjlemaire fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802 |
|
Bo Wang | c073ed7601 |
Fix typo in comments and code (#10233)
Summary: Fix typo in comments and code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233 Test Plan: Existing unit tests should pass. Reviewed By: jay-zhuang, anand1976 Differential Revision: D37356702 Pulled By: gitbw95 fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51 |
|
Gang Liao | deff48bcef |
Add blob source to retrieve blobs in RocksDB (#10198)
Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost. Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel! This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198 Reviewed By: ltamasi Differential Revision: D37294735 Pulled By: gangliao fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e |
|
Hui Xiao | d665afdbf3 |
Account memory of FileMetaData in global memory limit (#9924)
Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625 |
|
Peter Dillinger | efd035164b |
Meta-internal folly integration with F14FastMap (#9546)
Summary: Especially after updating to C++17, I don't see a compelling case for *requiring* any folly components in RocksDB. I was able to purge the existing hard dependencies, and it can be quite difficult to strip out non-trivial components from folly for use in RocksDB. (The prospect of doing that on F14 has changed my mind on the best approach here.) But this change creates an optional integration where we can plug in components from folly at compile time, starting here with F14FastMap to replace std::unordered_map when possible (probably no public APIs for example). I have replaced the biggest CPU users of std::unordered_map with compile-time pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set. USE_FOLLY is always set in the Meta-internal buck build, and a simulation of that is in the Makefile for public CI testing. A full folly build is not needed, but checking out the full folly repo is much simpler for getting the dependency, and anything else we might want to optionally integrate in the future. Some picky details: * I don't think the distributed mutex stuff is actually used, so it was easy to remove. * I implemented an alternative to `folly::constexpr_log2` (which is much easier in C++17 than C++11) so that I could pull out the hard dependencies on `ConstexprMath.h` * I had to add noexcept move constructors/operators to some types to make F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a macro to make that easier in some common cases. * Updated Meta-internal buck build to use folly F14Map (always) No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a production integration for open source users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546 Test Plan: CircleCI tests updated so that a couple of them use folly. Most internal unit & stress/crash tests updated to use Meta-internal latest folly. (Note: they should probably use buck but they currently use Makefile.) Example performance improvement: when filter partitions are pinned in cache, they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build a test that exercises that heavily. Build DB with ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters ``` and test with (simultaneous runs with & without folly, ~20 times each to see convergence) ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache ``` Average ops/s no folly: 26229.2 Average ops/s with folly: 26853.3 (+2.4%) Reviewed By: ajkr Differential Revision: D34181736 Pulled By: pdillinger fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94 |
|
Peter Dillinger | cad809978a |
Fix heap use-after-free race with DropColumnFamily (#9730)
Summary: Although ColumnFamilySet comments say that DB mutex can be freed during iteration, as long as you hold a ref while releasing DB mutex, this is not quite true because UnrefAndTryDelete might delete cfd right before it is needed to get ->next_ for the next iteration of the loop. This change solves the problem by making a wrapper class that makes such iteration easier while handling the tricky details of UnrefAndTryDelete on the previous cfd only after getting next_ in operator++. FreeDeadColumnFamilies should already have been obsolete; this removes it for good. Similarly, ColumnFamilySet::iterator doesn't need to check for cfd with 0 refs, because those are immediately deleted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9730 Test Plan: was reported with ASAN on unit tests like DBLogicalBlockSizeCacheTest.CreateColumnFamily (very rare); keep watching Reviewed By: ltamasi Differential Revision: D35038143 Pulled By: pdillinger fbshipit-source-id: 0a5478d5be96c135343a00603711b7df43ae19c9 |
|
slk | 95305c44a1 |
Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410)
Summary: As disscussed in (https://github.com/facebook/rocksdb/issues/9223), Here added a new API named DB::OpenAndTrimHistory, this API will open DB and trim data to the timestamp specofied by **trim_ts** (The data with newer timestamp than specified trim bound will be removed). This API should only be used at a timestamp-enabled db instance recovery. And this PR implemented a new iterator named HistoryTrimmingIterator to support trimming history with a new API named DB::OpenAndTrimHistory. HistoryTrimmingIterator wrapped around the underlying InternalITerator such that keys whose timestamps newer than **trim_ts** should not be returned to the compaction iterator while **trim_ts** is not null. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9410 Reviewed By: ltamasi Differential Revision: D34410207 Pulled By: riversand963 fbshipit-source-id: e54049dc234eccd673244c566b15df58df5a6236 |
|
Zhiyi Zhang | 0cb0fc6fd3 |
Add DB properties for BlobDB (#8734)
Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346 |
|
Levi Tamasi | 3f7e929865 |
Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605)
Summary: The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk unlocks the DB mutex before destroying the `ThreadLocalPtr` holding the per-thread `SuperVersion` pointers when the only remaining reference is the back reference from `super_version_`. The idea behind this was to break the circular dependency between `ColumnFamilyData` and `SuperVersion`: when the penultimate reference goes away, `ColumnFamilyData` can clean up the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there is a `SuperVersion` and it is not referenced by anything else.) However, unlocking the mutex throws a wrench in this plan by making it possible for another thread to jump in and take another reference to the `ColumnFamilyData`, keeping the object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like https://github.com/facebook/rocksdb/issues/8440 , https://github.com/facebook/rocksdb/issues/8382 , and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet` destructor we sometimes observe during close in our stress tests. Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call `SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone. https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced the circular dependency and associated cleanup logic described above (in order to enable iterators to keep the `ColumnFamilyData` for dropped column families alive), and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`. Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete` in `SuperVersion::Cleanup`. The patch simply eliminates the unlocking and relocking, which has been unnecessary ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free. This closes the window during which another thread could increase the reference count, and hopefully fixes the issues above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605 Test Plan: Ran `make check` and stress tests locally. Reviewed By: pdillinger Differential Revision: D30051035 Pulled By: ltamasi fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528 |
|
Peter Dillinger | 74b7c0d249 |
Fix use-after-free on implicit temporary FileOptions (#8571)
Summary: FileOptions has an implicit conversion from EnvOptions and some internal APIs take `const FileOptions&` and save the reference, which is counter to Google C++ guidelines, > Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements. This is at least a problem for repair.cc, which passes an EnvOptions to TableCache(), which would save a reference to the temporary copy as FileOptions. This was unfortunately only caught as a side effect of changes in https://github.com/facebook/rocksdb/issues/8544. This change fixes the repair.cc case and updates the involved internal APIs that save a reference to use `const FileOptions*` instead. Unfortunately, I don't know how to get any of our sanitizers to reliably report bugs like this, so I can't rule out more existing in our codebase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571 Test Plan: Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on AWS EC2) Reviewed By: ajkr Differential Revision: D29943890 Pulled By: pdillinger fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9 |
|
Baptiste Lemaire | c521a9ab2b |
Retire superfluous functions introduced in earlier mempurge PRs. (#8558)
Summary: The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge. By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558 Test Plan: Already included in `db_flush_test.cc`. Reviewed By: anand1976 Differential Revision: D29764351 Pulled By: bjlemaire fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437 |
|
Baptiste Lemaire | 206845c057 |
Mempurge support for wal (#8528)
Summary: In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables. The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables. This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528 Reviewed By: pdillinger Differential Revision: D29701097 Pulled By: bjlemaire fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9 |
|
Baptiste Lemaire | 837705ad80 |
Make mempurge a background process (equivalent to in-memory compaction). (#8505)
Summary: In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype. In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...). Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge. MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`. The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505 Reviewed By: pdillinger Differential Revision: D29619283 Pulled By: bjlemaire fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95 |
|
Baptiste Lemaire | 9dc887ece0 |
Memtable "MemPurge" prototype (#8454)
Summary: Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage. The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time . Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested). One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454 Reviewed By: anand1976 Differential Revision: D29433971 Pulled By: bjlemaire fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5 |
|
Zhichao Cao | f44e69c64a |
Use DbSessionId as cache key prefix when secondary cache is enabled (#8360)
Summary: Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360 Test Plan: add the test to lru_cache_test Reviewed By: pdillinger Differential Revision: D29006215 Pulled By: zhichao-cao fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814 |
|
Levi Tamasi | d83542ca83 |
Make it possible to apply only a subrange of table property collectors (#8298)
Summary: This patch does two things: 1) Introduces some aliases in order to eliminate/prevent long-winded type names w/r/t the internal table property collectors (see e.g. `std::vector<std::unique_ptr<IntTblPropCollectorFactory>>`). 2) Makes it possible to apply only a subrange of table property collectors during table building by turning `TableBuilderOptions::int_tbl_prop_collector_factories` from a pointer to a `vector` into a range (i.e. a pair of iterators). Rationale: I plan to introduce a BlobDB related table property collector, which should only be applied during table creation if blob storage is enabled at the moment (which can be changed dynamically). This change will make it possible to include/ exclude the BlobDB related collector as needed without having to introduce a second `vector` of collectors in `ColumnFamilyData` with pretty much the same contents. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8298 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D28430910 Pulled By: ltamasi fbshipit-source-id: a81d28f2c59495865300f43deb2257d2e6977c8e |
|
mrambacher | 8948dc8524 |
Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262)
Summary: The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form). Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes. Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262 Reviewed By: pdillinger Differential Revision: D28226540 Pulled By: mrambacher fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf |
|
storagezhang | 711881bc25 |
Fix some typos in comments (#8066)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8066 Reviewed By: jay-zhuang Differential Revision: D27280799 Pulled By: mrambacher fbshipit-source-id: 68f91f5af4ffe0a84be581961bf9366887f47702 |
|
fanrui03 | 67d72fb5dc |
Fix checkpoint stuck (#7921)
Summary: ## 1. Bug description: When RocksDB Checkpoint, it may be stuck in `WaitUntilFlushWouldNotStallWrites` method. ## 2. Simple analysis of the reasons: ### 2.1 Configuration parameters: ```yaml Compaction Style : Universal max_write_buffer_number : 4 min_write_buffer_number_to_merge : 3 ``` Checkpoint is usually very fast. When the Checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 Immutable MemTables, which are less than `min_write_buffer_number_to_merge`, they will not be flushed. But will enter this code. ```c++ // method: GetWriteStallConditionAndCause if (mutable_cf_options.max_write_buffer_number> 3 && num_unflushed_memtables >= mutable_cf_options.max_write_buffer_number-1) { return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit}; } ``` code link: |
|
Cheng Chang | fbce7a3808 |
Track WAL obsoletion when updating empty CF's log number (#7781)
Summary: In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk. For example, consider there are 3 column families: 0, 1, 2: 1. initially, all the column families' log number is 1; 2. write some data to cf0, and flush cf0, but the flush is pending; 3. now a new WAL 2 is created; 4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number); 5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk; 6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2; 7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3; 8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, and the `MinLogNumberToKeep` is 1, the second is for flushing cf0, and the `MinLogNumberToKeep` is 2. So without this PR, if the DB crashes at this point and try to recover, `WalSet` will still expect WAL 2 to exist. When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption. This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781 Test Plan: watch existing tests and stress tests to pass. `make -j48 blackbox_crash_test` on devserver Reviewed By: ltamasi Differential Revision: D25631695 Pulled By: cheng-chang fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec |
|
Peter Dillinger | b1ee191405 |
Fix memory leak for ColumnFamily drop with live iterator (#7749)
Summary: Uncommon bug seen by ASAN with ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily, if the last two references to a ColumnFamilyData are both SuperVersions (during InstallSuperVersion). The fix is to use UnrefAndTryDelete even in SuperVersion::Cleanup but with a parameter to avoid re-entering Cleanup on the same SuperVersion being cleaned up. ColumnFamilyData::Unref is considered unsafe so removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7749 Test Plan: ./column_family_test --gtest_filter=*LiveIter* --gtest_repeat=100 Reviewed By: jay-zhuang Differential Revision: D25354304 Pulled By: pdillinger fbshipit-source-id: e78f3a3f67c40013b8432f31d0da8bec55c5321c |
|
Yanqin Jin | eee0af9af1 |
Add full_history_ts_low to column family (#7740)
Summary: Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`. `ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`. `ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only increase its value. Consider the following case: > > The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered > and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards, > and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result > is hard to reason. For simplicity, we just do now allow decreasing full_history_ts_low for now. > During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the column family. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740 Test Plan: make check Reviewed By: ltamasi Differential Revision: D25296217 Pulled By: riversand963 fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a |
|
Ramkumar Vadivelu | 9a690a74e1 |
In ParseInternalKey(), include corrupt key info in Status (#7515)
Summary: Fixes Issue https://github.com/facebook/rocksdb/issues/7497 When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()` Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later. Tests: - make check - some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test - some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test Example of new status returns: - Key too small - `Corrupted Key: Internal Key too small. Size=5` - Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3` - Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515 Reviewed By: ajkr Differential Revision: D24240264 Pulled By: ramvadiv fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7 |
|
Levi Tamasi | e8cb32ed67 |
Introduce BlobFileCache and add support for blob files to Get() (#7540)
Summary: The patch adds blob file support to the `Get` API by extending `Version` so that whenever a blob reference is read from a file, the blob is retrieved from the corresponding blob file and passed back to the caller. (This is assuming the blob reference is valid and the blob file is actually part of the given `Version`.) It also introduces a cache of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so `max_open_files` (if specified) limits the total number of open (table + blob) files. TODO: proactively open/cache blob files and pin the cache handles of the readers in the metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for table files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D24260219 Pulled By: ltamasi fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba |
|
Akanksha Mahajan | 8e0df9050c |
Store FSRandomAccessPtr object in RandomAccessFileReader (#7192)
Summary: Replace FSRandomAccessFile pointer with FSRandomAccessFilePtr object in RandomAccessFileReader. This new object wraps FSRandomAccessFile pointer. Objective: If tracing is enabled, FSRandomAccessFile Ptr returns FSRandomAccessFileTracingWrapper pointer that includes all necessary information in IORecord and calls underlying FileSystem and invokes IOTracer to dump that record in a binary file. If tracing is disabled then, underlying FileSystem pointer is returned directly. FSRandomAccessFilePtr wrapper class is added to bypass the FSRandomAccessFileWrapper when tracing is disabled. Test Plan: make check -j64 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7192 Reviewed By: anand1976 Differential Revision: D23356867 Pulled By: akankshamahajan15 fbshipit-source-id: 48f31168166a17a7444b40be44a9a9d4a5c7182c |
|
Jay Zhuang | b0c5ecd6b3 |
Make max_subcompactions dynamically changeable (#7159)
Summary: Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159 Reviewed By: siying Differential Revision: D22671238 Pulled By: jay-zhuang fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42 |
|
Yanqin Jin | fb09ef05dc |
Attempt to recover from db with missing table files (#6334)
Summary: There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status. This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version. `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed. To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush. Test plan (on devserver): ``` $make check $COMPILE_WITH_ASAN=1 make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334 Reviewed By: anand1976 Differential Revision: D19778960 Pulled By: riversand963 fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8 |
|
Cheng Chang | 2d9efc9ab2 |
Cache result of GetLogicalBufferSize in Linux (#6457)
Summary: In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down. This PR introduces two new APIs: 1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes. 2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes. Other modifications: 1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms. 2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`. 3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457 Test Plan: 1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`. 2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`. `make check` Differential Revision: D20131243 Pulled By: cheng-chang fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04 |
|
Zhichao Cao | 8d73137ae8 |
Replace Directory with FSDirectory in DB (#6468)
Summary: In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1 |
|
sdong | fdf882ded2 |
Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e |
|
sdong | 36c504be17 |
Avoid create directory for every column families (#6358)
Summary: A relatively recent regression causes for every CF, create and open directory is called for the DB directory, unless CF has a private directory. This doesn't scale well with large number of column families. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358 Test Plan: Run all existing tests and see it pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families. Differential Revision: D19675141 fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be |
|
解轶伦 | 39fcaf8246 |
delete superversions in BackgroundCallPurge (#6146)
Summary: I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB). Then I found "delete sv" in ~SuperVersion() takes the time. The backtrace looks like this DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() -> DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion() I think it's better to delete in a background thread, please review it。 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146 Differential Revision: D18972066 fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c |
|
anand76 | afa2420c2b |
Introduce a new storage specific Env API (#5761)
Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f |
|
Jermy Li | c2029f9716 |
Support concurrent CF iteration and drop (#6147)
Summary: It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI. This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early. fixed https://github.com/facebook/rocksdb/issues/5982 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147 Differential Revision: D18926378 fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15 |
|
sdong | 4c74dba5fa |
Bump up memory order of ref counting of ColumnFamilyData (#5723)
Summary: We see this TSAN warning: WARNING: ThreadSanitizer: data race (pid=282806) Write of size 8 at 0x7b6c00000e38 by thread T16 (mutexes: write M1023578822185846136): #0 operator delete(void*) <null> (libtsan.so.0+0x0000000795f8) https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2202 (db_flush_test+0x00000060b462) https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2226 (db_flush_test+0x00000060cbd8) https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BGWorkFlush(void*) db/db_impl/db_impl_compaction_flush.cc:2073 (db_flush_test+0x00000060d5ac) ...... Previous atomic write of size 4 at 0x7b6c00000e38 by main thread: #0 __tsan_atomic32_fetch_sub <null> (libtsan.so.0+0x00000006d721) https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<int>::fetch_sub(int, std::memory_order) /mnt/gvfs/third-party2/libgcc/c67031f0f739ac61575a061518d6ef5038f99f90/7.x/platform007/5620abc/include/c++/7.3.0/bits/atomic_base.h:524 (db_flush_test+0x0000005f9e38) https://github.com/facebook/rocksdb/issues/2 rocksdb::ColumnFamilyData::Unref() db/column_family.h:286 (db_flush_test+0x0000005f9e38) https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&, rocksdb::FlushReason, bool) db/db_impl/db_impl_compaction_flush.cc:1624 (db_flush_test+0x0000005f9e38) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::TEST_FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&) db/db_impl/db_impl_debug.cc:127 (db_flush_test+0x00000061ace9) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBFlushTest_CFDropRaceWithWaitForFlushMemTables_Test::TestBody() db/db_flush_test.cc:320 (db_flush_test+0x0000004b44e5) https://github.com/facebook/rocksdb/issues/6 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.7.0/fused-src/gtest/gtest-all.cc:3824 (db_flush_test+0x000000be2988) ...... It's still very clear the cause of the warning is because that TSAN treats results from relaxed atomic::fetch_sub() as non-atomic with the operation itself. We can make it more explicit by bumping up the order to CS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5723 Test Plan: Run all existing test. Differential Revision: D16908250 fbshipit-source-id: bf17d39ed19058372bdf97f6440a743f88153021 |
|
haoyuhuang | bb4178066d |
Integrate block cache tracer into db_impl (#5433)
Summary: This PR integrates the block cache tracer class into db_impl.cc. db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433 Differential Revision: D15728016 Pulled By: HaoyuHuang fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71 |
|
Maysam Yabandeh | ae05a83e19 |
Call ValidateOptions from SetOptions (#5368)
Summary: Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368 Differential Revision: D15540101 Pulled By: maysamyabandeh fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4 |
|
Siying Dong | 596cc1547a |
Update comments in column_family.h (#5347)
Summary: Document relationships of data structures declared in column_family.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/5347 Differential Revision: D15496941 Pulled By: siying fbshipit-source-id: 47b37835abba26aa31a94fabea6b2775483e0ccb |