Summary:
This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier.
A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following:
1) Setting the ratio to be equal to or smaller than 0 disables this collector
2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met.
3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction.
For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760
Test Plan: Added unit tests
Reviewed By: pdillinger
Differential Revision: D58734976
Pulled By: jowlyzhang
fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12687.
A block cache may be shared by multiple column families. Therefore, when getting the aggregated property of the block cache, we need to deduplicate by instances of the block cache, meaning the same instance should only be counted once.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12755
Reviewed By: jowlyzhang
Differential Revision: D58508819
Pulled By: ajkr
fbshipit-source-id: 3b746841d7eac59f900387ec3b8c19dbcd20aae4
Summary:
As title. Changes include the following
- `Refresh()` moved from `Iterator` interface to `IteratorBase` so that `AttributeGroupIterator` can also have Refresh() API (implemention will be added in the future PR)
- `TestIterate()`'s main logic refactored into `TestIterateImpl()` so that it can be shared with `TestIterateAttributeGroups()`
- `VerifyIterator()` also changed so that verification code can be shared between `Iterator` and `AttributeGroupIterator`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12776
Test Plan:
Single CF Iterator
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=0 --verify_iterator_with_expected_state_one_in=2
```
CoalescingIterator
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2
```
AttributeGroupIterator
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2
```
Reviewed By: cbi42
Differential Revision: D58626165
Pulled By: jaykorean
fbshipit-source-id: 3e0a6ff72e51ecef9e06b65acfa53605a24d742e
Summary:
This is not currently caught by our public CI so adding a form of this check to `make check-headers`, which is part of the build-linux-unity-and-headers GHA job.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12774
Test Plan: manually added a violation, which was caught. Also caught an existing trivial violation (fixed). CI will verify it plays nice with GHA.
Reviewed By: jowlyzhang
Differential Revision: D58616601
Pulled By: pdillinger
fbshipit-source-id: e656ce82709660c088a3d3a5e41dd07655cb40e0
Summary:
Instead of completely disallowing `MultiCfIterator` when one or more child iterators will do manual prefix iteration (as suggested in https://github.com/facebook/rocksdb/issues/12770 ), just let `MultiCfIterator` operate as is even when there's a possibility of undefined result from child iterators. If one or more child iterators cause the heap to be empty, just return early and `Valid()` will return false.
It is still possible that heap is not empty when one or more child iterators are returning wrong keys. Basically, MultiCfIterator behaves the same as what we described in https://github.com/facebook/rocksdb/wiki/Prefix-Seek#manual-prefix-iterating - "RocksDB will not return error when it is misused and the iterating result will be undefined."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12773
Test Plan:
MultiCfIterator added back to the stress test
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2
```
Reviewed By: cbi42
Differential Revision: D58612055
Pulled By: jaykorean
fbshipit-source-id: e0dd942bed98382c59d463412dd8f163e6790b93
Summary:
This PR fix a possible manual flush hanging scenario because of its expectation that others will clear out excessive memtables was not met. The root cause is the FlushRequest rescheduling logic is using a stricter criteria for what a write stall is about to happen means than `WaitUntilFlushWouldNotStallWrites` does. Currently, the former thinks a write stall is about to happen when the last memtable is half full, and it will instead reschedule queued FlushRequest and not actually proceed with the flush. While the latter thinks if we already start to use the last memtable, we should wait until some other background flush jobs clear out some memtables before proceed this manual flush.
If we make them use the same criteria, we can guarantee that at any time when`WaitUntilFlushWouldNotStallWrites` is waiting, it's not because the rescheduling logic is holding it back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12771
Test Plan: Added unit test
Reviewed By: ajkr
Differential Revision: D58603746
Pulled By: jowlyzhang
fbshipit-source-id: 9fa1c87c0175d47a40f584dfb1b497baa576755b
Summary:
When user-defined timestamps in Memtable only feature is enabled, all scheduled flushes go through a check to see if it's eligible to be rescheduled to retain user-defined timestamps. However when the user makes a manual flush request, their intention is for all the in memory data to be persisted into SST files as soon as possible. These two sides have some conflict of interest, the user can implement some workaround like https://github.com/facebook/rocksdb/issues/12631 to explicitly mark which one takes precedence. The implementation for this can be nuanced since the user needs to be aware of all the scenarios that can trigger a manual flush and handle the concurrency well etc.
In this PR, we updated the default behavior to give manual flush precedence when it's requested. The user-defined timestamps rescheduling mechanism is turned off when a manual flush is requested. Likewise, all error recovery triggered flushes skips the rescheduling mechanism too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12737
Test Plan: Add unit tests
Reviewed By: ajkr
Differential Revision: D58538246
Pulled By: jowlyzhang
fbshipit-source-id: 0b9b3d1af3e8d882f2d6a2406adda19324ba0694
Summary:
LLVM has detected a violation of `-Wdeprecated-dynamic-exception-spec`. Dynamic exceptions were removed in C++17. This diff fixes the deprecated instance(s).
See [Dynamic exception specification](https://en.cppreference.com/w/cpp/language/except_spec) and [noexcept specifier](https://en.cppreference.com/w/cpp/language/noexcept_spec).
Reviewed By: palmje
Differential Revision: D58528375
fbshipit-source-id: 130fecd3aa556e4cdb955feea53c442bd9fbc864
Summary:
Unknown why these would ignore options like deadline and read_tier. Setting total_order_seek=true is unnecessary because of the disable_prefix_seek (= true) parameter to NewIndexIterator. This is only used by the hash index, which uses total order seek if either the ReadOption or the parameter is true. The parameter is arguably redundant with the total_order_seek option, meaning it could be eliminated, but I think this case is exceptional (compared to e.g. no_io):
* Prefix seek is particular to user iterators, though might be usable, carefully, for other read operations.
* The historical default of total_order_seek=false in a sense is "wrong result by default" so cannot be interpreted as an intent to force prefix seek in an operation for which it might be usual or give bad results.
Also added a generic release note to cover this and related PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12764
Test Plan: existing tests
Reviewed By: hx235
Differential Revision: D58474240
Pulled By: pdillinger
fbshipit-source-id: 79014d9822ba8f09d57ce4524363aa0973017b68
Summary:
... in Index and CompressionDict readers (Filters in another PR). no_io and verify_checksums should be inferred from ReadOptions rather than specified redundantly.
Fixes incomplete propagation of ReadOptions in
UncompressionDictReader::GetOrReadUncompressionDictionar so is technically a functional change. (Related to https://github.com/facebook/rocksdb/issues/12757)
Also there was hardcoded no verify_checksums in DumpTable, but only for UncompressionDict, which doesn't make sense. Now using consistent ReadOptions and verify_checksum can be controlled for more reads together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12761
Test Plan: existing tests
Reviewed By: hx235
Differential Revision: D58450392
Pulled By: pdillinger
fbshipit-source-id: 0faed22832d664cb3b04a4c03ee77119977c200b
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.
Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
`background_close_inactive_wals`.
Stacked on https://github.com/facebook/rocksdb/issues/12731
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734
Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.
Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)
Reviewed By: cbi42
Differential Revision: D58295284
Pulled By: pdillinger
fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
Summary:
The crash test revealed a case in which the uncache functionality in ~BlockBasedTableReader could initiate an block read (IO), despite setting ReadOptions::read_tier = kBlockCacheTier.
The root cause is a place in the code where many people have over time decided to opt-in propagating ReadOptions and no one took the initiative to propagate ReadOptions by default (opt out / override only as needed). The fix is in partitioned_index_reader.cc. Here,
ReadOptions::readahead_size is opted-out to avoid churn in prefetch_test that is not clearly an improvement or regression. It's hard to tell given the poor state of relevant documentation https://github.com/facebook/rocksdb/issues/12756. The affected unit test was added in https://github.com/facebook/rocksdb/issues/10602.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12757
Test Plan: (Now postponed to a follow-up diff) I have added some new infrastructure to DEBUG builds to catch this specific kind of violation in unit tests and in the stress/crash test. `EnforceReadOpts` establishes a thread-local context under which we assert no IOs are performed if ReadOptions said it should be forbidden. With this new checking, the Uncache unit test would catch the critical step toward a violation (inner ReadOptions allowing IO, even if no IO is actually performed), which is fixed with the production code change.
Reviewed By: hx235
Differential Revision: D58421526
Pulled By: pdillinger
fbshipit-source-id: 9e9917a0e320c78967e751bd887926a2ed231d37
Summary:
Data race reported on
BlockBasedTableReader::Rep::uncache_aggressiveness because apparently a file can be marked obsolete through multiple table cache references in parallel. Using a relaxed atomic should resolve the race quite reasonably, especially considering this is a rare case and the racing writes should be storing the same value anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12753
Test Plan: watch for TSAN crash test results
Reviewed By: ltamasi
Differential Revision: D58397473
Pulled By: pdillinger
fbshipit-source-id: 3e78b6adac4f7a7056790754bee42b3cb244f037
Summary:
Crash test showed a potential use-after-free where a file marked as obsolete and eligible for uncache on destruction is destroyed in the VersionSet destructor, which only happens as part of DB shutdown. At that point, the in-memory column families have already been destroyed, so attempting to uncache could use-after-free on stuff like getting the `user_comparator()` from the `internal_comparator()`.
I attempted to make it smarter, but wasn't able to untangle the destruction dependencies in a way that was safe, understandable, and maintainable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12751
Test Plan:
Reproduced by adding uncache_aggressiveness to an existing (but otherwise unrelated) test. This makes it a fair regression test.
Also added testing to ensure that trivial moves and DB close & reopen are well behaved with uncache_aggressiveness. Specifically, this issue doesn't seem to be because things are uncached inappropriately in those cases.
Reviewed By: ltamasi
Differential Revision: D58390058
Pulled By: pdillinger
fbshipit-source-id: 66ac9cb13bf02638fa80ee5b7218153d8bc7cfd3
Summary:
This option is recommended to be set for production use:
We recommend to set track_and_verify_wals_in_manifest to true
for production
https://github.com/facebook/rocksdb/wiki/Track-WAL-in-MANIFEST
This adds this setting to the C API, so it can be used by other languages.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12749
Reviewed By: ltamasi
Differential Revision: D58382892
Pulled By: ajkr
fbshipit-source-id: 885de4539745a3119b6b2a162ab4fca9fa975283
Summary:
I haven't been able to reproduce the failure, seen in https://github.com/facebook/rocksdb/actions/runs/9420830905/job/25953696902?pr=12734
```
[ RUN ] DBBlockCacheTypeTestInstance/DBBlockCacheTypeTest.Uncache/2
db/db_block_cache_test.cc:1415: Failure
Expected equality of these values:
cache->GetOccupancyCount()
Which is: 37
kBaselineCount + kNumDataBlocks + meta_blocks_per_file
Which is: 15
Google Test trace:
db/db_block_cache_test.cc:1346: ua=10000
db/db_block_cache_test.cc:1344: partitioned=1
db/db_block_cache_test.cc:1418: Failure
...
```
But it's consistent with a SuperVersion reference sticking around beyond the CompactRange, as I can reproduce the result with a dangling Iterator. Like some other tests have had trouble with periodic stats popping up randomly, I suspect that could be the explanation in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12748
Test Plan: Watch for similar future failures
Reviewed By: ltamasi
Differential Revision: D58366031
Pulled By: pdillinger
fbshipit-source-id: b812ca8837b8c8b9cbda1b201d76316d145fa3ec
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/12567 disabled reopen with un-synced data loss in crash test since we discovered un-synced WAL loss and we currently don't support prefix recovery in reopen. This PR explicitly sync WAL data before close to avoid such data loss case from happening and add back the testing coverage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12746
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D58326890
Pulled By: hx235
fbshipit-source-id: 0865f715e97c5948d7cb3aea62fe2a626cb6522a
Summary:
The write_dbid_to_manifest option is documented as "We recommend setting this flag to true". However, there is no way to set this flag from the C API.
Add the following functions to the C API:
* rocksdb_get_db_identity
* rocksdb_options_get_write_dbid_to_manifest
* rocksdb_options_set_write_dbid_to_manifest
Add a test that this option preserves the ID across checkpoints.
c.cc:
* Remove outdated comments about missing C API functions that exist.
* Document that CopyString is intended for binary data and is not NUL terminated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12736
Reviewed By: ltamasi
Differential Revision: D58202117
Pulled By: ajkr
fbshipit-source-id: 707b110df5c4bd118d65548327428a53a9dc3019
Summary:
Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries.
Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive.
This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on.
The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizeable by CF.
Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths.
Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce an current failure.)
* Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item)
Marked EXPERIMENTAL until more thorough validation is complete.
Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694
Test Plan:
* Unit tests added, including refactoring an existing test to make better use of parameterized tests.
* Added to crash test.
* Performance, sample command:
```
for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done
```
Averaging 10 runs each case, block cache data block hit rates
```
lru_cache
UA=0 -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0
UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0
fixed_hyper_clock_cache
UA=0 -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2
auto_hyper_clock_cache
UA=0 -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5
UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8
```
Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings).
Reviewed By: ajkr
Differential Revision: D57932442
Pulled By: pdillinger
fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c
Summary:
As titled. Also added the newest user-defined timestamp into the `MemTableInfo`. This can be a useful info in the callback.
Added some unit tests as examples for how users can use two separate approaches to allow manual flush / manual compactions to go through when the user-defined timestamps in memtable only feature is enabled. One approach relies on selectively increase cutoff timestamp in `OnMemtableSeal` callback when it's initiated by a manual flush. Another approach is to increase cutoff timestamp in `OnManualFlushScheduled` callback. The caveats of the approaches are also documented in the unit test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12631
Reviewed By: ajkr
Differential Revision: D58260528
Pulled By: jowlyzhang
fbshipit-source-id: bf446d7140affdf124744095e0a179fa6e427532
Summary:
**Context/Summary:** a better API design is decided lately so we decided to revert these two changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12738
Test Plan: - CI
Reviewed By: ajkr
Differential Revision: D58162165
Pulled By: hx235
fbshipit-source-id: 9bbe4d2fe9fbe39213f4cf137a2d419e6ffb8e16
Summary:
Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.
The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083,
it would synced if using track_and_verify_wals_in_manifest=true.)
The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
metadata, and prevent hard linking those (with `trim_to_size=true`
from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
longer need to consider syncing as part of the FlushWAL in
GetLiveFilesStorageInfo. (We determine the max WAL number consistent
with the manifest file size, while holding DB mutex. Should make
track_and_verify_wals_in_manifest happy.) This makes the premise of
test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
no longer hit) so the test is removed, with crash test as backstop for
related issues. See https://github.com/facebook/rocksdb/issues/10185
Stacked on https://github.com/facebook/rocksdb/issues/12729
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731
Test Plan:
Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.
Reviewed By: cbi42
Differential Revision: D58199629
Pulled By: pdillinger
fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
Summary:
In places (e.g. GetSortedWals()) RocksDB relies on querying the file size or even reading the contents of files currently open for writing, and as in POSIX semantics, expects to see the flushed size and contents regardless of what has been synced. FaultInjectionTestFS historically did not emulate this behavior, only showing synced data from such read operations. (Different from FaultInjectionTestEnv--sigh.)
This change makes the "proper" behavior the default behavior, at least for GetFileSize and FSSequentialFile. However, this new functionality is disabled in db_stress because of undiagnosed, unresolved issues.
Also removes unused and confusing field `pos_at_last_flush_`
This change is needed to support testing a relevant bug fix (in a follow-up diff). Other suggested follow-up:
* Fix db_stress not to rely on the old behavior, and fix a related FIXME in db_stress_test_base.cc in LockWAL testing.
* Fill in some corner cases in the FileSystem API for reading unsynced data (see new TODO items).
* Consider deprecating and removing Flush() API functions from FileSystem APIs. It is not clear to me that there is a supported scenario in which they do anything but confuse API users and developers. If there is a use for them, it doesn't appear to be tested.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12729
Test Plan: applies to all unit tests successfully, just updating the unit test from https://github.com/facebook/rocksdb/issues/12556 due to relying on the errant behavior. Also added a specific unit test
Reviewed By: hx235
Differential Revision: D58091835
Pulled By: pdillinger
fbshipit-source-id: f47a63b2b000f5875b6293a98577bff663d7fd33
Summary:
When https://github.com/facebook/rocksdb/issues/12343 added support to bulk load external files while column family enables user-defined timestamps, it's a requirement that the external file doesn't overlap with the DB in key ranges. More specifically, the external file should not contain a user key (without timestamp) that already have some entries in the DB.
All the `*Overlap*` functions like `RangeOverlapWithMemtable`, `RangeOverlapWithCompaction` are using `CompareWithoutTimestamp` to check for overlap already. One thing that is missing here is we need to extend the external file's user key boundary for this check to avoid missing the checks for the boundary user keys. For example, with the current way of checking things where `external_file_info.smallest.user_key()` is used as the left boundary, and `external_file_info.largest.user_key()` is used as the right boundary, a file with this entry: (b, 40) can fit into a DB with these two entries: (b, 30), (c, 20).
To avoid this, we extend the user key boundaries used for overlap check, by updating the left boundary with the maximum timestamp and the right boundary with the minimum timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12735
Test Plan: Added unit test
Reviewed By: ltamasi
Differential Revision: D58152117
Pulled By: jowlyzhang
fbshipit-source-id: 9cba61e7357f6d76ad44c258381c35073ebbf347
Summary:
```
ERROR: AddressSanitizer: container-overflow on address 0x506000682221 at pc 0x5583da569f76 bp 0x7f0ec8a9ffb0 sp 0x7f0ec8a9f780
WRITE of size 53 at 0x506000682221 thread T29
#0 0x5583da569f75 in pread
https://github.com/facebook/rocksdb/issues/1 0x5583e334fde4 in rocksdb::PosixRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const /rocksdb/env/io_posix.cc:580:9
https://github.com/facebook/rocksdb/issues/2 0x5583e2cac42b in rocksdb::(anonymous namespace)::CompositeRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const /rocksdb/env/composite_env.cc:61:21
https://github.com/facebook/rocksdb/issues/3 0x5583e2c8a8e4 in rocksdb::(anonymous namespace)::LegacyRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const /rocksdb/env/env.cc:152:41
https://github.com/facebook/rocksdb/issues/4 0x5583e2d6cbfb in rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Slice*, char*, std::__2::unique_ptr<char [], std::__2::default_delete<char []>>*, rocksdb::Env::IOPriority) const /rocksdb/file/random_access_file_reader.cc:204:25
https://github.com/facebook/rocksdb/issues/5 0x5583e307c614 in rocksdb::ReadFooterFromFile(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, unsigned long, rocksdb::Footer*, unsigned long) /rocksdb/table/format.cc:383:17
https://github.com/facebook/rocksdb/issues/6 0x5583e2f88456 in rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::CacheReservationManager>, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, unsigned long) /rocksdb/table/block_based/block_based_table_reader.cc:610:9
https://github.com/facebook/rocksdb/issues/7 0x5583e2ef7837 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, bool) const /rocksdb/table/block_based/block_based_table_factory.cc:599:10
https://github.com/facebook/rocksdb/issues/8 0x5583e2ab873c in rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:142:34
https://github.com/facebook/rocksdb/issues/9 0x5583e2aba5f6 in rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:190:16
https://github.com/facebook/rocksdb/issues/10 0x5583e2abb7e1 in rocksdb::TableCache::NewIterator(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::RangeDelAggregator*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, rocksdb::TableReader**, rocksdb::HistogramImpl*, rocksdb::TableReaderCaller, rocksdb::Arena*, bool, int, unsigned long, rocksdb::InternalKey const*, rocksdb::InternalKey const*, bool) /rocksdb/db/table_cache.cc:235:9
https://github.com/facebook/rocksdb/issues/11 0x5583e28d14cf in rocksdb::BuildTable(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::__2::vector<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>, std::__2::allocator<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>>>, rocksdb::FileMetaData*, std::__2::vector<rocksdb::BlobFileAddition, std::__2::allocator<rocksdb::BlobFileAddition>>*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::__2::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const*, rocksdb::BlobFileCompletionCallback*, unsigned long*, unsigned long*, unsigned long*) /rocksdb/db/builder.cc:335:57
https://github.com/facebook/rocksdb/issues/12 0x5583e29bf29d in rocksdb::FlushJob::WriteLevel0Table() /rocksdb/db/flush_job.cc:919:11
https://github.com/facebook/rocksdb/issues/13 0x5583e29b33ac in rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*, bool*) /rocksdb/db/flush_job.cc:276:9
https://github.com/facebook/rocksdb/issues/14 0x5583e27a4781 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:258:19
https://github.com/facebook/rocksdb/issues/15 0x5583e27a7a96 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:377:14
https://github.com/facebook/rocksdb/issues/16 0x5583e27d6777 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2778:14
https://github.com/facebook/rocksdb/issues/17 0x5583e27d14e2 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2817:16
https://github.com/facebook/rocksdb/issues/18 0x5583e323d353 in std::__2::__function::__policy_func<void ()>::operator()[abi:ne180100]() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:714:12
https://github.com/facebook/rocksdb/issues/19 0x5583e323d353 in std::__2::function<void ()>::operator()() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:981:10
https://github.com/facebook/rocksdb/issues/20 0x5583e323d353 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /rocksdb/util/threadpool_imp.cc:266:5
https://github.com/facebook/rocksdb/issues/21 0x5583e3243d18 in decltype(std::declval<void (*)(void*)>()(std::declval<rocksdb::BGThreadMetadata*>())) std::__2::__invoke[abi:ne180100]<void (*)(void*), rocksdb::BGThreadMetadata*>(void (*&&)(void*), rocksdb::BGThreadMetadata*&&) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__type_traits/invoke.h:344:25
https://github.com/facebook/rocksdb/issues/22 0x5583e3243d18 in void std::__2::__thread_execute[abi:ne180100]<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*, 2ul>(std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>&, std::__2::__tuple_indices<2ul>) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:193:3
https://github.com/facebook/rocksdb/issues/23 0x5583e3243d18 in void* std::__2::__thread_proxy[abi:ne180100]<std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>>(void*) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:202:3
https://github.com/facebook/rocksdb/issues/24 0x5583da5e819e in asan_thread_start(void*) crtstuff.c
https://github.com/facebook/rocksdb/issues/25 0x7f0eda362a93 in start_thread nptl/pthread_create.c:447:8
https://github.com/facebook/rocksdb/issues/26 0x7f0eda3efc3b in clone3 misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
0x506000682221 is located 1 bytes inside of 56-byte region [0x506000682220,0x506000682258)
allocated by thread T29 here:
#0 0x5583da6281d1 in operator new(unsigned long)
https://github.com/facebook/rocksdb/issues/1 0x5583da6c987d in __libcpp_operator_new<unsigned long> /root/build/3rdParty/llvm/runtimes/include/c++/v1/new:271:10
https://github.com/facebook/rocksdb/issues/2 0x5583da6c987d in __libcpp_allocate /root/build/3rdParty/llvm/runtimes/include/c++/v1/new:295:10
https://github.com/facebook/rocksdb/issues/3 0x5583da6c987d in allocate /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocator.h:125:32
https://github.com/facebook/rocksdb/issues/4 0x5583da6c987d in allocate_at_least /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocator.h:131:13
https://github.com/facebook/rocksdb/issues/5 0x5583da6c987d in allocate_at_least<std::__2::allocator<char> > /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocate_at_least.h:34:20
https://github.com/facebook/rocksdb/issues/6 0x5583da6c987d in __allocate_at_least<std::__2::allocator<char> > /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocate_at_least.h:42:10
https://github.com/facebook/rocksdb/issues/7 0x5583da6c987d in std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>::__shrink_or_extend[abi:ne180100](unsigned long) /root/build/3rdParty/llvm/runtimes/include/c++/v1/string:3236:27
https://github.com/facebook/rocksdb/issues/8 0x5583e307c5aa in std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>::reserve(unsigned long) /root/build/3rdParty/llvm/runtimes/include/c++/v1/string:3207:3
https://github.com/facebook/rocksdb/issues/9 0x5583e307c5aa in rocksdb::ReadFooterFromFile(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, unsigned long, rocksdb::Footer*, unsigned long) /rocksdb/table/format.cc:382:18
https://github.com/facebook/rocksdb/issues/10 0x5583e2f88456 in rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::CacheReservationManager>, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, unsigned long) /rocksdb/table/block_based/block_based_table_reader.cc:610:9
https://github.com/facebook/rocksdb/issues/11 0x5583e2ef7837 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, bool) const /rocksdb/table/block_based/block_based_table_factory.cc:599:10
https://github.com/facebook/rocksdb/issues/12 0x5583e2ab873c in rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:142:34
https://github.com/facebook/rocksdb/issues/13 0x5583e2aba5f6 in rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:190:16
https://github.com/facebook/rocksdb/issues/14 0x5583e2abb7e1 in rocksdb::TableCache::NewIterator(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::RangeDelAggregator*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, rocksdb::TableReader**, rocksdb::HistogramImpl*, rocksdb::TableReaderCaller, rocksdb::Arena*, bool, int, unsigned long, rocksdb::InternalKey const*, rocksdb::InternalKey const*, bool) /rocksdb/db/table_cache.cc:235:9
https://github.com/facebook/rocksdb/issues/15 0x5583e28d14cf in rocksdb::BuildTable(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::__2::vector<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>, std::__2::allocator<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>>>, rocksdb::FileMetaData*, std::__2::vector<rocksdb::BlobFileAddition, std::__2::allocator<rocksdb::BlobFileAddition>>*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::__2::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const*, rocksdb::BlobFileCompletionCallback*, unsigned long*, unsigned long*, unsigned long*) /rocksdb/db/builder.cc:335:57
https://github.com/facebook/rocksdb/issues/16 0x5583e29bf29d in rocksdb::FlushJob::WriteLevel0Table() /rocksdb/db/flush_job.cc:919:11
https://github.com/facebook/rocksdb/issues/17 0x5583e29b33ac in rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*, bool*) /rocksdb/db/flush_job.cc:276:9
https://github.com/facebook/rocksdb/issues/18 0x5583e27a4781 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:258:19
https://github.com/facebook/rocksdb/issues/19 0x5583e27a7a96 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:377:14
https://github.com/facebook/rocksdb/issues/20 0x5583e27d6777 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2778:14
https://github.com/facebook/rocksdb/issues/21 0x5583e27d14e2 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2817:16
https://github.com/facebook/rocksdb/issues/22 0x5583e323d353 in std::__2::__function::__policy_func<void ()>::operator()[abi:ne180100]() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:714:12
https://github.com/facebook/rocksdb/issues/23 0x5583e323d353 in std::__2::function<void ()>::operator()() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:981:10
https://github.com/facebook/rocksdb/issues/24 0x5583e323d353 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /rocksdb/util/threadpool_imp.cc:266:5
https://github.com/facebook/rocksdb/issues/25 0x5583e3243d18 in decltype(std::declval<void (*)(void*)>()(std::declval<rocksdb::BGThreadMetadata*>())) std::__2::__invoke[abi:ne180100]<void (*)(void*), rocksdb::BGThreadMetadata*>(void (*&&)(void*), rocksdb::BGThreadMetadata*&&) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__type_traits/invoke.h:344:25
https://github.com/facebook/rocksdb/issues/26 0x5583e3243d18 in void std::__2::__thread_execute[abi:ne180100]<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*, 2ul>(std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>&, std::__2::__tuple_indices<2ul>) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:193:3
https://github.com/facebook/rocksdb/issues/27 0x5583e3243d18 in void* std::__2::__thread_proxy[abi:ne180100]<std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>>(void*) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:202:3
https://github.com/facebook/rocksdb/issues/28 0x5583da5e819e in asan_thread_start(void*) crtstuff.c
HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_container_overflow=0.
If you suspect a false positive see also: https://github.com/google/sanitizers/wiki/AddressSanitizerContainerOverflow.
AddressSanitizer:container-overflow in pread
Shadow bytes around the buggy address:
0x506000681f80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x506000682000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x506000682080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x506000682100: fa fa fa fa fa fa fa fa fa fa fa fa 00 00 00 00
0x506000682180: 00 00 00 fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x506000682200: fa fa fa fa[01]fc fc fc fc fc fc fa fa fa fa fa
0x506000682280: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x506000682300: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 01
0x506000682380: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
0x506000682400: fd fd fd fd fd fd fd fa fa fa fa fa fd fd fd fd
0x506000682480: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12722
Reviewed By: hx235
Differential Revision: D58118264
Pulled By: ajkr
fbshipit-source-id: 0dd914c886c022d82697b769d664ba52de0770de
Summary:
These messages indicate that SST file was created by a pre-9.0.0 RocksDB. Eventually, `TailPrefetchStats` might be removed, so it would be more informative if log message also included name of the affected SST file.
Issue: https://github.com/facebook/rocksdb/issues/12664
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12667
Reviewed By: ajkr
Differential Revision: D57464025
Pulled By: hx235
fbshipit-source-id: 12f2f2635e3092f8c29362aa132462492b5c1417
Summary:
rocksdb_batched_multi_get_cf has performance improvement than normal multi_get, however it needs a cf_handle arg, so add a C-API to get and destroy the default cf_handle, as many user only use the default cf.
Fixes https://github.com/facebook/rocksdb/issues/12316
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12514
Reviewed By: hx235
Differential Revision: D55922517
Pulled By: ajkr
fbshipit-source-id: c4cc4289f2cfd9efbb8f390a44a9d8d1ed08d9f0
Summary:
We plan to re-enable the test after fixing the test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12728
Test Plan: N/A. Disabling the test
Reviewed By: hx235
Differential Revision: D58071284
Pulled By: jaykorean
fbshipit-source-id: af6b45ec7654f9c7b40c36d3b59c7087e27a7af9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12723
`CoalescingIterator` doesn't support `Refresh` currently; the patch adds a check that was missing from https://github.com/facebook/rocksdb/pull/12721 to disable this operation when multi-CF iterators are in use in the stress test.
Reviewed By: jaykorean
Differential Revision: D58053334
fbshipit-source-id: 3146f0e7e87230b49b244cecdfcee345c0ce78fa
Summary:
This PR adds a `DB::WriteWithCallback` API that does the same things as `DB::Write` while takes an argument `UserWriteCallback` to execute custom callback functions during the write.
We currently support two types of callback functions: `OnWriteEnqueued` and `OnWalWriteFinish`. The former is invoked after the write is enqueued, and the later is invoked after WAL write finishes when applicable.
These callback functions are intended for users to use to improve synchronization between concurrent writes, their execution is on the write's critical path so it will impact the write's latency if not used properly. The documentation for the callback interface mentioned this and suggest user to keep these callback functions' implementation minimum.
Although transaction interfaces' writes doesn't yet allow user to specify such a user write callback argument, the `DBImpl::Write*` type of APIs do not differentiate between regular DB writes or writes coming from the transaction layer when it comes to supporting this `UserWriteCallback`. These callbacks works for all the write modes including: default write mode, Options.two_write_queues, Options.unordered_write, Options.enable_pipelined_write
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12603
Test Plan: Added unit test in ./write_callback_test
Reviewed By: anand1976
Differential Revision: D58044638
Pulled By: jowlyzhang
fbshipit-source-id: 87a84a0221df8f589ec8fc4d74597e72ce97e4cd
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12717
The PR adds `Transaction::MultiGetEntity` to the stress tests. Similarly to what we do for `Transaction::MultiGet`, in this mode we open a transaction and randomly add writes for some of the queried keys to it while keeping track of the values written on a per-key basis. The results of `Transaction::MultiGetEntity` can then be validated against these expected values (in order to test the read-your-own-writes functionality) as well as the results returned by `Transaction::GetEntity` for the same keys.
Reviewed By: jaykorean
Differential Revision: D57990210
fbshipit-source-id: 9bf3bb292051c2c57757f86b517919197b03c524
Summary:
Introduce `use_multi_cf_iterator`, and when it's set, use `CoalescingIterator` in `TestIterate()`. Because all the column families contain the same data in today's Stress Test, we can compare `CoalescingIterator` against any `DBIter` from any of the column families. Currently, coalescing logic verification is done by unit tests, but we can extend the stress test to support different data in different column families in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12706
Test Plan:
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1
```
**More PRs to come**
- Use `AttributeGroupIterator` when both `use_multi_cf_iterator` and `use_attribute_group` are true
- Support `Refresh()` in `CoalescingIterator`
- Extend Stress Test to support different data in different CFs (Long-term)
Reviewed By: ltamasi
Differential Revision: D58020247
Pulled By: jaykorean
fbshipit-source-id: 8e2483b85cf2bb0f5a9bb44851601bbf063484ec