De-template block based table iterator (#6531)
Summary:
Right now the block based table iterator is used both for iterating data in a block based table and as the index iterator for the partitioned index. This was initially convenient for introducing a new iterator and block type for the new index format while reducing code change. However, these two usages don't go with each other very well. For example, Prev() is never called for the partitioned index iterator, and some other complexity is maintained in block based iterators that is not needed for the index iterator, yet maintainers always need to reason about it. Furthermore, the template usage does not follow the Google C++ Style that we follow, and it tangles a large chunk of code together. This commit separates the two iterators. Here is what is done so far:
1. Copy the block based iterator code into the partitioned index iterator, and de-template both.
2. Remove some code not needed for the partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when the partitioned index is enabled in the first place. It's unlikely to cause a performance regression, as creating a new partitioned index block is much rarer than creating data blocks.
3. Separate out the prefetch logic into a helper class that both classes call.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 19:17:34 +00:00
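The split described in this commit message can be pictured with a minimal sketch: two plain (non-template) iterator classes that each own an instance of a shared prefetch helper. Every name below (`PrefetchHelper`, `DataIterSketch`, `IndexIterSketch`, the readahead constants and the doubling policy) is hypothetical and chosen for illustration only; it is not the code from the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative constants (assumptions, not RocksDB's actual values).
constexpr std::size_t kInitialReadahead = 8 * 1024;
constexpr std::size_t kMaxReadahead = 256 * 1024;

// Hypothetical helper holding the readahead state that both iterators share.
class PrefetchHelper {
 public:
  // Returns the readahead size to use for a block read at [offset, offset+len).
  // Sequential reads grow the readahead up to a cap; any other read resets it.
  std::size_t OnBlockRead(std::uint64_t offset, std::size_t len) {
    const bool sequential = has_prev_ && offset == next_expected_offset_;
    if (!sequential) {
      readahead_ = 0;
    } else if (readahead_ == 0) {
      readahead_ = kInitialReadahead;
    } else if (readahead_ * 2 <= kMaxReadahead) {
      readahead_ *= 2;
    } else {
      readahead_ = kMaxReadahead;
    }
    has_prev_ = true;
    next_expected_offset_ = offset + len;
    return readahead_;
  }

 private:
  bool has_prev_ = false;
  std::uint64_t next_expected_offset_ = 0;
  std::size_t readahead_ = 0;
};

// Stand-in for the data block iterator: keeps backward iteration.
class DataIterSketch {
 public:
  std::size_t ReadBlock(std::uint64_t offset, std::size_t len) {
    return prefetch_.OnBlockRead(offset, len);
  }
  bool SupportsPrev() const { return true; }

 private:
  PrefetchHelper prefetch_;
};

// Stand-in for the partitioned index iterator: no Prev(), same helper.
class IndexIterSketch {
 public:
  std::size_t ReadBlock(std::uint64_t offset, std::size_t len) {
    return prefetch_.OnBlockRead(offset, len);
  }
  bool SupportsPrev() const { return false; }

 private:
  PrefetchHelper prefetch_;
};
```

Because the helper is a member rather than a template parameter, each iterator class can drop the code it does not need without affecting the other.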
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "table/block_based/block_based_table_iterator.h"

namespace ROCKSDB_NAMESPACE {

2022-05-20 23:09:33 +00:00
void BlockBasedTableIterator::SeekToFirst() { SeekImpl(nullptr, false); }

void BlockBasedTableIterator::Seek(const Slice& target) {
  SeekImpl(&target, true);
}

2023-12-06 21:48:15 +00:00
void BlockBasedTableIterator::SeekSecondPass(const Slice* target) {
  AsyncInitDataBlock(/*is_first_pass=*/false);

  if (target) {
    block_iter_.Seek(*target);
  } else {
    block_iter_.SeekToFirst();
  }
  FindKeyForward();

  CheckOutOfBound();

  if (target) {
    assert(!Valid() || icomp_.Compare(*target, key()) <= 0);
  }
}

2022-05-20 23:09:33 +00:00
void BlockBasedTableIterator::SeekImpl(const Slice* target,
                                       bool async_prefetch) {
2023-09-23 01:12:08 +00:00
  bool is_first_pass = !async_read_in_progress_;
2023-12-06 21:48:15 +00:00

  if (!is_first_pass) {
    SeekSecondPass(target);
    return;
  }

  ResetBlockCacheLookupVar();

2023-09-23 01:12:08 +00:00
  bool autotune_readaheadsize = is_first_pass &&
                                read_options_.auto_readahead_size &&
                                read_options_.iterate_upper_bound;

  if (autotune_readaheadsize &&
      table_->get_rep()->table_options.block_cache.get() &&
2023-12-06 21:48:15 +00:00
      direction_ == IterDirection::kForward) {
2023-09-23 01:12:08 +00:00
    readahead_cache_lookup_ = true;
  }

De-template block based table iterator (#6531)
2020-03-16 19:17:34 +00:00
  is_out_of_bound_ = false;
  is_at_first_key_from_index_ = false;
Much better stats for seeks and prefix filtering (#11460)
Summary:
We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
This change does several things:
* Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
* Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
* We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
* For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
* The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
* The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
* Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
Test Plan:
unit tests updated, including updating many to pop the stat value since last read to improve test
readability and maintainability.
Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
```
And run the before and after builds simultaneously with
```
TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
```
Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec; 18.4 (± 0.0) MB/sec
After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec; 19.1 (± 0.0) MB/sec
Reviewed By: ajkr
Differential Revision: D46029177
Pulled By: pdillinger
fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
2023-05-19 22:25:49 +00:00
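The inference described in this commit message (a seek that accessed a data block but never had `value()` called is treated as a seek "false positive") can be sketched as simple arithmetic over counters. The struct and field names below are hypothetical stand-ins for values read from the seek tickers, and the second formula is one possible way to turn the counts into a filter-style rate; neither is taken from the commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of seek counters for one level group (last level or
// non-last levels) on seeks where a prefix filter was consulted.
struct SeekFilterCounts {
  std::uint64_t filtered;     // seeks ruled out by the filter
  std::uint64_t data;         // seeks that passed the filter and read a data block
  std::uint64_t data_useful;  // of `data`, seeks where value() was then accessed
};

// Fraction of data-block-accessing seeks that never reached value().
// Returns -1.0 when there is nothing to measure.
double SeekFalsePositiveFraction(const SeekFilterCounts& c) {
  if (c.data == 0 || c.data_useful > c.data) return -1.0;
  return static_cast<double>(c.data - c.data_useful) /
         static_cast<double>(c.data);
}

// Assumed filter-style rate: false positives over all seeks the filter could
// have rejected (false positives plus seeks it did reject).
double InferredFilterFalsePositiveRate(const SeekFilterCounts& c) {
  if (c.data_useful > c.data) return -1.0;
  const std::uint64_t false_pos = c.data - c.data_useful;
  const std::uint64_t negatives = false_pos + c.filtered;
  if (negatives == 0) return -1.0;
  return static_cast<double>(false_pos) / static_cast<double>(negatives);
}
```

As the message notes, a seek false positive can have other explanations, so both numbers are estimates rather than exact filter measurements.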
  seek_stat_state_ = kNone;
  bool filter_checked = false;
  if (target &&
      !CheckPrefixMayMatch(*target, IterDirection::kForward, &filter_checked)) {
De-template block based table iterator (#6531)
2020-03-16 19:17:34 +00:00
    ResetDataIter();
Much better stats for seeks and prefix filtering (#11460)
2023-05-19 22:25:49 +00:00
    RecordTick(table_->GetStatistics(), is_last_level_
                                            ? LAST_LEVEL_SEEK_FILTERED
                                            : NON_LAST_LEVEL_SEEK_FILTERED);
De-template block based table iterator (#6531)
2020-03-16 19:17:34 +00:00
    return;
  }
Much better stats for seeks and prefix filtering (#11460)
2023-05-19 22:25:49 +00:00
  if (filter_checked) {
    seek_stat_state_ = kFilterUsed;
    RecordTick(table_->GetStatistics(), is_last_level_
                                            ? LAST_LEVEL_SEEK_FILTER_MATCH
                                            : NON_LAST_LEVEL_SEEK_FILTER_MATCH);
  }
De-template block based table iterator (#6531)
2020-03-16 19:17:34 +00:00

  bool need_seek_index = true;
2023-09-23 01:12:08 +00:00

  // In case of readahead_cache_lookup_, index_iter_ could change to find the
2023-12-06 21:48:15 +00:00
  // readahead size in BlockCacheLookupForReadAheadSize so it needs to
  // reseek.
2023-09-23 01:12:08 +00:00
  if (IsIndexAtCurr() && block_iter_points_to_real_block_ &&
      block_iter_.Valid()) {
De-template block based table iterator (#6531)
2020-03-16 19:17:34 +00:00
    // Reseek.
    prev_block_offset_ = index_iter_->value().handle.offset();

    if (target) {
      // We can avoid an index seek if:
      // 1. The new seek key is larger than the current key
      // 2. The new seek key is within the upper bound of the block
      // Since we don't necessarily know the internal key for either
      // the current key or the upper bound, we check user keys and
      // exclude the equality case. Considering internal keys can
      // improve for the boundary cases, but it would complicate the
      // code.
      if (user_comparator_.Compare(ExtractUserKey(*target),
                                   block_iter_.user_key()) > 0 &&
          user_comparator_.Compare(ExtractUserKey(*target),
                                   index_iter_->user_key()) < 0) {
        need_seek_index = false;
      }
    }
  }

  if (need_seek_index) {
    if (target) {
      index_iter_->Seek(*target);
    } else {
      index_iter_->SeekToFirst();
    }
2023-10-03 00:47:24 +00:00
    is_index_at_curr_block_ = true;
De-template block based table iterator (#6531)
2020-03-16 19:17:34 +00:00
    if (!index_iter_->Valid()) {
      ResetDataIter();
      return;
    }
  }

2023-09-23 01:12:08 +00:00
  // After reseek, index_iter_ point to the right key i.e. target in
  // case of readahead_cache_lookup_. So index_iter_ can be used directly.
De-template block based table iterator (#6531)
2020-03-16 19:17:34 +00:00
  IndexValue v = index_iter_->value();
  const bool same_block = block_iter_points_to_real_block_ &&
                          v.handle.offset() == prev_block_offset_;

  if (!v.first_internal_key.empty() && !same_block &&
      (!target || icomp_.Compare(*target, v.first_internal_key) <= 0) &&
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-16 00:37:23 +00:00
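The calling convention described in this commit message can be sketched as follows. This is a minimal toy, not RocksDB code: `ToyDeferredIterator` and `ReadValue` are hypothetical names, and the "block read" is simulated by a flag. Only the rule that `PrepareValue()` must succeed before `value()` is called comes from the text above.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for an internal iterator with deferred value loading.
// Only the PrepareValue()/value() calling convention mirrors the commit
// message; everything else is invented for illustration.
class ToyDeferredIterator {
 public:
  ToyDeferredIterator(std::string first_key, std::string block_value,
                      bool read_fails)
      : key_(std::move(first_key)),
        block_value_(std::move(block_value)),
        read_fails_(read_fails) {}

  // The key is known from the index, so no block read is needed for it.
  const std::string& key() const { return key_; }

  // Performs the deferred "block read". Returns false on failure, in which
  // case the iterator becomes invalid and value() must not be called.
  bool PrepareValue() {
    if (loaded_.has_value()) return true;
    if (read_fails_) {
      valid_ = false;
      return false;
    }
    loaded_ = block_value_;
    return true;
  }

  bool Valid() const { return valid_; }

  // Precondition: PrepareValue() returned true.
  const std::string& value() const {
    assert(loaded_.has_value());
    return *loaded_;
  }

 private:
  std::string key_;
  std::string block_value_;
  bool read_fails_;
  bool valid_ = true;
  std::optional<std::string> loaded_;
};

// A call site that opted into unprepared values: it checks the result of
// PrepareValue() before touching value(), so a failed read is reported
// instead of silently yielding an empty value.
std::optional<std::string> ReadValue(ToyDeferredIterator& it) {
  if (!it.PrepareValue()) {
    return std::nullopt;
  }
  return it.value();
}
```

Call sites that never opt in keep calling `value()` directly, which is why the change does not have to touch most of them.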
|
|
|
allow_unprepared_value_) {
|
|
|
|
// Index contains the first key of the block, and it's >= target.
|
|
|
|
// We can defer reading the block.
|
|
|
|
is_at_first_key_from_index_ = true;
|
|
|
|
// ResetDataIter() will invalidate block_iter_. Thus, there is no need to
|
|
|
|
// call CheckDataBlockWithinUpperBound() to check for iterate_upper_bound
|
|
|
|
// as that will be done later when the data block is actually read.
|
|
|
|
ResetDataIter();
|
|
|
|
} else {
|
|
|
|
// Need to use the data block.
|
|
|
|
if (!same_block) {
|
2022-05-20 23:09:33 +00:00
|
|
|
if (read_options_.async_io && async_prefetch) {
|
2023-12-06 21:48:15 +00:00
|
|
|
AsyncInitDataBlock(/*is_first_pass=*/true);
|
2022-05-20 23:09:33 +00:00
|
|
|
if (async_read_in_progress_) {
|
|
|
|
// Status::TryAgain indicates asynchronous request for retrieval of
|
|
|
|
// data blocks has been submitted. So it should return at this point
|
2023-12-06 21:48:15 +00:00
|
|
|
// and Seek should be called again to retrieve the requested block
|
|
|
|
// and execute the remaining code.
|
2022-05-20 23:09:33 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
InitDataBlock();
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// When the user does a reseek, the iterate_upper_bound might have
|
|
|
|
// changed. CheckDataBlockWithinUpperBound() needs to be called
|
|
|
|
// explicitly if the reseek ends up in the same data block.
|
|
|
|
// If the reseek ends up in a different block, InitDataBlock() will do
|
|
|
|
// the iterator upper bound check.
|
|
|
|
CheckDataBlockWithinUpperBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (target) {
|
|
|
|
block_iter_.Seek(*target);
|
|
|
|
} else {
|
|
|
|
block_iter_.SeekToFirst();
|
|
|
|
}
|
|
|
|
FindKeyForward();
|
|
|
|
}
|
|
|
|
|
|
|
|
CheckOutOfBound();
|
|
|
|
|
|
|
|
if (target) {
|
|
|
|
assert(!Valid() || icomp_.Compare(*target, key()) <= 0);
|
|
|
|
}
|
|
|
|
}
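The async branch above returns early after submitting a read and relies on the caller to call Seek again. A minimal sketch of that two-pass pattern, assuming a toy class `ToyAsyncSeeker` and helper `SeekUntilPositioned` (both hypothetical, with the asynchronous read assumed to have completed by the second call):

```cpp
#include <cassert>
#include <string>

// Hypothetical two-pass seek: the first call only submits a read request,
// the second call consumes the result. Names are invented for illustration
// and do not correspond to RocksDB classes.
class ToyAsyncSeeker {
 public:
  // Returns true once positioned; false means "request submitted, call again".
  bool Seek(const std::string& target) {
    if (!read_in_progress_ || target != pending_target_) {
      // First pass: submit the request and return without blocking.
      pending_target_ = target;
      read_in_progress_ = true;
      ++submitted_reads_;
      return false;
    }
    // Second pass: the earlier request is assumed to have completed, so the
    // remaining seek logic can run against the fetched block.
    current_key_ = pending_target_;
    read_in_progress_ = false;
    return true;
  }

  const std::string& current_key() const { return current_key_; }
  int submitted_reads() const { return submitted_reads_; }

 private:
  bool read_in_progress_ = false;
  int submitted_reads_ = 0;
  std::string pending_target_;
  std::string current_key_;
};

// Drives the retry loop a caller would need. Returns the number of Seek calls.
int SeekUntilPositioned(ToyAsyncSeeker& seeker, const std::string& target) {
  int calls = 0;
  bool positioned = false;
  while (!positioned) {
    positioned = seeker.Seek(target);
    ++calls;
  }
  return calls;
}
```

In this toy, one logical seek costs two calls but only one submitted read.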
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::SeekForPrev(const Slice& target) {
|
2023-10-03 00:47:24 +00:00
|
|
|
direction_ = IterDirection::kBackward;
|
|
|
|
ResetBlockCacheLookupVar();
|
|
|
|
is_out_of_bound_ = false;
|
|
|
|
is_at_first_key_from_index_ = false;
|
Much better stats for seeks and prefix filtering (#11460)
Summary:
We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
This change does several things:
* Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
* Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
* We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
* For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
* The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
* The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
* Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
Test Plan:
unit tests updated, including updating many to pop the stat value since last read to improve test
readability and maintainability.
Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
```
And run simultaneous before&after with
```
TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
```
Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec; 18.4 (± 0.0) MB/sec
After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec; 19.1 (± 0.0) MB/sec
Reviewed By: ajkr
Differential Revision: D46029177
Pulled By: pdillinger
fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
2023-05-19 22:25:49 +00:00
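The inference described in this commit message can be worked through with a small calculation. The sketch below uses a hypothetical counter struct whose field names are illustrative and are not the RocksDB ticker names. It also assumes every filter match goes on to read a data block, so a match with no subsequent `value()` call counts as an apparent false positive.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of seek counters for one level group. The field
// names are illustrative only.
struct SeekFilterCounters {
  uint64_t filtered;          // seeks rejected by the prefix filter
  uint64_t filter_match;      // seeks the prefix filter let through
  uint64_t match_value_read;  // filter matches where value() was later read
};

// Filter matches that never led to a value() call.
uint64_t ApparentFalsePositives(const SeekFilterCounters& c) {
  assert(c.match_value_read <= c.filter_match);
  return c.filter_match - c.match_value_read;
}

// Apparent false positive rate: FP / (FP + true negatives), where the
// true negatives are the seeks the filter rejected.
double ApparentFalsePositiveRate(const SeekFilterCounters& c) {
  const uint64_t fp = ApparentFalsePositives(c);
  const uint64_t negatives = fp + c.filtered;
  if (negatives == 0) return 0.0;
  return static_cast<double>(fp) / static_cast<double>(negatives);
}
```

As the message notes, a seek can look like a false positive for other reasons, so this figure is an estimate rather than an exact filter false positive rate.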
|
|
|
seek_stat_state_ = kNone;
|
|
|
|
bool filter_checked = false;
|
|
|
|
// For now totally disable prefix seek in auto prefix mode because we don't
|
|
|
|
// have logic
|
|
|
|
if (!CheckPrefixMayMatch(target, IterDirection::kBackward, &filter_checked)) {
|
|
|
|
ResetDataIter();
|
|
|
|
RecordTick(table_->GetStatistics(), is_last_level_
|
|
|
|
? LAST_LEVEL_SEEK_FILTERED
|
|
|
|
: NON_LAST_LEVEL_SEEK_FILTERED);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (filter_checked) {
|
|
|
|
seek_stat_state_ = kFilterUsed;
|
|
|
|
RecordTick(table_->GetStatistics(), is_last_level_
|
|
|
|
? LAST_LEVEL_SEEK_FILTER_MATCH
|
|
|
|
: NON_LAST_LEVEL_SEEK_FILTER_MATCH);
|
|
|
|
}
|
|
|
|
|
|
|
|
SavePrevIndexValue();
|
|
|
|
|
|
|
|
// Call Seek() rather than SeekForPrev() in the index block, because the
|
|
|
|
// target data block will likely contain the position for `target`, the
|
|
|
|
// same as Seek(), rather than before.
|
|
|
|
// For example, if we have three data blocks, each containing two keys:
|
|
|
|
// [2, 4] [6, 8] [10, 12]
|
|
|
|
// (the keys in the index block would be [4, 8, 12])
|
|
|
|
// and the user calls SeekForPrev(7), we need to go to the second block,
|
|
|
|
// just like if they call Seek(7).
|
|
|
|
// The only case where the block is different is when they seek to a position
|
|
|
|
// between blocks. For example, if they SeekForPrev(5), we should go to the
|
|
|
|
// first block, rather than the second. However, we don't have the information
|
|
|
|
// to distinguish the two unless we read the second block. In this case, we'll
|
|
|
|
// end up reading two blocks.
|
|
|
|
index_iter_->Seek(target);
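The example in the comment above can be traced with a toy model. This is a sketch under stated assumptions, not the RocksDB implementation: blocks are sorted, non-empty vectors of integers, the index holds each block's last key, and `ToySeekForPrev` and `SeekForPrevResult` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

struct SeekForPrevResult {
  std::optional<int> key;  // largest key <= target, if any
  int blocks_read;         // number of data blocks touched
};

// Toy model: the index is searched with Seek() semantics (first block whose
// last key is >= target), then the search steps back one block if the chosen
// block holds no key <= target.
SeekForPrevResult ToySeekForPrev(const std::vector<std::vector<int>>& blocks,
                                 int target) {
  if (blocks.empty()) return {std::nullopt, 0};

  // Index keys: the last key of each block, e.g. [4, 8, 12].
  std::vector<int> index;
  for (const auto& block : blocks) index.push_back(block.back());

  // Index Seek(): first index key >= target.
  auto it = std::lower_bound(index.begin(), index.end(), target);
  std::size_t i = static_cast<std::size_t>(it - index.begin());
  if (i == blocks.size()) i = blocks.size() - 1;  // past the end: last block

  int blocks_read = 0;
  while (true) {
    const auto& block = blocks[i];
    ++blocks_read;
    // Largest key <= target inside this block, if there is one.
    auto ub = std::upper_bound(block.begin(), block.end(), target);
    if (ub != block.begin()) return {*(ub - 1), blocks_read};
    if (i == 0) return {std::nullopt, blocks_read};
    --i;  // target fell between blocks: read the previous block too
  }
}
```

With blocks `[2, 4] [6, 8] [10, 12]`, a target of 7 is resolved inside the second block alone, while a target of 5 reads the second block, finds nothing at or below 5, and falls back to the first.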
|
2023-10-03 00:47:24 +00:00
|
|
|
is_index_at_curr_block_ = true;
|
  if (!index_iter_->Valid()) {
    auto seek_status = index_iter_->status();
    // Check for IO error
    if (!seek_status.IsNotFound() && !seek_status.ok()) {
      ResetDataIter();
      return;
    }

    // With prefix index, Seek() returns NotFound if the prefix doesn't exist
    if (seek_status.IsNotFound()) {
      // Any key less than the target is fine for prefix seek
      ResetDataIter();
      return;
    } else {
      index_iter_->SeekToLast();
    }
    // Check for IO error
    if (!index_iter_->Valid()) {
      ResetDataIter();
      return;
    }
  }

  InitDataBlock();

  block_iter_.SeekForPrev(target);

  FindKeyBackward();
  CheckDataBlockWithinUpperBound();
  assert(!block_iter_.Valid() ||
         icomp_.Compare(target, block_iter_.key()) >= 0);
}
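
The index-seek reasoning in the SeekForPrev() comment can be sketched with a toy model. This is only an illustration under stated assumptions: keys are plain integers, every block is non-empty, each index entry is the last key of its block, and `IndexSeek`, `BlockSeekForPrev`, and `ToySeekForPrev` are hypothetical names rather than RocksDB APIs. It counts how many data blocks are touched, so the boundary case (two blocks read) becomes visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: each data block is a sorted, non-empty vector of integer keys.
// The "index" is implied by the last key of every block (assumption).
using Block = std::vector<int>;

struct ToyResult {
  bool found = false;   // true if some key <= target exists
  int key = 0;          // that key (meaningful only when found)
  int blocks_read = 0;  // number of data blocks touched
};

// Index Seek(): position of the first block whose last key is >= target.
// Returns blocks.size() when target is past the last block.
std::size_t IndexSeek(const std::vector<Block>& blocks, int target) {
  auto it = std::lower_bound(
      blocks.begin(), blocks.end(), target,
      [](const Block& b, int t) { return b.back() < t; });
  return static_cast<std::size_t>(it - blocks.begin());
}

// In-block SeekForPrev(): largest key <= target inside one block.
bool BlockSeekForPrev(const Block& block, int target, int* out) {
  auto it = std::upper_bound(block.begin(), block.end(), target);
  if (it == block.begin()) return false;  // every key is > target
  *out = *(it - 1);
  return true;
}

ToyResult ToySeekForPrev(const std::vector<Block>& blocks, int target) {
  ToyResult r;
  if (blocks.empty()) return r;
  std::size_t i = IndexSeek(blocks, target);
  if (i == blocks.size()) i = blocks.size() - 1;  // like index SeekToLast()
  ++r.blocks_read;
  r.found = BlockSeekForPrev(blocks[i], target, &r.key);
  // Boundary case: nothing <= target in this block, so step back one block.
  // In this toy model that stands in for the backward scan after the seek.
  if (!r.found && i > 0) {
    ++r.blocks_read;
    r.key = blocks[i - 1].back();
    r.found = true;
  }
  return r;
}
```

With blocks [2, 4] [6, 8] [10, 12], `ToySeekForPrev(blocks, 7)` touches only the second block, while `ToySeekForPrev(blocks, 5)` lands in the second block first and then has to fall back to the first one.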

void BlockBasedTableIterator::SeekToLast() {
  direction_ = IterDirection::kBackward;
  ResetBlockCacheLookupVar();
  is_out_of_bound_ = false;
  is_at_first_key_from_index_ = false;
Much better stats for seeks and prefix filtering (#11460)
Summary:
We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
This change does several things:
* Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
* Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
* We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
* For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
* The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
* The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
* Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
Test Plan:
unit tests updated, including updating many to pop the stat value since last read to improve test
readability and maintainability.
Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
```
And run simultaneous before&after with
```
TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
```
Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec; 18.4 (± 0.0) MB/sec
After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec; 19.1 (± 0.0) MB/sec
Reviewed By: ajkr
Differential Revision: D46029177
Pulled By: pdillinger
fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
2023-05-19 22:25:49 +00:00
  seek_stat_state_ = kNone;

  SavePrevIndexValue();

  index_iter_->SeekToLast();
  is_index_at_curr_block_ = true;

  if (!index_iter_->Valid()) {
    ResetDataIter();
    return;
  }

  InitDataBlock();
  block_iter_.SeekToLast();
  FindKeyBackward();
  CheckDataBlockWithinUpperBound();
}

void BlockBasedTableIterator::Next() {
  if (is_at_first_key_from_index_ && !MaterializeCurrentBlock()) {
    return;
  }
  assert(block_iter_points_to_real_block_);
  block_iter_.Next();
  FindKeyForward();
  CheckOutOfBound();
}

bool BlockBasedTableIterator::NextAndGetResult(IterateResult* result) {
  Next();
  bool is_valid = Valid();
  if (is_valid) {
    result->key = key();
    result->bound_check_result = UpperBoundCheckResult();
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-16 00:37:23 +00:00
    result->value_prepared = !is_at_first_key_from_index_;
  }
  return is_valid;
}
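
The `value_prepared` flag set in NextAndGetResult() reflects the deferred value loading described in #6621: a key can be reported from the index before the data block is read, and callers that allow this must call `PrepareValue()` before `value()`. The following is a rough sketch of that contract only, not the RocksDB implementation. `ToyBlock`, `ToyDeferredIterator`, `SeekToBlock`, and `blocks_loaded` are hypothetical names, and "loading" a block is simulated with a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for one data block plus the first key the index holds for it.
struct ToyBlock {
  std::string first_key;  // known from the index without reading the block
  std::vector<std::pair<std::string, std::string>> entries;  // key/value pairs
};

class ToyDeferredIterator {
 public:
  explicit ToyDeferredIterator(const std::vector<ToyBlock>* blocks)
      : blocks_(blocks) {}

  // Position at the first key of block `i` using only index information.
  void SeekToBlock(std::size_t i) {
    block_ = i;
    pos_ = 0;
    at_first_key_from_index_ = true;
  }

  bool Valid() const { return block_ < blocks_->size(); }

  // Mirrors `result->value_prepared = !is_at_first_key_from_index_`.
  bool value_prepared() const { return !at_first_key_from_index_; }

  const std::string& key() const {
    const ToyBlock& b = (*blocks_)[block_];
    return at_first_key_from_index_ ? b.first_key : b.entries[pos_].first;
  }

  // Simulates reading the data block. A real implementation could fail here
  // (for example with an IO error), which is why it returns a bool.
  bool PrepareValue() {
    if (at_first_key_from_index_) {
      ++blocks_loaded_;
      at_first_key_from_index_ = false;
    }
    return true;
  }

  const std::string& value() const {
    assert(!at_first_key_from_index_);  // PrepareValue() must come first
    return (*blocks_)[block_].entries[pos_].second;
  }

  int blocks_loaded() const { return blocks_loaded_; }

 private:
  const std::vector<ToyBlock>* blocks_;
  std::size_t block_ = 0;
  std::size_t pos_ = 0;
  bool at_first_key_from_index_ = false;
  int blocks_loaded_ = 0;
};
```

A caller that only compares keys never pays for the block read. A caller that needs the value checks `value_prepared()` (or simply calls `PrepareValue()`) and handles a failure there instead of receiving an empty slice from `value()`.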

void BlockBasedTableIterator::Prev() {
  if (readahead_cache_lookup_ && !IsIndexAtCurr()) {
    // In case of readahead_cache_lookup_, index_iter_ has moved forward. So we
    // need to reseek the index_iter_ to point to current block by using
    // block_iter_'s key.
    if (Valid()) {
      ResetBlockCacheLookupVar();
      direction_ = IterDirection::kBackward;
      Slice last_key = key();

      index_iter_->Seek(last_key);
      is_index_at_curr_block_ = true;

      // Check for IO error.
      if (!index_iter_->Valid()) {
        ResetDataIter();
        return;
      }
    }

    if (!Valid()) {
      ResetDataIter();
      return;
    }
  }

  ResetBlockCacheLookupVar();
  if (is_at_first_key_from_index_) {
    is_at_first_key_from_index_ = false;

    index_iter_->Prev();
    if (!index_iter_->Valid()) {
      return;
    }

    InitDataBlock();
    block_iter_.SeekToLast();
  } else {
    assert(block_iter_points_to_real_block_);
    block_iter_.Prev();
  }

  FindKeyBackward();
}

void BlockBasedTableIterator::InitDataBlock() {
  BlockHandle data_block_handle;
  bool is_in_cache = false;
  bool use_block_cache_for_lookup = true;

  if (DoesContainBlockHandles()) {
    data_block_handle = block_handles_.front().handle_;
    is_in_cache = block_handles_.front().is_cache_hit_;
    use_block_cache_for_lookup = false;
  } else {
    data_block_handle = index_iter_->value().handle;
  }

  if (!block_iter_points_to_real_block_ ||
      data_block_handle.offset() != prev_block_offset_ ||
      // if previous attempt of reading the block missed cache, try again
      block_iter_.status().IsIncomplete()) {
    if (block_iter_points_to_real_block_) {
      ResetDataIter();
    }

    bool is_for_compaction =
        lookup_context_.caller == TableReaderCaller::kCompaction;

    // Initialize Data Block From CacheableEntry.
    if (is_in_cache) {
      Status s;
      block_iter_.Invalidate(Status::OK());
      table_->NewDataBlockIterator<DataBlockIter>(
          read_options_, (block_handles_.front().cachable_entry_).As<Block>(),
          &block_iter_, s);
    } else {
      auto* rep = table_->get_rep();

      std::function<void(bool, uint64_t&, uint64_t&)> readaheadsize_cb =
          nullptr;
      if (readahead_cache_lookup_) {
        readaheadsize_cb = std::bind(
            &BlockBasedTableIterator::BlockCacheLookupForReadAheadSize, this,
            std::placeholders::_1, std::placeholders::_2,
            std::placeholders::_3);
      }

      // Prefetch additional data for range scans (iterators).
      // Implicit auto readahead:
      //   Enabled after 2 sequential IOs when ReadOptions.readahead_size == 0.
      // Explicit user requested readahead:
      //   Enabled from the very first IO when ReadOptions.readahead_size is
      //   set.
      block_prefetcher_.PrefetchIfNeeded(
          rep, data_block_handle, read_options_.readahead_size,
          is_for_compaction,
          /*no_sequential_checking=*/false, read_options_, readaheadsize_cb);

      Status s;
      table_->NewDataBlockIterator<DataBlockIter>(
          read_options_, data_block_handle, &block_iter_, BlockType::kData,
          /*get_context=*/nullptr, &lookup_context_,
          block_prefetcher_.prefetch_buffer(),
          /*for_compaction=*/is_for_compaction, /*async_read=*/false, s,
          use_block_cache_for_lookup);
    }
    block_iter_points_to_real_block_ = true;

|
|
|
CheckDataBlockWithinUpperBound();
|
Much better stats for seeks and prefix filtering (#11460)
Summary:
We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
This change does several things:
* Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
* Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
* We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
* For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
* The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
* The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
* Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
Test Plan:
unit tests updated, including updating many to pop the stat value since last read to improve test
readability and maintainability.
Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
```
And run simultaneous before&after with
```
TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
```
Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec; 18.4 (± 0.0) MB/sec
After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec; 19.1 (± 0.0) MB/sec
Reviewed By: ajkr
Differential Revision: D46029177
Pulled By: pdillinger
fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
2023-05-19 22:25:49 +00:00
    if (!is_for_compaction &&
        (seek_stat_state_ & kDataBlockReadSinceLastSeek) == 0) {
      RecordTick(table_->GetStatistics(), is_last_level_
                                              ? LAST_LEVEL_SEEK_DATA
                                              : NON_LAST_LEVEL_SEEK_DATA);
      seek_stat_state_ = static_cast<SeekStatState>(
          seek_stat_state_ | kDataBlockReadSinceLastSeek | kReportOnUseful);
    }
  }
}

void BlockBasedTableIterator::AsyncInitDataBlock(bool is_first_pass) {
  BlockHandle data_block_handle;
  bool is_for_compaction =
      lookup_context_.caller == TableReaderCaller::kCompaction;
  if (is_first_pass) {
    data_block_handle = index_iter_->value().handle;
    if (!block_iter_points_to_real_block_ ||
        data_block_handle.offset() != prev_block_offset_ ||
        // if previous attempt of reading the block missed cache, try again
        block_iter_.status().IsIncomplete()) {
      if (block_iter_points_to_real_block_) {
        ResetDataIter();
      }
      auto* rep = table_->get_rep();

      std::function<void(bool, uint64_t&, uint64_t&)> readaheadsize_cb =
          nullptr;
      if (readahead_cache_lookup_) {
        readaheadsize_cb = std::bind(
            &BlockBasedTableIterator::BlockCacheLookupForReadAheadSize, this,
            std::placeholders::_1, std::placeholders::_2,
            std::placeholders::_3);
      }

      // Prefetch additional data for range scans (iterators).
      // Implicit auto readahead:
      //   Enabled after 2 sequential IOs when ReadOptions.readahead_size == 0.
      // Explicit user requested readahead:
      //   Enabled from the very first IO when ReadOptions.readahead_size is
      //   set.
      // In case of async_io with implicit readahead, block_prefetcher_ will
      // always create the prefetch buffer by setting no_sequential_checking
      // = true.
      block_prefetcher_.PrefetchIfNeeded(
          rep, data_block_handle, read_options_.readahead_size,
          is_for_compaction, /*no_sequential_checking=*/read_options_.async_io,
          read_options_, readaheadsize_cb);

      Status s;
      table_->NewDataBlockIterator<DataBlockIter>(
          read_options_, data_block_handle, &block_iter_, BlockType::kData,
          /*get_context=*/nullptr, &lookup_context_,
          block_prefetcher_.prefetch_buffer(),
          /*for_compaction=*/is_for_compaction, /*async_read=*/true, s,
          /*use_block_cache_for_lookup=*/true);

      if (s.IsTryAgain()) {
        async_read_in_progress_ = true;
        return;
      }
    }
  } else {
    // Second pass will call the Poll to get the data block which has been
    // requested asynchronously.
    bool is_in_cache = false;

    if (DoesContainBlockHandles()) {
      data_block_handle = block_handles_.front().handle_;
      is_in_cache = block_handles_.front().is_cache_hit_;
    } else {
      data_block_handle = index_iter_->value().handle;
    }

    Status s;
    // Initialize Data Block From CacheableEntry.
    if (is_in_cache) {
      block_iter_.Invalidate(Status::OK());
      table_->NewDataBlockIterator<DataBlockIter>(
          read_options_, (block_handles_.front().cachable_entry_).As<Block>(),
          &block_iter_, s);
    } else {
      table_->NewDataBlockIterator<DataBlockIter>(
          read_options_, data_block_handle, &block_iter_, BlockType::kData,
          /*get_context=*/nullptr, &lookup_context_,
          block_prefetcher_.prefetch_buffer(),
          /*for_compaction=*/is_for_compaction, /*async_read=*/false, s,
          /*use_block_cache_for_lookup=*/false);
    }
  }
  block_iter_points_to_real_block_ = true;
  CheckDataBlockWithinUpperBound();

  if (!is_for_compaction &&
      (seek_stat_state_ & kDataBlockReadSinceLastSeek) == 0) {
    RecordTick(table_->GetStatistics(), is_last_level_
                                            ? LAST_LEVEL_SEEK_DATA
                                            : NON_LAST_LEVEL_SEEK_DATA);
    seek_stat_state_ = static_cast<SeekStatState>(
        seek_stat_state_ | kDataBlockReadSinceLastSeek | kReportOnUseful);
  }
  async_read_in_progress_ = false;
}

bool BlockBasedTableIterator::MaterializeCurrentBlock() {
  assert(is_at_first_key_from_index_);
  assert(!block_iter_points_to_real_block_);
  assert(index_iter_->Valid());

  is_at_first_key_from_index_ = false;
  InitDataBlock();
  assert(block_iter_points_to_real_block_);
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-16 00:37:23 +00:00

  if (!block_iter_.status().ok()) {
    return false;
  }

  block_iter_.SeekToFirst();

  // MaterializeCurrentBlock is called when the block is actually read by
  // calling InitDataBlock. is_at_first_key_from_index_ will be false for block
  // handles placed in block_handles_, so index_iter_ will be pointing to the
  // current block. After InitDataBlock, index_iter_ can point to a different
  // block if BlockCacheLookupForReadAheadSize is called.
  Slice first_internal_key;
  if (DoesContainBlockHandles()) {
    first_internal_key = block_handles_.front().first_internal_key_;
  } else {
    first_internal_key = index_iter_->value().first_internal_key;
  }

  if (!block_iter_.Valid() ||
      icomp_.Compare(block_iter_.key(), first_internal_key) != 0) {
    block_iter_.Invalidate(Status::Corruption(
        "first key in index doesn't match first key in block"));
    return false;
  }
  return true;
}

void BlockBasedTableIterator::FindKeyForward() {
  // This method's code is kept short to make it likely to be inlined.
  assert(!is_out_of_bound_);
  assert(block_iter_points_to_real_block_);

  if (!block_iter_.Valid()) {
    // This is the only call site of FindBlockForward(), but it's extracted into
    // a separate method to keep FindKeyForward() short and likely to be
    // inlined. When transitioning to a different block, we call
    // FindBlockForward(), which is much longer and is probably not inlined.
    FindBlockForward();
  } else {
    // This is the fast path that avoids a function call.
  }
}

void BlockBasedTableIterator::FindBlockForward() {
|
|
|
|
// TODO the while loop inherits from two-level-iterator. We don't know
|
|
|
|
// whether a block can be empty so it can be replaced by an "if".
|
|
|
|
do {
|
|
|
|
if (!block_iter_.status().ok()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
// Whether next data block is out of upper bound, if there is one.
|
2023-09-23 01:12:08 +00:00
|
|
|
// index_iter_ can point to different block in case of
|
|
|
|
// readahead_cache_lookup_. readahead_cache_lookup_ will be handle the
|
|
|
|
// upper_bound check.
|
2023-10-03 00:47:24 +00:00
|
|
|
bool next_block_is_out_of_bound =
|
2023-09-23 01:12:08 +00:00
|
|
|
IsIndexAtCurr() && read_options_.iterate_upper_bound != nullptr &&
|
2020-08-04 18:28:02 +00:00
|
|
|
block_iter_points_to_real_block_ &&
|
|
|
|
block_upper_bound_check_ == BlockUpperBound::kUpperBoundInCurBlock;
|
2023-09-23 01:12:08 +00:00
|
|
|
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 19:17:34 +00:00
|
|
|
assert(!next_block_is_out_of_bound ||
|
|
|
|
user_comparator_.CompareWithoutTimestamp(
|
|
|
|
*read_options_.iterate_upper_bound, /*a_has_ts=*/false,
|
|
|
|
index_iter_->user_key(), /*b_has_ts=*/true) <= 0);
|
2023-09-23 01:12:08 +00:00
|
|
|
|
De-template block based table iterator (#6531)
Summary:
Right now block based table iterator is used as both of iterating data for block based table, and for the index iterator for partitioend index. This was initially convenient for introducing a new iterator and block type for new index format, while reducing code change. However, these two usage doesn't go with each other very well. For example, Prev() is never called for partitioned index iterator, and some other complexity is maintained in block based iterators, which is not needed for index iterator but maintainers will always need to reason about it. Furthermore, the template usage is not following Google C++ Style which we are following, and makes a large chunk of code tangled together. This commit separate the two iterators. Right now, here is what it is done:
1. Copy the block based iterator code into partitioned index iterator, and de-template them.
2. Remove some code not needed for partitioned index. The upper bound check and tricks are removed. We never tested performance for those tricks when partitioned index is enabled in the first place. It's unlikelyl to generate performance regression, as creating new partitioned index block is much rarer than data blocks.
3. Separate out the prefetch logic to a helper class and both classes call them.
This commit will enable future follow-ups. One direction is that we might separate index iterator interface for data blocks and index blocks, as they are quite different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6531
Test Plan: build using make and cmake. And build release
Differential Revision: D20473108
fbshipit-source-id: e48011783b339a4257c204cc07507b171b834b0f
2020-03-16 19:17:34 +00:00
|
|
|
ResetDataIter();
|
|
|
|
|
2023-09-23 01:12:08 +00:00
|
|
|
if (DoesContainBlockHandles()) {
|
|
|
|
// Advance and point to that next Block handle to make that block handle
|
|
|
|
// current.
|
|
|
|
block_handles_.pop_front();
|
2020-03-16 19:17:34 +00:00
|
|
|
}
|
|
|
|
|
2023-09-23 01:12:08 +00:00
|
|
|
if (!DoesContainBlockHandles()) {
|
|
|
|
// For readahead_cache_lookup_ enabled scenario -
|
|
|
|
// 1. In case of Seek, block_handle will be empty and it should proceed
|
|
|
|
// as usual doing index_iter_->Next().
|
|
|
|
// 2. If block_handles is empty and index is not at current because of
|
|
|
|
// lookup (during Next), it should skip doing index_iter_->Next(), as
|
|
|
|
// it's already pointing to next block;
|
2023-10-03 00:47:24 +00:00
|
|
|
// 3. Last block could be out of bound and it won't iterate over that
|
|
|
|
// during BlockCacheLookup. We need to set for that block here.
|
|
|
|
if (IsIndexAtCurr() || is_index_out_of_bound_) {
|
2023-09-23 01:12:08 +00:00
|
|
|
index_iter_->Next();
|
2023-10-03 00:47:24 +00:00
|
|
|
if (is_index_out_of_bound_) {
|
|
|
|
next_block_is_out_of_bound = is_index_out_of_bound_;
|
|
|
|
is_index_out_of_bound_ = false;
|
|
|
|
}
|
2023-09-23 01:12:08 +00:00
|
|
|
} else {
|
|
|
|
// Skip Next as index_iter_ already points to correct index when it
|
|
|
|
// iterates in BlockCacheLookupForReadAheadSize.
|
|
|
|
is_index_at_curr_block_ = true;
|
|
|
|
}
|
2020-03-16 19:17:34 +00:00
|
|
|
|
2023-09-23 01:12:08 +00:00
|
|
|
if (next_block_is_out_of_bound) {
|
|
|
|
// The next block is out of bound. No need to read it.
|
|
|
|
TEST_SYNC_POINT_CALLBACK("BlockBasedTableIterator:out_of_bound",
|
|
|
|
nullptr);
|
|
|
|
// We need to make sure this is not the last data block before setting
|
|
|
|
// is_out_of_bound_, since the index key for the last data block can be
|
|
|
|
// larger than smallest key of the next file on the same level.
|
|
|
|
if (index_iter_->Valid()) {
|
|
|
|
is_out_of_bound_ = true;
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
2020-03-16 19:17:34 +00:00
|
|
|
|
2023-09-23 01:12:08 +00:00
|
|
|
if (!index_iter_->Valid()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
IndexValue v = index_iter_->value();
|
|
|
|
|
|
|
|
if (!v.first_internal_key.empty() && allow_unprepared_value_) {
|
|
|
|
// Index contains the first key of the block. Defer reading the block.
|
|
|
|
is_at_first_key_from_index_ = true;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
2020-03-16 19:17:34 +00:00
|
|
|
InitDataBlock();
|
|
|
|
block_iter_.SeekToFirst();
|
|
|
|
} while (!block_iter_.Valid());
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::FindKeyBackward() {
|
|
|
|
while (!block_iter_.Valid()) {
|
|
|
|
if (!block_iter_.status().ok()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ResetDataIter();
|
|
|
|
index_iter_->Prev();
|
|
|
|
|
|
|
|
if (index_iter_->Valid()) {
|
|
|
|
InitDataBlock();
|
|
|
|
block_iter_.SeekToLast();
|
|
|
|
} else {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// We could have checked the lower bound here too, but we opt not to do it for
|
|
|
|
// code simplicity.
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::CheckOutOfBound() {
|
2023-10-03 00:47:24 +00:00
|
|
|
if (read_options_.iterate_upper_bound != nullptr &&
|
2020-08-04 18:28:02 +00:00
|
|
|
block_upper_bound_check_ != BlockUpperBound::kUpperBoundBeyondCurBlock &&
|
|
|
|
Valid()) {
|
2020-03-16 19:17:34 +00:00
|
|
|
is_out_of_bound_ =
|
|
|
|
user_comparator_.CompareWithoutTimestamp(
|
|
|
|
*read_options_.iterate_upper_bound, /*a_has_ts=*/false, user_key(),
|
|
|
|
/*b_has_ts=*/true) <= 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableIterator::CheckDataBlockWithinUpperBound() {
|
2023-09-23 01:12:08 +00:00
|
|
|
if (IsIndexAtCurr() && read_options_.iterate_upper_bound != nullptr &&
|
2020-03-16 19:17:34 +00:00
|
|
|
block_iter_points_to_real_block_) {
|
2020-08-04 18:28:02 +00:00
|
|
|
block_upper_bound_check_ = (user_comparator_.CompareWithoutTimestamp(
|
|
|
|
*read_options_.iterate_upper_bound,
|
|
|
|
/*a_has_ts=*/false, index_iter_->user_key(),
|
|
|
|
/*b_has_ts=*/true) > 0)
|
|
|
|
? BlockUpperBound::kUpperBoundBeyondCurBlock
|
|
|
|
: BlockUpperBound::kUpperBoundInCurBlock;
|
2020-03-16 19:17:34 +00:00
|
|
|
}
|
|
|
|
}
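The two upper-bound checks above can be illustrated with a minimal sketch. It assumes plain `std::string` keys compared bytewise instead of the user comparator with timestamps, and the helper names `ClassifyBlock` and `IsKeyOutOfBound` are hypothetical, not part of the source. The idea it tries to show: the block is classified once against the exclusive upper bound using its index key, and the per-key comparison is skipped when the whole block is known to lie below the bound.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for BlockUpperBound (the real enum may have more states).
enum class BlockUpperBound {
  kUpperBoundInCurBlock,
  kUpperBoundBeyondCurBlock,
};

// index_key: the index entry's key, assumed to be >= every key in the block.
// upper_bound: exclusive bound, as with iterate_upper_bound.
inline BlockUpperBound ClassifyBlock(const std::string& upper_bound,
                                     const std::string& index_key) {
  // If the bound is strictly greater than the index key, every key in the
  // block is below the bound.
  return upper_bound.compare(index_key) > 0
             ? BlockUpperBound::kUpperBoundBeyondCurBlock
             : BlockUpperBound::kUpperBoundInCurBlock;
}

// Per-key check, mirroring the shape of CheckOutOfBound().
inline bool IsKeyOutOfBound(BlockUpperBound block_state,
                            const std::string& upper_bound,
                            const std::string& user_key) {
  if (block_state == BlockUpperBound::kUpperBoundBeyondCurBlock) {
    return false;  // Whole block is within bound; no comparison needed.
  }
  // The bound is exclusive, so a key equal to the bound is out of bound.
  return upper_bound.compare(user_key) <= 0;
}
```

With bound `"m"`, a block whose index key is `"k"` never needs a per-key comparison, while a block whose index key is `"p"` must compare every key it returns.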
|
2023-08-18 22:52:04 +00:00
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
void BlockBasedTableIterator::InitializeStartAndEndOffsets(
|
|
|
|
bool read_curr_block, bool& found_first_miss_block,
|
|
|
|
uint64_t& start_updated_offset, uint64_t& end_updated_offset,
|
|
|
|
size_t& prev_handles_size) {
|
|
|
|
prev_handles_size = block_handles_.size();
|
|
|
|
size_t footer = table_->get_rep()->footer.GetBlockTrailerSize();
|
2023-09-23 01:12:08 +00:00
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
// Initialize start and end offsets to begin with, per the following
|
|
|
|
// scenarios
|
|
|
|
if (read_curr_block) {
|
|
|
|
if (!DoesContainBlockHandles()) {
|
|
|
|
// Scenario 1 : read_curr_block (callback made on miss block which caller
|
|
|
|
// was reading) and it has no existing handles in queue. i.e.
|
|
|
|
// index_iter_ is pointing to block that is being read by
|
|
|
|
// caller.
|
|
|
|
//
|
|
|
|
// Add current block here as it doesn't need any lookup.
|
|
|
|
BlockHandleInfo block_handle_info;
|
|
|
|
block_handle_info.handle_ = index_iter_->value().handle;
|
|
|
|
block_handle_info.SetFirstInternalKey(
|
|
|
|
index_iter_->value().first_internal_key);
|
|
|
|
|
|
|
|
end_updated_offset = block_handle_info.handle_.offset() + footer +
|
|
|
|
block_handle_info.handle_.size();
|
|
|
|
block_handles_.emplace_back(std::move(block_handle_info));
|
|
|
|
|
|
|
|
index_iter_->Next();
|
|
|
|
is_index_at_curr_block_ = false;
|
|
|
|
found_first_miss_block = true;
|
|
|
|
} else {
|
|
|
|
// Scenario 2 : read_curr_block (callback made on miss block which caller
|
|
|
|
// was reading) but the queue already has some handles.
|
|
|
|
//
|
|
|
|
// It can be due to reading error in second buffer in FilePrefetchBuffer.
|
|
|
|
// BlockHandles already added to the queue but there was error in fetching
|
|
|
|
// those data blocks. So in this call they need to be read again.
|
|
|
|
assert(block_handles_.front().is_cache_hit_ == false);
|
|
|
|
found_first_miss_block = true;
|
|
|
|
// Initialize prev_handles_size to 0 as all those handles need to be read
|
|
|
|
// again.
|
|
|
|
prev_handles_size = 0;
|
|
|
|
start_updated_offset = block_handles_.front().handle_.offset();
|
|
|
|
end_updated_offset = block_handles_.back().handle_.offset() + footer +
|
|
|
|
block_handles_.back().handle_.size();
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// Scenario 3 : read_curr_block is false (callback made to do additional
|
|
|
|
// prefetching in buffers) and the queue already has some
|
|
|
|
// handles from first buffer.
|
|
|
|
if (DoesContainBlockHandles()) {
|
|
|
|
start_updated_offset = block_handles_.back().handle_.offset() + footer +
|
|
|
|
block_handles_.back().handle_.size();
|
|
|
|
end_updated_offset = start_updated_offset;
|
|
|
|
} else {
|
|
|
|
// Scenario 4 : read_curr_block is false (callback made to do additional
|
|
|
|
// prefetching in buffers) but the queue has no handle
|
|
|
|
// from first buffer.
|
|
|
|
//
|
|
|
|
// It can be when Reseek is from block cache (which doesn't clear the
|
|
|
|
// buffers in FilePrefetchBuffer but clears block handles from queue) and
|
|
|
|
// reseek also lies within the buffer. So Next will get data from
|
|
|
|
// existing buffers until this callback is made to prefetch additional
|
|
|
|
// data. All handles need to be added to the queue starting from
|
|
|
|
// index_iter_.
|
|
|
|
assert(index_iter_->Valid());
|
|
|
|
start_updated_offset = index_iter_->value().handle.offset();
|
|
|
|
end_updated_offset = start_updated_offset;
|
|
|
|
}
|
2023-09-23 01:12:08 +00:00
|
|
|
}
|
2023-12-06 21:48:15 +00:00
|
|
|
}
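The four scenarios handled by InitializeStartAndEndOffsets can be summarized in a small standalone sketch. It is a simplification under stated assumptions: `Handle`, `Range`, and `InitialRange` are hypothetical names, the queue holds bare handles, and the side effects of the real function (pushing the current block onto the queue, advancing `index_iter_`, resetting `prev_handles_size`) are left out. Only the starting values of the two offsets are modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct Handle {  // Hypothetical stand-in for BlockHandle.
  uint64_t offset;
  uint64_t size;
};

struct Range {
  uint64_t start;
  uint64_t end;
};

// queue: handles already collected by earlier lookups.
// index_block: the block the index iterator currently points to.
// trailer: per-block trailer size that follows each block on disk.
inline Range InitialRange(bool read_curr_block,
                          const std::deque<Handle>& queue,
                          const Handle& index_block, uint64_t trailer,
                          uint64_t requested_start) {
  auto end_of = [trailer](const Handle& h) {
    return h.offset + h.size + trailer;
  };
  if (read_curr_block) {
    if (queue.empty()) {
      // Scenario 1: the current block is the miss being read.
      return {requested_start, end_of(index_block)};
    }
    // Scenario 2: queued handles must be read again from the front.
    return {queue.front().offset, end_of(queue.back())};
  }
  if (!queue.empty()) {
    // Scenario 3: continue right after the last queued block.
    const uint64_t e = end_of(queue.back());
    return {e, e};
  }
  // Scenario 4: nothing queued; start at the index iterator's block.
  return {index_block.offset, index_block.offset};
}
```

In scenarios 3 and 4 the range starts empty (`start == end`) and only grows as the subsequent cache lookup loop adds blocks.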
|
2023-09-23 01:12:08 +00:00
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
// BlockCacheLookupForReadAheadSize API looks up the block cache and tries to
|
|
|
|
// reduce the start and end offset passed.
|
|
|
|
//
|
|
|
|
// Implementation -
|
|
|
|
// This function looks into the block cache for the blocks between start_offset
|
|
|
|
// and end_offset and adds all the handles to the queue.
|
|
|
|
// It then iterates from the end to find the first miss block and updates the end
|
|
|
|
// offset to that block.
|
|
|
|
// It also iterates from the start to find the first miss block and updates the
|
|
|
|
// start offset to that block.
|
|
|
|
//
|
|
|
|
// Arguments -
|
|
|
|
// start_offset : Offset from which the caller wants to read.
|
|
|
|
// end_offset : End offset till which the caller wants to read.
|
|
|
|
// read_curr_block : True if this call was due to miss in the cache and
|
|
|
|
// caller wants to read that block.
|
|
|
|
// False if current call is to prefetch additional data in
|
|
|
|
// extra buffers.
|
|
|
|
void BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(
|
|
|
|
bool read_curr_block, uint64_t& start_offset, uint64_t& end_offset) {
|
|
|
|
uint64_t start_updated_offset = start_offset;
|
2023-09-23 01:12:08 +00:00
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
// readahead_cache_lookup_ can be set false, if after Seek and Next
|
|
|
|
// there is SeekForPrev or any other backward operation.
|
|
|
|
if (!readahead_cache_lookup_) {
|
2023-09-23 01:12:08 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
size_t footer = table_->get_rep()->footer.GetBlockTrailerSize();
|
|
|
|
if (read_curr_block && !DoesContainBlockHandles() &&
|
|
|
|
IsNextBlockOutOfBound()) {
|
|
|
|
end_offset = index_iter_->value().handle.offset() + footer +
|
|
|
|
index_iter_->value().handle.size();
|
2023-10-03 00:47:24 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
uint64_t end_updated_offset = start_updated_offset;
|
|
|
|
bool found_first_miss_block = false;
|
|
|
|
size_t prev_handles_size;
|
2023-09-23 01:12:08 +00:00
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
// Initialize start and end offsets based on existing handles in the queue
|
|
|
|
// and read_curr_block argument passed.
|
|
|
|
InitializeStartAndEndOffsets(read_curr_block, found_first_miss_block,
|
|
|
|
start_updated_offset, end_updated_offset,
|
|
|
|
prev_handles_size);
|
2023-09-23 01:12:08 +00:00
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
while (index_iter_->Valid() && !is_index_out_of_bound_) {
|
2023-09-23 01:12:08 +00:00
|
|
|
BlockHandle block_handle = index_iter_->value().handle;
|
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
// Adding this data block exceeds end offset. So this data
|
2023-09-23 01:12:08 +00:00
|
|
|
// block won't be added.
|
2023-12-06 21:48:15 +00:00
|
|
|
// There can be a case where passed end offset is smaller than
|
|
|
|
// block_handle.size() + footer because of readahead_size truncated to
|
|
|
|
// upper_bound. So we prefer to read the block rather than skip it to avoid
|
|
|
|
// sync read calls in case of async_io.
|
|
|
|
if (start_updated_offset != end_updated_offset &&
|
|
|
|
(end_updated_offset + block_handle.size() + footer > end_offset)) {
|
2023-09-23 01:12:08 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
// For current data block, do the lookup in the cache. Lookup should pin the
|
2023-12-06 21:48:15 +00:00
|
|
|
// data block in cache.
|
2023-09-23 01:12:08 +00:00
|
|
|
BlockHandleInfo block_handle_info;
|
2023-10-17 19:21:08 +00:00
|
|
|
block_handle_info.handle_ = index_iter_->value().handle;
|
|
|
|
block_handle_info.SetFirstInternalKey(
|
|
|
|
index_iter_->value().first_internal_key);
|
2023-12-06 21:48:15 +00:00
|
|
|
end_updated_offset += footer + block_handle_info.handle_.size();
|
2023-09-23 01:12:08 +00:00
|
|
|
|
|
|
|
Status s = table_->LookupAndPinBlocksInCache<Block_kData>(
|
|
|
|
read_options_, block_handle,
|
|
|
|
&(block_handle_info.cachable_entry_).As<Block_kData>());
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
2023-10-03 00:47:24 +00:00
|
|
|
|
2023-09-23 01:12:08 +00:00
|
|
|
block_handle_info.is_cache_hit_ =
|
|
|
|
(block_handle_info.cachable_entry_.GetValue() ||
|
|
|
|
block_handle_info.cachable_entry_.GetCacheHandle());
|
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
// If this is the first miss block, update start offset to this block.
|
|
|
|
if (!found_first_miss_block && !block_handle_info.is_cache_hit_) {
|
|
|
|
found_first_miss_block = true;
|
|
|
|
start_updated_offset = block_handle_info.handle_.offset();
|
|
|
|
}
|
|
|
|
|
2023-09-23 01:12:08 +00:00
|
|
|
// Add the handle to the queue.
|
|
|
|
block_handles_.emplace_back(std::move(block_handle_info));
|
|
|
|
|
|
|
|
// Can't figure out for current block if current block
|
|
|
|
// is out of bound. But for next block we can find that.
|
|
|
|
// If curr block's index key >= iterate_upper_bound, it
|
|
|
|
// means all the keys in next block or above are out of
|
|
|
|
// bound.
|
|
|
|
if (IsNextBlockOutOfBound()) {
|
2023-10-03 00:47:24 +00:00
|
|
|
is_index_out_of_bound_ = true;
|
2023-09-23 01:12:08 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
index_iter_->Next();
|
2023-12-06 21:48:15 +00:00
|
|
|
is_index_at_curr_block_ = false;
|
2023-09-23 01:12:08 +00:00
|
|
|
}
|
|
|
|
|
2023-12-06 21:48:15 +00:00
|
|
|
if (found_first_miss_block) {
|
|
|
|
// Iterate cache hit block handles from the end till a Miss is there, to
|
|
|
|
// truncate and update the end offset till that Miss.
|
|
|
|
auto it = block_handles_.rbegin();
|
|
|
|
auto it_end =
|
|
|
|
block_handles_.rbegin() + (block_handles_.size() - prev_handles_size);
|
|
|
|
|
|
|
|
while (it != it_end && (*it).is_cache_hit_) {
|
|
|
|
it++;
|
|
|
|
}
|
|
|
|
end_updated_offset = (*it).handle_.offset() + footer + (*it).handle_.size();
|
|
|
|
} else {
|
|
|
|
// Nothing to read. Can be because of IOError in index_iter_->Next() or
|
|
|
|
// reached upper_bound.
|
|
|
|
end_updated_offset = start_updated_offset;
|
2023-09-23 01:12:08 +00:00
|
|
|
}
|
2023-12-06 21:48:15 +00:00
|
|
|
|
|
|
|
end_offset = end_updated_offset;
|
|
|
|
start_offset = start_updated_offset;
|
2023-09-23 01:12:08 +00:00
|
|
|
ResetPreviousBlockOffset();
|
|
|
|
}
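The trimming idea behind BlockCacheLookupForReadAheadSize can be sketched in isolation: given the blocks in a requested range and whether each one is already cached, the range actually worth reading runs from the first cache miss to the end of the last cache miss. The sketch below is an illustration under assumptions, not the source's implementation: `BlockInfo` and `TrimToMisses` are hypothetical names, cache hits are given as a flag rather than looked up, and the queue bookkeeping is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct BlockInfo {  // Hypothetical, simplified stand-in for BlockHandleInfo.
  uint64_t offset;
  uint64_t size;
  bool is_cache_hit;
};

// Returns {start, end} of the byte range that still has to be read.
// trailer_size: per-block trailer that follows each block on disk.
// start_offset: returned unchanged for both ends when every block is cached.
inline std::pair<uint64_t, uint64_t> TrimToMisses(
    const std::vector<BlockInfo>& blocks, uint64_t trailer_size,
    uint64_t start_offset) {
  // Skip leading cache hits to find the first block that must be read.
  auto first_miss = blocks.begin();
  while (first_miss != blocks.end() && first_miss->is_cache_hit) {
    ++first_miss;
  }
  if (first_miss == blocks.end()) {
    return {start_offset, start_offset};  // Nothing to read.
  }
  // Skip trailing cache hits. This terminates because a miss exists.
  auto last_miss = blocks.rbegin();
  while (last_miss->is_cache_hit) {
    ++last_miss;
  }
  return {first_miss->offset,
          last_miss->offset + last_miss->size + trailer_size};
}
```

Cached blocks between two misses stay inside the range, since the read is a single contiguous span; only hits at either edge shrink it.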
|
|
|
|
|
2020-03-16 19:17:34 +00:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|