Summary:
Fix a use-after-move issue in block.cc and add some unit tests.
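For illustration only (this is not the actual block.cc code), a minimal sketch of the bug class a use-after-move fix addresses: once a value has been moved from, its contents must not be read again.
```
#include <string>
#include <utility>
#include <vector>

// Hypothetical example: `contents` is moved into the vector, so reading it
// afterwards would be a use-after-move. The fix is to capture whatever is
// still needed before the std::move.
void AppendBlock(std::vector<std::string>* blocks, std::string contents) {
  const size_t size_before_move = contents.size();  // capture first
  blocks->push_back(std::move(contents));
  // Using `contents.size()` here would read a moved-from object; use the
  // value captured above instead.
  (void)size_before_move;
}
```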
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11505
Test Plan:
```
make all check
./block_test
```
Reviewed By: ltamasi
Differential Revision: D46506188
Pulled By: jowlyzhang
fbshipit-source-id: 316ed8ddd221c00b2bce2cf9fd47eea686cd74a5
Summary:
Add support to strip timestamps in the block based table builder and pad timestamps in the block based table reader.
On the write path, use the per-column-family option `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` to indicate whether user-defined timestamps should be stripped for all block based tables created for the column family.
On the read path, add a per-table `TableReadOption.user_defined_timestamps_persisted` to flag whether the user keys in the table contain user-defined timestamps.
This patch mostly passes the related flags down to the block building/parsing level, with the exception of handling the `first_internal_key` in `IndexValue`, which is handled at the `IndexBuilder` level. The value part of range deletion entries should receive similar handling; I haven't decided where this piece of logic best fits, so I will do it in a follow-up.
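A minimal, self-contained sketch of the strip/pad idea (not the actual builder/reader code), assuming the RocksDB convention that a user-defined timestamp is a fixed-size suffix of the user key and that the minimum timestamp is all-zero bytes:
```
#include <cassert>
#include <string>

// Write path: drop the last ts_sz bytes of the user key before writing.
std::string StripTimestamp(const std::string& user_key, size_t ts_sz) {
  assert(user_key.size() >= ts_sz);
  return user_key.substr(0, user_key.size() - ts_sz);
}

// Read path: append a minimum timestamp of the same size so callers that
// expect timestamped keys still see well-formed keys.
std::string PadMinTimestamp(const std::string& stripped_key, size_t ts_sz) {
  return stripped_key + std::string(ts_sz, '\0');
}
```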
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11495
Test Plan:
Existing test `BlockBasedTableReaderTest` is parameterized to run with:
1) different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp
2) all four index types; when the index type is `kTwoLevelIndexSearch`, partitioned filters are also enabled
3) parallel vs non-parallel compression
4) enable/disable compression dictionary.
Also added tests for API `BlockBasedTableReader::NewIterator`.
`PartitionedFilterBlockTest` is parameterized to run with different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp.
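For reference, a hedged sketch (with hypothetical fixture and parameter names, not the actual test code) of how such parameterization over UDT modes and index types typically looks with GoogleTest:
```
#include <tuple>

#include <gtest/gtest.h>

// Hypothetical fixture and parameter names for illustration only.
enum class UserDefinedTimestampTestMode { kNone, kNormal, kStripUserDefinedTimestamp };

class TableReaderUdtTest : public testing::TestWithParam<
                               std::tuple<UserDefinedTimestampTestMode, int>> {};

TEST_P(TableReaderUdtTest, ReadBack) {
  const UserDefinedTimestampTestMode udt_mode = std::get<0>(GetParam());
  const int index_type = std::get<1>(GetParam());
  // ... build a table with these options and verify reads ...
  (void)udt_mode;
  (void)index_type;
}

INSTANTIATE_TEST_SUITE_P(
    UdtAndIndexType, TableReaderUdtTest,
    testing::Combine(
        testing::Values(UserDefinedTimestampTestMode::kNone,
                        UserDefinedTimestampTestMode::kNormal,
                        UserDefinedTimestampTestMode::kStripUserDefinedTimestamp),
        testing::Values(0, 1, 2, 3)));  // stand-in for the four index types
```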
```
make all check
./block_based_table_reader_test
./partitioned_filter_block_test
```
Reviewed By: ltamasi
Differential Revision: D46344577
Pulled By: jowlyzhang
fbshipit-source-id: 93ac8542b19319d1298712b8bed908c8831ba675
Summary:
This patch adds support in `BlockBuilder` to strip the user-defined timestamp from the `key` added via `Add(key, value)` and its equivalent APIs. The stripping logic differs depending on whether the key is a user key or an internal key, so the `BlockBuilder` is created with a flag to indicate which. This patch also adds support on the read path: the `NewIndexIterator` and `NewDataIterator` APIs can now pad a min timestamp.
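A rough, standalone sketch of why the flag matters (not the actual BlockBuilder code), assuming the usual internal-key layout where an 8-byte packed sequence-number/type trailer follows the user key:
```
#include <cassert>
#include <string>

constexpr size_t kNumInternalBytes = 8;  // packed (seqno << 8) | type trailer

// For a plain user key, the timestamp is simply the last ts_sz bytes.
std::string StripTimestampFromUserKey(const std::string& key, size_t ts_sz) {
  assert(key.size() >= ts_sz);
  return key.substr(0, key.size() - ts_sz);
}

// For an internal key, the timestamp sits just before the 8-byte trailer,
// so the trailer has to be preserved and re-attached.
std::string StripTimestampFromInternalKey(const std::string& key, size_t ts_sz) {
  assert(key.size() >= ts_sz + kNumInternalBytes);
  std::string result = key.substr(0, key.size() - kNumInternalBytes - ts_sz);
  result.append(key, key.size() - kNumInternalBytes, kNumInternalBytes);
  return result;
}
```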
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11472
Test Plan:
Three test modes are added to parameterize existing tests:
UserDefinedTimestampTestMode::kNone -> UDT feature is not enabled
UserDefinedTimestampTestMode::kNormal -> UDT feature enabled, write / read with min timestamp
UserDefinedTimestampTestMode::kStripUserDefinedTimestamps -> UDT feature enabled, write / read with min timestamp, and `persist_user_defined_timestamps` set to false where it applies.
The tests read/write with min timestamp so that point read and range scan can correctly read values in all three test modes.
`block_test` is parameterized to run with the above three test modes and some additional parameterization.
```
make all check
./block_test --gtest_filter="P/BlockTest*"
./block_test --gtest_filter="P/IndexBlockTest*"
```
Reviewed By: ltamasi
Differential Revision: D46200539
Pulled By: jowlyzhang
fbshipit-source-id: 59f5d6b584639976b69c2943eba723bd47d9b3c0
Summary:
Add the option `block_protection_bytes_per_key` and an implementation for block per key-value checksums. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates
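A simplified, hypothetical sketch of item 1 above (not the actual block.cc/h code, which uses a proper hash function): derive a per key-value checksum and keep only `block_protection_bytes_per_key` low-order bytes.
```
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>

// Hypothetical helpers for illustration; protection_bytes_per_key is expected
// to be 1, 2, 4, or 8, mirroring the option described above.
uint64_t HashEntry(const std::string& key, const std::string& value) {
  return std::hash<std::string>{}(key) * 31 + std::hash<std::string>{}(value);
}

void StoreEntryChecksum(const std::string& key, const std::string& value,
                        uint8_t protection_bytes_per_key, char* out) {
  const uint64_t h = HashEntry(key, value);
  std::memcpy(out, &h, protection_bytes_per_key);  // truncate to N bytes
}

bool VerifyEntryChecksum(const std::string& key, const std::string& value,
                         uint8_t protection_bytes_per_key, const char* stored) {
  const uint64_t h = HashEntry(key, value);
  return std::memcmp(stored, &h, protection_bytes_per_key) == 0;
}
```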
Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.
Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes an additional 24 bytes of checksum-related state besides the checksums themselves. For CPU, I set up a DB of size ~1.2GB with 5M keys (32-byte key and 200-byte value), which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache sizes (to mimic various cache hit rates):
```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
```
The readrandom ops/sec looks like the following:
| Block cache size | 2GB | 1.2GB * 0.9 | 1.2GB * 0.8 | 1.2GB * 0.5 | 8MB |
|---|---|---|---|---|---|
| Main | 240805 | 223604 | 198176 | 161653 | 139040 |
| PR prot_bytes=0 | 238691 | 226693 | 200127 | 161082 | 141153 |
| PR prot_bytes=1 | 214983 | 193199 | 178532 | 137013 | 108211 |
| prot_bytes=1 vs prot_bytes=0 | -10% | -15% | -10.8% | -15% | -23% |
The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
Reviewed By: ajkr
Differential Revision: D43970708
Pulled By: cbi42
fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.
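A hedged usage sketch of the write-side API (the path, key, and column names are made up; error handling is minimal, and the exact signatures should be checked against the release headers):
```
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

int main() {
  ROCKSDB_NAMESPACE::DB* db = nullptr;
  ROCKSDB_NAMESPACE::Options options;
  options.create_if_missing = true;
  auto s = ROCKSDB_NAMESPACE::DB::Open(options, "/tmp/wide_column_demo", &db);
  assert(s.ok());

  // An entity is a set of named columns attached to a single key.
  ROCKSDB_NAMESPACE::WideColumns columns{{"name", "alice"}, {"city", "nyc"}};
  s = db->PutEntity(ROCKSDB_NAMESPACE::WriteOptions(),
                    db->DefaultColumnFamily(), "user:1", columns);
  assert(s.ok());

  delete db;
  return 0;
}
```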
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D37369748
Pulled By: ltamasi
fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly to support VS 2013. We now only support VS 2017 and up, so this is no longer a problem. The code comment claims that MinGW still needs it, but we don't have a CI running MinGW, so it's hard to validate. Since we now require C++17, it's hard to imagine a MinGW that would still build RocksDB but not support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
This PR solves the problem discussed in https://github.com/facebook/rocksdb/issues/7149. By storing a pointer to the InternalKeyComparator as icmp_ in BlockIter, the object size remains the same, and calls to CompareCurrentKey no longer need to create Comparator objects: one can use icmp_ directly or the user comparator obtained from icmp_.
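A simplified sketch of the pattern (names are illustrative stand-ins, not the actual BlockIter members):
```
#include <string>

class ComparatorLike {
 public:
  virtual ~ComparatorLike() = default;
  virtual int Compare(const std::string& a, const std::string& b) const = 0;
};

class BlockIterLike {
 public:
  explicit BlockIterLike(const ComparatorLike* icmp) : icmp_(icmp) {}

  int CompareCurrentKey(const std::string& other) const {
    // No temporary comparator object is constructed per call; comparisons go
    // through the stored pointer, which keeps the iterator's size unchanged.
    return icmp_->Compare(current_key_, other);
  }

 private:
  const ComparatorLike* icmp_;  // not owned
  std::string current_key_;
};
```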
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9611
Test Plan:
with https://github.com/facebook/rocksdb/issues/9903,
```
$ TEST_TMPDIR=/dev/shm python3.6 ../benchmark/tools/compare.py benchmarks ./db_basic_bench ../rocksdb-pr9611/db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1 --benchmark_repetitions=50
...
Comparing ./db_basic_bench to ../rocksdb-pr9611/db_basic_bench
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
...
DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_pvalue 0.0001 0.0001 U Test, Repetitions: 50 vs 50
DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_mean -0.0483 -0.0483 3924 3734 3924 3734
DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_median -0.0713 -0.0713 3971 3687 3970 3687
DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_stddev -0.0342 -0.0344 225 217 225 217
DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_cv +0.0148 +0.0146 0 0 0 0
OVERALL_GEOMEAN -0.0483 -0.0483 0 0 0 0
```
Reviewed By: akankshamahajan15
Differential Revision: D35882037
Pulled By: ajkr
fbshipit-source-id: 9e5337bbad8f1239dff7aa9f6549020d599bfcdf
Summary:
Fixes a problem where the keys from the metadata iterator were being treated as non-user keys when in fact they were user keys. This led to a problem where the property keys could not be searched for correctly.
The main exposure of this problem was that the HashIndexReader could not get the "prefixes" property correctly, resulting in the failure of retrieval/creation of the BlockPrefixIndex.
Added BlockBasedTableTest.SeekMetaBlocks test to validate this condition.
Fixing this condition exposed two other tests (SeekWithPrefixLongerThanKey, MultiGetPrefixFilter) that previously passed incorrectly and now failed. Updated those two tests to pass. Not sure if the tests are functionally correct/still appropriate, but made them pass...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8692
Reviewed By: riversand963
Differential Revision: D33119539
Pulled By: mrambacher
fbshipit-source-id: 658969fe9265f73dc184dab97cc3f4eaed2d881a
Summary:
As a first step of supporting user-defined timestamps with ingestion, the
patch adds timestamp support to `SstFileWriter`; namely, it adds new
versions of the `Put` and `Delete` APIs that take timestamps. (`Merge`
and `DeleteRange` are currently not supported with user-defined timestamps
in general but once those features are implemented, we can handle them
in `SstFileWriter` in a similar fashion.) The new APIs validate the size of
the timestamp provided by the client. Similarly, calls to the pre-existing
timestamp-less APIs are now disallowed when user-defined timestamps are
in use according to the comparator.
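A hedged usage sketch (the file path and values are made up; the exact overloads should be checked against the release headers), assuming a comparator with an 8-byte timestamp such as the one returned by `BytewiseComparatorWithU64Ts()`:
```
#include <cassert>
#include <cstring>
#include <string>

#include "rocksdb/comparator.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

int main() {
  ROCKSDB_NAMESPACE::Options options;
  // A comparator whose timestamp size is 8 bytes (a fixed64 timestamp).
  options.comparator = ROCKSDB_NAMESPACE::BytewiseComparatorWithU64Ts();

  ROCKSDB_NAMESPACE::SstFileWriter writer(ROCKSDB_NAMESPACE::EnvOptions(),
                                          options);
  auto s = writer.Open("/tmp/with_timestamps.sst");
  assert(s.ok());

  // Timestamps must match the comparator's timestamp size (8 bytes here);
  // fixed-width little-endian encoding is shown for brevity.
  uint64_t ts_val = 1;
  std::string ts(sizeof(ts_val), '\0');
  std::memcpy(&ts[0], &ts_val, sizeof(ts_val));

  s = writer.Put("key1", ts, "value1");  // timestamp-taking Put overload
  assert(s.ok());
  s = writer.Delete("key2", ts);         // ditto for Delete
  assert(s.ok());

  s = writer.Finish();
  assert(s.ok());
  return 0;
}
```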
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8899
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D30850699
Pulled By: ltamasi
fbshipit-source-id: 779154373618f19b8f0797976bb7286783c57b67
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D24579392
Pulled By: riversand963
fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
Summary:
PR https://github.com/facebook/rocksdb/issues/6944 transitioned `BlockIter` from using `Comparator*` to using
concrete `UserComparatorWrapper` and `InternalKeyComparator`. However,
adding them as instance variables to `BlockIter` was not optimal.
Bloating `BlockIter` caused the `ArenaWrappedDBIter`'s arena allocator to do more heap
allocations (in certain cases) which harmed performance of `DB::NewIterator()`. This PR
pushes down the concrete comparator objects to the point of usage, which
forces them to be on the stack. As a result, the `BlockIter` is back to
its original size prior to https://github.com/facebook/rocksdb/issues/6944 (actually a bit smaller since there
were two `Comparator*` before).
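A simplified sketch of the "push down to the point of usage" idea (types are illustrative stand-ins, not the actual classes):
```
#include <string>

// A large wrapper kept as a member grows every arena-allocated iterator; a
// wrapper built on the stack at the point of use costs nothing in iterator
// size.
struct BigComparatorWrapperLike {
  const void* user_comparator = nullptr;
  char scratch[64] = {};  // stand-in for whatever state it carries
};

class SlimIterLike {
 public:
  int CompareKeys(const std::string& a, const std::string& b) const {
    BigComparatorWrapperLike wrapper;  // lives on the stack only while needed
    (void)wrapper;
    return a.compare(b);
  }
  // No BigComparatorWrapperLike member, so sizeof(SlimIterLike) stays small
  // and DB::NewIterator()'s arena allocations stay cheap.
};
```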
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7149
Test Plan:
verified our internal `DB::NewIterator()`-heavy regression
test no longer reports regression.
Reviewed By: riversand963
Differential Revision: D22623189
Pulled By: ajkr
fbshipit-source-id: f6d69accfe5de51e0bd9874a480b32b29909bab6
Summary:
This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons.
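A standalone sketch of the expanded binary search described above (not the actual `BinarySeek()` code): starting the search range at -1 lets the result itself express "at or before restart key 0", so no extra comparison against that key is needed afterwards.
```
#include <string>
#include <vector>

// Returns the largest index whose restart key is less than the target, or -1
// if the target is at or before restart key 0. With the range expanded by one
// like this, roughly log2(N + 1) key comparisons suffice.
int SeekRestartPoint(const std::vector<std::string>& restart_keys,
                     const std::string& target) {
  int left = -1;                                      // "before the first key"
  int right = static_cast<int>(restart_keys.size()) - 1;
  while (left < right) {
    int mid = left + (right - left + 1) / 2;          // round up to make progress
    if (restart_keys[mid] < target) {
      left = mid;       // keys up to and including mid are all < target
    } else {
      right = mid - 1;  // mid's key is >= target
    }
  }
  return left;  // -1 means the linear scan starts in the first restart interval
}
```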
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068
Test Plan:
ran readrandom with mostly default settings and counted key comparisons
using `PerfContext`.
before: `user_key_comparison_count = 28881965`
after: `user_key_comparison_count = 27823245`
setup command:
```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```
benchmark command:
```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```
Reviewed By: anand1976
Differential Revision: D22357032
Pulled By: ajkr
fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24
Summary:
This saves up to two key comparisons in block seeks. The first key
comparison saved is a redundant key comparison against the restart key
where the linear scan starts. This comparison is saved in all cases
except when the found key is in the first restart interval. The
second key comparison saved is a redundant key comparison against the
restart key where the linear scan ends. This is only saved in cases
where all keys in the restart interval are less than the target
(probability roughly `1/restart_interval`).
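A minimal sketch of the first saved comparison (illustrative code, not the actual block iterator): since the binary search already established that the restart key at the start of the scan is less than the target, the linear scan can begin with the entry after it.
```
#include <string>
#include <vector>

// keys[start] is the restart key where the scan begins; the preceding binary
// search already told us keys[start] < target, so comparisons start at
// start + 1, saving one comparison per seek in the common case.
int LinearScanFrom(const std::vector<std::string>& keys, int start,
                   const std::string& target) {
  for (int i = start + 1; i < static_cast<int>(keys.size()); ++i) {
    if (!(keys[i] < target)) {
      return i;  // first key >= target
    }
  }
  return static_cast<int>(keys.size());  // all keys are < target
}
```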
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646
Test Plan:
ran a benchmark with mostly default settings and counted key comparisons
before: `user_key_comparison_count = 19399529`
after: `user_key_comparison_count = 18431498`
setup command:
```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```
benchmark command:
```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```
Reviewed By: pdillinger
Differential Revision: D20849707
Pulled By: ajkr
fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac
Summary:
`IterKey::UpdateInternalKey()` is an error-prone API as it's
incompatible with `IterKey::TrimAppend()`, which is used for
decoding delta-encoded internal keys. This PR stops using it in
`BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
buffer when needed. The logic for safely getting a Slice with global
seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
`BinarySeek()` is also migrated to use this API (previously it ignored
global seqno entirely).
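A rough, standalone sketch of the "apply global seqno in a separate buffer" idea, assuming the usual internal-key layout where the last 8 bytes pack (sequence number << 8) | value type (fixed-width little-endian encoding shown for brevity):
```
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Copy the internal key into its own buffer, then overwrite the 8-byte
// trailer with the global sequence number while preserving the value type.
std::string ApplyGlobalSeqno(const std::string& internal_key,
                             uint64_t global_seqno) {
  assert(internal_key.size() >= 8);
  std::string with_seqno = internal_key;            // separate buffer
  const size_t trailer_pos = with_seqno.size() - 8;
  uint64_t packed;
  std::memcpy(&packed, with_seqno.data() + trailer_pos, sizeof(packed));
  const uint8_t value_type = static_cast<uint8_t>(packed & 0xff);
  packed = (global_seqno << 8) | value_type;
  std::memcpy(&with_seqno[trailer_pos], &packed, sizeof(packed));
  return with_seqno;
}
```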
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843
Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.
```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```
benchmark run command:
```
TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
```
results:
| DB | code | throughput |
|---|---|---|
| normal_db | master | 267.9 |
| normal_db | PR6843 | 254.2 (-5.1%) |
| ingestion_db | master | 259.6 |
| ingestion_db | PR6843 | 250.5 (-3.5%) |
Reviewed By: pdillinger
Differential Revision: D21562604
Pulled By: ajkr
fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
Summary:
When index block is empty or an error happens while reading it,
`Invalidate()` is called rather than `Initialize()`. So `Seek()` must
not refer to member variables that are only initialized in
`Initialize()` until it is sure `Initialize()` has been called.
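A simplified sketch of the guard (illustrative names, not the actual iterator code):
```
#include <string>

// An iterator that was Invalidate()d (e.g. because the index block was empty
// or failed to read) must not touch members that only Initialize() sets up,
// so Seek() checks that state first.
class IndexIterLike {
 public:
  void Invalidate() { initialized_ = false; }
  void Initialize(const std::string* data) {
    data_ = data;
    initialized_ = true;
  }
  void Seek(const std::string& target) {
    if (!initialized_) {
      return;  // nothing to seek into; stay invalid
    }
    // ... safe to use data_ and other Initialize()-only members here ...
    (void)target;
  }

 private:
  const std::string* data_ = nullptr;  // only valid after Initialize()
  bool initialized_ = false;
};
```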
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6736
Reviewed By: siying
Differential Revision: D21139641
Pulled By: ajkr
fbshipit-source-id: 71c58cc1adbd795dc3729dd5023bf7df1515ff32
Summary:
On reading an ingested SST file, `DataBlockIter` will replace the seqno encoded in a key with the global seqno. However, if the original seqno was part of the prefix shared with the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing a wrong result to be returned. Although at this point it is only a software error while the data in the file is not corrupted, the issue can further cause compaction output to be out of order and produce corrupted results when the ingested SST participates in compaction. Fix the issue by saving the actual seqno and restoring it before the key is used as the prefix to construct the next key.
The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
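A tiny sketch of why the raw bytes matter for delta (prefix) decoding inside a block (illustrative only):
```
#include <string>

// Each block entry stores how many leading bytes it shares with the previous
// *raw* key, so the buffer used as the delta base must still hold the original
// bytes (including the original seqno), not a copy rewritten with the global
// seqno.
std::string DecodeNextKey(const std::string& prev_raw_key, size_t shared,
                          const std::string& non_shared_suffix) {
  return prev_raw_key.substr(0, shared) + non_shared_suffix;
}
```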
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669
Test Plan:
New unit test
Signed-off-by: Yi Wu <yiwu@pingcap.com>
Reviewed By: cheng-chang
Differential Revision: D20931808
Pulled By: ajkr
fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
Summary:
Original author: jeffrey-xiao
If we are writing a global seqno for an ingested file, the range
tombstone metablock gets accessed and put into the cache during
ingestion preparation. At the time, the global seqno of the ingested
file has not yet been determined, so the cached block will not have a
global seqno. When the file is ingested and we read its range tombstone
metablock, it will be returned from the cache with no global seqno. In
that case, we use the actual seqnos stored in the range tombstones,
which are all zero, so the tombstones cover nothing.
This commit removes the global_seqno_ variable from Block. When iterating over a block, the global seqno for the block is determined by the iterator instead of being stored as a mutable attribute in Block.
Additionally, this commit adds a regression test to check that keys are
deleted when ingesting a file with a global seqno and range deletion
tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429
Differential Revision: D19961563
Pulled By: ajkr
fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
Summary:
Clean up some code without any real change in functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440
Differential Revision: D20015891
Pulled By: riversand963
fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide users a tool to solve the problem, the RocksDB namespace is changed to a macro which can be overridden at build time.
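A simplified sketch of the mechanism (mirroring, but not copied from, the actual header): the namespace name becomes a macro with a default, so a build can override it, e.g. with `-DROCKSDB_NAMESPACE=my_rocksdb`, to keep two RocksDB builds from clashing.
```
// Default the namespace if the build does not override it.
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

namespace ROCKSDB_NAMESPACE {
class SomeLibraryClass {};  // library code lives in the macro-named namespace
}  // namespace ROCKSDB_NAMESPACE

// Client code can keep source compatibility with an alias if desired:
namespace rocksdb_alias = ROCKSDB_NAMESPACE;
```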
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Also try building with ROCKSDB_NAMESPACE overridden to another value.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
The recent fix to Prefix Hash (https://github.com/facebook/rocksdb/pull/6292) caused a bug: the newly created NotFound status in the hash index is never reset. This causes reseek or implicit reseek to sometimes return wrong results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302
Test Plan:
Add a unit test that would fail without the fix.
The crash test with hash index would fail within several seconds without the fix. With the fix, it runs for about several minutes before failing with a different failure.
Differential Revision: D19424572
fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566
Summary:
When prefix mode is enabled, the expected behavior when the prefix of the target does not exist is for Seek to land on any key larger than the target and for SeekForPrev to land on any key smaller than the target.
Currently, the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: i) a prefix of the searched key does not exist, and ii) the key is beyond the range of the keys in the SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than the target). The patch fixes that by returning NotFound status for cases where the prefix does not exist. SeekForPrev in BlockBasedTable accordingly calls SeekToFirst instead of SeekToLast on the index iterator.
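A hedged, simplified sketch of the read-path decision described above (hypothetical iterator interface, not the actual BlockBasedTable code):
```
#include <string>

// Hypothetical, minimal stand-in for the index iterator.
struct HashIndexIterLike {
  bool not_found = false;  // set when the target's prefix is not in the index
  void Seek(const std::string& /*target*/) {}
  void SeekToFirst() {}
  void SeekToLast() {}
};

// When the prefix index reports NotFound for the target's prefix, SeekForPrev
// positions at the first index entry (rather than the last) so the caller can
// end up on a key smaller than the target.
void SeekForPrevOnIndex(HashIndexIterLike& index_iter,
                        const std::string& target) {
  index_iter.Seek(target);
  if (index_iter.not_found) {
    index_iter.SeekToFirst();
  }
}
```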
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297
Test Plan: SeekForPrev of a non-existing prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.
Differential Revision: D19404695
fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
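A rough sketch of the deferred-load idea (illustrative classes, not the actual index/iterator code):
```
#include <cstdint>
#include <string>
#include <utility>

// When the index entry carries the block's first key, the merging iterator
// can be fed that key without reading the data block; the block is only read
// when value() is called, i.e. when this child actually reaches the top of
// the heap.
struct IndexEntryLike {
  std::string first_internal_key;  // stored in the index alongside the handle
  uint64_t block_offset = 0;
};

class LazyBlockIterLike {
 public:
  explicit LazyBlockIterLike(IndexEntryLike entry) : entry_(std::move(entry)) {}

  const std::string& key() const {
    return entry_.first_internal_key;  // served from the index; no block I/O
  }

  std::string value() {
    if (!block_loaded_) {
      // The data block is read only here, when the value is actually needed.
      loaded_value_ = ReadFirstValueFromBlock(entry_.block_offset);
      block_loaded_ = true;
    }
    return loaded_value_;
  }

 private:
  std::string ReadFirstValueFromBlock(uint64_t /*offset*/) { return "..."; }

  IndexEntryLike entry_;
  bool block_loaded_ = false;
  std::string loaded_value_;
};
```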
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
Summary:
Many logging-related source files are under util/. The code base will be more structured if they are grouped together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f