Commit graph

5 commits

Author SHA1 Message Date
Yu Zhang 9f7877f246 Add support to strip / pad timestamp when creating / reading a block based table (#11495)
Summary:
Add support to strip timestamp in block based table builder and pad timestamp in block based table reader.

On the write path, use the per column family option `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` to indicate whether user-defined timestamps should be stripped for all block based tables created for the column family.

On the read path, added a per table `TableReadOption.user_defined_timestamps_persisted` to flag whether the user keys in the table contains user defined timestamps.

This patch is mostly passing the related flags down to the block building/parsing level with the exception of handling the `first_internal_key` in `IndexValue`, which is included in the `IndexBuilder` level.  The value part of range deletion entries should have a similar handling, I haven't decided where to best fit this piece of logic, I will do it in a follow up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11495

Test Plan:
Existing test `BlockBasedTableReaderTest` is parameterized to run with:
1) different UDT test modes: kNone, kNormal, kStripUserDefinedTimestamp
2) all four index types, when index type is `kTwoLevelIndexSearch`, also enables partitioned filters
3) parallel vs non-parallel compression
4) enable/disable compression dictionary.

Also added tests for API `BlockBasedTableReader::NewIterator`.

`PartitionedFilterBlockTest` is parameterized to run with different UDT test modes:kNone, kNormal, kStripUserDefinedTimestamp.

```
make all check
./block_based_table_reader_test
./partitioned_filter_block_test
```

Reviewed By: ltamasi

Differential Revision: D46344577

Pulled By: jowlyzhang

fbshipit-source-id: 93ac8542b19319d1298712b8bed908c8831ba675
2023-06-01 11:10:03 -07:00
Andrew Kryczka b45738622a Use user-provided ReadOptions for metadata block reads more often (#11208)
Summary:
This is mostly taken from https://github.com/facebook/rocksdb/issues/10427 with my own comments addressed. This PR plumbs the user’s `ReadOptions` down to `GetOrReadIndexBlock()`, `GetOrReadFilterBlock()`, and `GetFilterPartitionBlock()`. Now those functions no longer have to make up a `ReadOptions` with incomplete information.

I also let `PartitionIndexReader::NewIterator()` pass through its caller's `ReadOptions::verify_checksums`, which was inexplicably dropped previously.

Fixes https://github.com/facebook/rocksdb/issues/10463

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11208

Test Plan:
Functional:
- Measured `-verify_checksum=false` applies to metadata blocks read outside of table open
  - setup command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56`
  - run command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=readrandom -use_existing_db=true -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56 -duration=10 -threads=32 -cache_size=131072 -statistics=true -verify_checksum=false -open_files=20 -cache_index_and_filter_blocks=true`
  - before: `rocksdb.block.checksum.compute.count COUNT : 384353`
  - after: `rocksdb.block.checksum.compute.count COUNT : 22`

Performance:
- Setup command (tmpfs, 128MB logical data size, cache indexes/filters without pinning so index/filter lookups go through table reader): `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -compression_type=none -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1`
- Measured point lookup performance. Database is fully cached to emphasize any new callstack overheads
  - Command: `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=readrandom[-W1][-X20] -use_existing_db=true -cache_index_and_filter_blocks=true -disable_auto_compactions=true -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1 -duration=10 -cache_size=1048576000`
  - Before: `readrandom [AVG    20 runs] : 274848 (± 3717) ops/sec;    8.4 (± 0.1) MB/sec`
  - After: `readrandom [AVG    20 runs] : 277904 (± 4474) ops/sec;    8.5 (± 0.1) MB/sec`

Reviewed By: hx235

Differential Revision: D43145366

Pulled By: ajkr

fbshipit-source-id: 75ec062ece86a82cd788783de9de2c72df57f994
2023-04-04 16:53:14 -07:00
anand76 727bad78b8 Format files under table/ by clang-format (#10852)
Summary:
Run clang-format on files under the `table` directory.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10852

Reviewed By: ajkr

Differential Revision: D40650732

Pulled By: anand1976

fbshipit-source-id: 2023a958e37fd6274040c5181130284600c9e0ef
2022-10-25 11:50:38 -07:00
gitbw95 4da34b97ee Set Read rate limiter priority dynamically and pass it to FS (#9996)
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.

### Solution
User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile.

**This PR is for the Read path.** The **Read:** dynamic priority for different state are listed as follows:

| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
|  Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER |
|  Compaction | IO_LOW  | IO_USER | IO_USER |
|  User | User provided | User provided | User provided |

We will respect the read_options that the user provided and will not set it.
The only sst read for Flush is the verification read in BuildTable(). It claims to be "regard as user read".

**Details**
1. Set read_options.rate_limiter_priority dynamically:
- User: Do not update the read_options. Use the read_options that the user provided.
- Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction().
- Flush: Update read_options in BuildTable().

2. Pass the rate limiter priority to FSRandomAccessFile functions:
- After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions,  including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock().
- In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch().
- Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator().

### Test Plans
Add unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996

Reviewed By: anand1976

Differential Revision: D36452483

Pulled By: gitbw95

fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf
2022-05-18 19:41:44 -07:00
sdong 674cf41732 Divide block_based_table_reader.cc (#6527)
Summary:
block_based_table_reader.cc is a giant file, which makes it hard for users to navigate the code. Divide the files to multiple files.
Some class templates cannot be moved to .cc file. They are moved to .h files. It is still better than including them all in block_based_table_reader.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6527

Test Plan: "make all check" and "make release". Also build using cmake.

Differential Revision: D20428455

fbshipit-source-id: ca713c698469f07f35bc0c271358c0874ed4eb28
2020-03-12 21:41:50 -07:00