2016-02-09 23:12:00 +00:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2017-07-15 23:03:42 +00:00
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
2014-01-22 19:44:53 +00:00
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
|
|
|
|
#pragma once
|
|
|
|
|
Meta-internal folly integration with F14FastMap (#9546)
Summary:
Especially after updating to C++17, I don't see a compelling case for
*requiring* any folly components in RocksDB. I was able to purge the existing
hard dependencies, and it can be quite difficult to strip out non-trivial components
from folly for use in RocksDB. (The prospect of doing that on F14 has changed
my mind on the best approach here.)
But this change creates an optional integration where we can plug in
components from folly at compile time, starting here with F14FastMap to replace
std::unordered_map when possible (probably no public APIs for example). I have
replaced the biggest CPU users of std::unordered_map with compile-time
pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
that is in the Makefile for public CI testing. A full folly build is not needed, but
checking out the full folly repo is much simpler for getting the dependency,
and anything else we might want to optionally integrate in the future.
Some picky details:
* I don't think the distributed mutex stuff is actually used, so it was easy to remove.
* I implemented an alternative to `folly::constexpr_log2` (which is much easier
in C++17 than C++11) so that I could pull out the hard dependencies on
`ConstexprMath.h`
* I had to add noexcept move constructors/operators to some types to make
F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
macro to make that easier in some common cases.
* Updated Meta-internal buck build to use folly F14Map (always)
No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
production integration for open source users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
Test Plan:
CircleCI tests updated so that a couple of them use folly.
Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
(Note: they should probably use buck but they currently use Makefile.)
Example performance improvement: when filter partitions are pinned in cache,
they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
a test that exercises that heavily. Build DB with
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
```
and test with (simultaneous runs with & without folly, ~20 times each to see
convergence)
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
```
Average ops/s no folly: 26229.2
Average ops/s with folly: 26853.3 (+2.4%)
Reviewed By: ajkr
Differential Revision: D34181736
Pulled By: pdillinger
fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
2022-04-13 14:34:01 +00:00
|
|
|
#include <atomic>
|
2014-01-22 19:44:53 +00:00
|
|
|
#include <string>
|
Meta-internal folly integration with F14FastMap (#9546)
Summary:
Especially after updating to C++17, I don't see a compelling case for
*requiring* any folly components in RocksDB. I was able to purge the existing
hard dependencies, and it can be quite difficult to strip out non-trivial components
from folly for use in RocksDB. (The prospect of doing that on F14 has changed
my mind on the best approach here.)
But this change creates an optional integration where we can plug in
components from folly at compile time, starting here with F14FastMap to replace
std::unordered_map when possible (probably no public APIs for example). I have
replaced the biggest CPU users of std::unordered_map with compile-time
pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
that is in the Makefile for public CI testing. A full folly build is not needed, but
checking out the full folly repo is much simpler for getting the dependency,
and anything else we might want to optionally integrate in the future.
Some picky details:
* I don't think the distributed mutex stuff is actually used, so it was easy to remove.
* I implemented an alternative to `folly::constexpr_log2` (which is much easier
in C++17 than C++11) so that I could pull out the hard dependencies on
`ConstexprMath.h`
* I had to add noexcept move constructors/operators to some types to make
F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
macro to make that easier in some common cases.
* Updated Meta-internal buck build to use folly F14Map (always)
No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
production integration for open source users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
Test Plan:
CircleCI tests updated so that a couple of them use folly.
Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
(Note: they should probably use buck but they currently use Makefile.)
Example performance improvement: when filter partitions are pinned in cache,
they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
a test that exercises that heavily. Build DB with
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
```
and test with (simultaneous runs with & without folly, ~20 times each to see
convergence)
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
```
Average ops/s no folly: 26229.2
Average ops/s with folly: 26853.3 (+2.4%)
Reviewed By: ajkr
Differential Revision: D34181736
Pulled By: pdillinger
fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
2022-04-13 14:34:01 +00:00
|
|
|
#include <unordered_map>
|
2014-01-22 19:44:53 +00:00
|
|
|
#include <vector>
|
|
|
|
|
Account memory of FileMetaData in global memory limit (#9924)
Summary:
**Context/Summary:**
As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924
Test Plan:
- Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released
- New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
- db bench (CPU cost of `charge_file_metadata` in write and compact)
- **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
- **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`
table 1 - write
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**
table 2 - compact
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**
- stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal
Reviewed By: ajkr
Differential Revision: D36055583
Pulled By: hx235
fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2022-06-14 20:06:40 +00:00
|
|
|
#include "cache/cache_reservation_manager.h"
|
2014-02-06 23:42:16 +00:00
|
|
|
#include "db/memtable_list.h"
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 17:07:55 +00:00
|
|
|
#include "db/table_cache.h"
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 17:04:30 +00:00
|
|
|
#include "db/table_properties_collector.h"
|
2016-09-02 21:16:31 +00:00
|
|
|
#include "db/write_batch_internal.h"
|
|
|
|
#include "db/write_controller.h"
|
2017-04-06 02:02:00 +00:00
|
|
|
#include "options/cf_options.h"
|
2015-06-03 00:07:16 +00:00
|
|
|
#include "rocksdb/compaction_job_stats.h"
|
|
|
|
#include "rocksdb/db.h"
|
|
|
|
#include "rocksdb/env.h"
|
|
|
|
#include "rocksdb/options.h"
|
2019-06-13 22:39:52 +00:00
|
|
|
#include "trace_replay/block_cache_tracer.h"
|
Support returning write unix time in iterator property (#12428)
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.
2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.
Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.
2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.
3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428
Test Plan: Added unit test
Reviewed By: pdillinger
Differential Revision: D54967067
Pulled By: jowlyzhang
fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
2024-03-15 22:37:37 +00:00
|
|
|
#include "util/cast_util.h"
|
Meta-internal folly integration with F14FastMap (#9546)
Summary:
Especially after updating to C++17, I don't see a compelling case for
*requiring* any folly components in RocksDB. I was able to purge the existing
hard dependencies, and it can be quite difficult to strip out non-trivial components
from folly for use in RocksDB. (The prospect of doing that on F14 has changed
my mind on the best approach here.)
But this change creates an optional integration where we can plug in
components from folly at compile time, starting here with F14FastMap to replace
std::unordered_map when possible (probably no public APIs for example). I have
replaced the biggest CPU users of std::unordered_map with compile-time
pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
that is in the Makefile for public CI testing. A full folly build is not needed, but
checking out the full folly repo is much simpler for getting the dependency,
and anything else we might want to optionally integrate in the future.
Some picky details:
* I don't think the distributed mutex stuff is actually used, so it was easy to remove.
* I implemented an alternative to `folly::constexpr_log2` (which is much easier
in C++17 than C++11) so that I could pull out the hard dependencies on
`ConstexprMath.h`
* I had to add noexcept move constructors/operators to some types to make
F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
macro to make that easier in some common cases.
* Updated Meta-internal buck build to use folly F14Map (always)
No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
production integration for open source users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
Test Plan:
CircleCI tests updated so that a couple of them use folly.
Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
(Note: they should probably use buck but they currently use Makefile.)
Example performance improvement: when filter partitions are pinned in cache,
they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
a test that exercises that heavily. Build DB with
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
```
and test with (simultaneous runs with & without folly, ~20 times each to see
convergence)
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
```
Average ops/s no folly: 26229.2
Average ops/s with folly: 26853.3 (+2.4%)
Reviewed By: ajkr
Differential Revision: D34181736
Pulled By: pdillinger
fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
2022-04-13 14:34:01 +00:00
|
|
|
#include "util/hash_containers.h"
|
2015-02-05 05:39:45 +00:00
|
|
|
#include "util/thread_local.h"
|
2014-01-24 22:30:28 +00:00
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
namespace ROCKSDB_NAMESPACE {
|
2014-01-22 19:44:53 +00:00
|
|
|
|
|
|
|
class Version;
|
|
|
|
class VersionSet;
|
2018-02-12 23:34:39 +00:00
|
|
|
class VersionStorageInfo;
|
2014-01-24 22:30:28 +00:00
|
|
|
class MemTable;
|
|
|
|
class MemTableListVersion;
|
2014-01-31 23:30:27 +00:00
|
|
|
class CompactionPicker;
|
|
|
|
class Compaction;
|
|
|
|
class InternalKey;
|
2014-02-05 01:45:19 +00:00
|
|
|
class InternalStats;
|
2014-02-11 01:04:44 +00:00
|
|
|
class ColumnFamilyData;
|
|
|
|
class DBImpl;
|
2014-03-11 00:25:10 +00:00
|
|
|
class LogBuffer;
|
2015-02-05 05:39:45 +00:00
|
|
|
class InstrumentedMutex;
|
|
|
|
class InstrumentedMutexLock;
|
2017-10-06 01:00:38 +00:00
|
|
|
struct SuperVersionContext;
|
2020-10-15 20:02:44 +00:00
|
|
|
class BlobFileCache;
|
2022-06-21 03:58:11 +00:00
|
|
|
class BlobSource;
|
2014-02-11 01:04:44 +00:00
|
|
|
|
2016-11-23 17:19:11 +00:00
|
|
|
extern const double kIncSlowdownRatio;
|
2019-05-24 19:03:16 +00:00
|
|
|
// This file contains a list of data structures for managing column family
|
2019-06-13 22:39:52 +00:00
|
|
|
// level metadata.
|
2019-05-24 19:03:16 +00:00
|
|
|
//
|
|
|
|
// The basic relationships among classes declared here are illustrated as
|
|
|
|
// following:
|
|
|
|
//
|
|
|
|
// +----------------------+ +----------------------+ +--------+
|
|
|
|
// +---+ ColumnFamilyHandle 1 | +--+ ColumnFamilyHandle 2 | | DBImpl |
|
|
|
|
// | +----------------------+ | +----------------------+ +----+---+
|
|
|
|
// | +--------------------------+ |
|
|
|
|
// | | +-----------------------------+
|
|
|
|
// | | |
|
|
|
|
// | | +-----------------------------v-------------------------------+
|
|
|
|
// | | | |
|
|
|
|
// | | | ColumnFamilySet |
|
|
|
|
// | | | |
|
|
|
|
// | | +-------------+--------------------------+----------------+---+
|
|
|
|
// | | | | |
|
|
|
|
// | +-------------------------------------+ | |
|
|
|
|
// | | | | v
|
|
|
|
// | +-------------v-------------+ +-----v----v---------+
|
|
|
|
// | | | | |
|
|
|
|
// | | ColumnFamilyData 1 | | ColumnFamilyData 2 | ......
|
|
|
|
// | | | | |
|
|
|
|
// +---> | | |
|
|
|
|
// | +---------+ | |
|
|
|
|
// | | MemTable| | |
|
|
|
|
// | | List | | |
|
|
|
|
// +--------+---+--+-+----+----+ +--------------------++
|
|
|
|
// | | | |
|
|
|
|
// | | | |
|
|
|
|
// | | | +-----------------------+
|
|
|
|
// | | +-----------+ |
|
|
|
|
// v +--------+ | |
|
|
|
|
// +--------+--------+ | | |
|
|
|
|
// | | | | +----------v----------+
|
|
|
|
// +---> |SuperVersion 1.a +-----------------> |
|
|
|
|
// | +------+ | | MemTableListVersion |
|
|
|
|
// +---+-------------+ | | | | |
|
|
|
|
// | | | | +----+------------+---+
|
|
|
|
// | current | | | | |
|
|
|
|
// | +-------------+ | |mem | |
|
|
|
|
// | | | | | |
|
|
|
|
// +-v---v-------+ +---v--v---+ +-----v----+ +----v-----+
|
|
|
|
// | | | | | | | |
|
|
|
|
// | Version 1.a | | memtable | | memtable | | memtable |
|
|
|
|
// | | | 1.a | | 1.b | | 1.c |
|
|
|
|
// +-------------+ | | | | | |
|
|
|
|
// +----------+ +----------+ +----------+
|
2019-06-13 22:39:52 +00:00
|
|
|
//
|
2019-05-24 19:03:16 +00:00
|
|
|
// DBImpl keeps a ColumnFamilySet, which references to all column families by
|
|
|
|
// pointing to respective ColumnFamilyData object of each column family.
|
|
|
|
// This is how DBImpl can list and operate on all the column families.
|
|
|
|
// ColumnFamilyHandle also points to ColumnFamilyData directly, so that
|
|
|
|
// when a user executes a query, it can directly find memtables and Version
|
|
|
|
// as well as SuperVersion to the column family, without going through
|
|
|
|
// ColumnFamilySet.
|
|
|
|
//
|
|
|
|
// ColumnFamilySet points to the latest view of the LSM-tree (list of memtables
|
|
|
|
// and SST files) indirectly, while ongoing operations may hold references
|
|
|
|
// to a current or an out-of-date SuperVersion, which in turn points to a
|
|
|
|
// point-in-time view of the LSM-tree. This guarantees the memtables and SST
|
|
|
|
// files being operated on will not go away, until the SuperVersion is
|
|
|
|
// unreferenced to 0 and destoryed.
|
|
|
|
//
|
|
|
|
// The following graph illustrates a possible referencing relationships:
|
|
|
|
//
|
|
|
|
// Column +--------------+ current +-----------+
|
|
|
|
// Family +---->+ +------------------->+ |
|
|
|
|
// Data | SuperVersion +----------+ | Version A |
|
|
|
|
// | 3 | imm | | |
|
|
|
|
// Iter2 +----->+ | +-------v------+ +-----------+
|
|
|
|
// +-----+--------+ | MemtableList +----------------> Empty
|
|
|
|
// | | Version r | +-----------+
|
|
|
|
// | +--------------+ | |
|
|
|
|
// +------------------+ current| Version B |
|
|
|
|
// +--------------+ | +----->+ |
|
|
|
|
// | | | | +-----+-----+
|
|
|
|
// Compaction +>+ SuperVersion +-------------+ ^
|
|
|
|
// Job | 2 +------+ | |current
|
|
|
|
// | +----+ | | mem | +------------+
|
|
|
|
// +--------------+ | | +---------------------> |
|
|
|
|
// | +------------------------> MemTable a |
|
|
|
|
// | mem | | |
|
|
|
|
// +--------------+ | | +------------+
|
|
|
|
// | +--------------------------+
|
|
|
|
// Iter1 +-----> SuperVersion | | +------------+
|
|
|
|
// | 1 +------------------------------>+ |
|
|
|
|
// | +-+ | mem | MemTable b |
|
|
|
|
// +--------------+ | | | |
|
|
|
|
// | | +--------------+ +-----^------+
|
|
|
|
// | |imm | MemtableList | |
|
|
|
|
// | +--->+ Version s +------------+
|
|
|
|
// | +--------------+
|
|
|
|
// | +--------------+
|
|
|
|
// | | MemtableList |
|
|
|
|
// +------>+ Version t +--------> Empty
|
|
|
|
// imm +--------------+
|
|
|
|
//
|
|
|
|
// In this example, even if the current LSM-tree consists of Version A and
|
|
|
|
// memtable a, which is also referenced by SuperVersion, two older SuperVersion
|
|
|
|
// SuperVersion2 and Superversion1 still exist, and are referenced by a
|
|
|
|
// compaction job and an old iterator Iter1, respectively. SuperVersion2
|
|
|
|
// contains Version B, memtable a and memtable b; SuperVersion1 contains
|
|
|
|
// Version B and memtable b (mutable). As a result, Version B and memtable b
|
|
|
|
// are prevented from being destroyed or deleted.
|
2019-06-13 22:39:52 +00:00
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
// ColumnFamilyHandleImpl is the class that clients use to access different
|
|
|
|
// column families. It has non-trivial destructor, which gets called when client
|
|
|
|
// is done using the column family
|
2014-02-11 01:04:44 +00:00
|
|
|
class ColumnFamilyHandleImpl : public ColumnFamilyHandle {
|
|
|
|
public:
|
|
|
|
// create while holding the mutex
|
2022-11-02 21:34:24 +00:00
|
|
|
ColumnFamilyHandleImpl(ColumnFamilyData* cfd, DBImpl* db,
|
|
|
|
InstrumentedMutex* mutex);
|
2014-02-11 01:04:44 +00:00
|
|
|
// destroy without mutex
|
|
|
|
virtual ~ColumnFamilyHandleImpl();
|
|
|
|
virtual ColumnFamilyData* cfd() const { return cfd_; }
|
Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395)
Summary:
In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: https://github.com/facebook/rocksdb/issues/11925 #11943, and https://github.com/facebook/rocksdb/issues/11977.
To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395
Test Plan:
# Summary
In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR #12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: #11925 #11943, and #11977.
To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`.
# Test Plan
There should be no behavior changes. Existing tests and CI for the correctness tests.
**Test for Perf Regression**
Build
```
$> make -j64 release
```
Setup
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator : 0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom : 4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator : 0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom : 4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found)
```
Reviewed By: pdillinger
Differential Revision: D54332713
Pulled By: jaykorean
fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5
2024-03-01 18:28:20 +00:00
|
|
|
virtual DBImpl* db() const { return db_; }
|
2014-02-11 01:04:44 +00:00
|
|
|
|
2024-01-31 21:14:42 +00:00
|
|
|
uint32_t GetID() const override;
|
|
|
|
const std::string& GetName() const override;
|
|
|
|
Status GetDescriptor(ColumnFamilyDescriptor* desc) override;
|
|
|
|
const Comparator* GetComparator() const override;
|
2014-02-26 01:30:54 +00:00
|
|
|
|
2014-02-11 01:04:44 +00:00
|
|
|
private:
|
|
|
|
ColumnFamilyData* cfd_;
|
|
|
|
DBImpl* db_;
|
2015-02-05 05:39:45 +00:00
|
|
|
InstrumentedMutex* mutex_;
|
2014-02-11 01:04:44 +00:00
|
|
|
};
|
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
// Does not ref-count ColumnFamilyData
|
|
|
|
// We use this dummy ColumnFamilyHandleImpl because sometimes MemTableInserter
|
|
|
|
// calls DBImpl methods. When this happens, MemTableInserter need access to
|
|
|
|
// ColumnFamilyHandle (same as the client would need). In that case, we feed
|
|
|
|
// MemTableInserter dummy ColumnFamilyHandle and enable it to call DBImpl
|
|
|
|
// methods
|
2014-02-11 01:04:44 +00:00
|
|
|
class ColumnFamilyHandleInternal : public ColumnFamilyHandleImpl {
|
|
|
|
public:
|
|
|
|
ColumnFamilyHandleInternal()
|
2022-11-02 21:34:24 +00:00
|
|
|
: ColumnFamilyHandleImpl(nullptr, nullptr, nullptr),
|
|
|
|
internal_cfd_(nullptr) {}
|
2014-02-11 01:04:44 +00:00
|
|
|
|
2014-11-06 19:14:28 +00:00
|
|
|
void SetCFD(ColumnFamilyData* _cfd) { internal_cfd_ = _cfd; }
|
2024-01-31 21:14:42 +00:00
|
|
|
ColumnFamilyData* cfd() const override { return internal_cfd_; }
|
2014-02-11 01:04:44 +00:00
|
|
|
|
|
|
|
private:
|
|
|
|
ColumnFamilyData* internal_cfd_;
|
|
|
|
};
|
2014-01-24 22:30:28 +00:00
|
|
|
|
|
|
|
// holds references to memtable, all immutable memtables and version
|
|
|
|
struct SuperVersion {
|
2015-04-09 04:10:35 +00:00
|
|
|
// Accessing members of this class is not thread-safe and requires external
|
|
|
|
// synchronization (ie db mutex held or on write thread).
|
2019-12-13 03:02:51 +00:00
|
|
|
ColumnFamilyData* cfd;
|
2014-01-24 22:30:28 +00:00
|
|
|
MemTable* mem;
|
|
|
|
MemTableListVersion* imm;
|
|
|
|
Version* current;
|
2014-09-17 19:49:13 +00:00
|
|
|
MutableCFOptions mutable_cf_options;
|
2014-03-04 01:54:04 +00:00
|
|
|
// Version number of the current SuperVersion
|
|
|
|
uint64_t version_number;
|
2017-10-06 01:00:38 +00:00
|
|
|
WriteStallCondition write_stall_condition;
|
2023-09-13 23:34:18 +00:00
|
|
|
// Each time `full_history_ts_low` collapses history, a new SuperVersion is
|
|
|
|
// installed. This field tracks the effective `full_history_ts_low` for that
|
|
|
|
// SuperVersion, to be used by read APIs for sanity checks. This field is
|
|
|
|
// immutable once SuperVersion is installed. For column family that doesn't
|
|
|
|
// enable UDT feature, this is an empty string.
|
|
|
|
std::string full_history_ts_low;
|
2015-04-09 04:10:35 +00:00
|
|
|
|
Support returning write unix time in iterator property (#12428)
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.
2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.
Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.
2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.
3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428
Test Plan: Added unit test
Reviewed By: pdillinger
Differential Revision: D54967067
Pulled By: jowlyzhang
fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
2024-03-15 22:37:37 +00:00
|
|
|
// A shared copy of the DB's seqno to time mapping.
|
|
|
|
std::shared_ptr<const SeqnoToTimeMapping> seqno_to_time_mapping{nullptr};
|
|
|
|
|
2014-01-24 22:30:28 +00:00
|
|
|
// should be called outside the mutex
|
2014-02-12 22:01:30 +00:00
|
|
|
SuperVersion() = default;
|
2014-01-24 22:30:28 +00:00
|
|
|
~SuperVersion();
|
|
|
|
SuperVersion* Ref();
|
2015-04-09 04:10:35 +00:00
|
|
|
// If Unref() returns true, Cleanup() should be called with mutex held
|
|
|
|
// before deleting this SuperVersion.
|
2014-01-24 22:30:28 +00:00
|
|
|
bool Unref();
|
|
|
|
|
|
|
|
// call these two methods with db mutex held
|
|
|
|
// Cleanup unrefs mem, imm and current. Also, it stores all memtables
|
|
|
|
// that needs to be deleted in to_delete vector. Unrefing those
|
|
|
|
// objects needs to be done in the mutex
|
Make mempurge a background process (equivalent to in-memory compaction). (#8505)
Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
2021-07-10 00:16:00 +00:00
|
|
|
void Cleanup();
|
Support returning write unix time in iterator property (#12428)
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.
2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.
Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.
2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.
3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428
Test Plan: Added unit test
Reviewed By: pdillinger
Differential Revision: D54967067
Pulled By: jowlyzhang
fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
2024-03-15 22:37:37 +00:00
|
|
|
void Init(
|
|
|
|
ColumnFamilyData* new_cfd, MemTable* new_mem,
|
|
|
|
MemTableListVersion* new_imm, Version* new_current,
|
|
|
|
std::shared_ptr<const SeqnoToTimeMapping> new_seqno_to_time_mapping);
|
|
|
|
|
|
|
|
// Share the ownership of the seqno to time mapping object referred to in this
|
|
|
|
// SuperVersion. To be used by the new SuperVersion to be installed after this
|
|
|
|
// one if seqno to time mapping does not change in between these two
|
2024-03-21 17:00:15 +00:00
|
|
|
// SuperVersions. Or to share the ownership of the mapping with a FlushJob.
|
Support returning write unix time in iterator property (#12428)
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.
2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.
Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.
2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.
3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428
Test Plan: Added unit test
Reviewed By: pdillinger
Differential Revision: D54967067
Pulled By: jowlyzhang
fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
2024-03-15 22:37:37 +00:00
|
|
|
std::shared_ptr<const SeqnoToTimeMapping> ShareSeqnoToTimeMapping() {
|
|
|
|
return seqno_to_time_mapping;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Access the seqno to time mapping object in this SuperVersion.
|
|
|
|
UnownedPtr<const SeqnoToTimeMapping> GetSeqnoToTimeMapping() const {
|
|
|
|
return seqno_to_time_mapping.get();
|
|
|
|
}
|
2014-03-08 00:59:47 +00:00
|
|
|
|
|
|
|
// The value of dummy is not actually used. kSVInUse takes its address as a
|
|
|
|
// mark in the thread local storage to indicate the SuperVersion is in use
|
|
|
|
// by thread. This way, the value of kSVInUse is guaranteed to have no
|
|
|
|
// conflict with SuperVersion object address and portable on different
|
|
|
|
// platform.
|
|
|
|
static int dummy;
|
|
|
|
static void* const kSVInUse;
|
|
|
|
static void* const kSVObsolete;
|
2015-04-09 04:10:35 +00:00
|
|
|
|
|
|
|
private:
|
|
|
|
std::atomic<uint32_t> refs;
|
|
|
|
// We need to_delete because during Cleanup(), imm->Unref() returns
|
|
|
|
// all memtables that we need to free through this vector. We then
|
|
|
|
// delete all those memtables outside of mutex, during destruction
|
|
|
|
autovector<MemTable*> to_delete;
|
2014-01-24 22:30:28 +00:00
|
|
|
};
|
2014-01-22 19:44:53 +00:00
|
|
|
|
2024-01-29 18:38:08 +00:00
|
|
|
Status CheckCompressionSupported(const ColumnFamilyOptions& cf_options);
|
2015-06-18 21:55:05 +00:00
|
|
|
|
2024-01-29 18:38:08 +00:00
|
|
|
Status CheckConcurrentWritesSupported(const ColumnFamilyOptions& cf_options);
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 23:59:07 +00:00
|
|
|
|
2024-01-29 18:38:08 +00:00
|
|
|
Status CheckCFPathsSupported(const DBOptions& db_options,
|
|
|
|
const ColumnFamilyOptions& cf_options);
|
2018-04-06 02:49:06 +00:00
|
|
|
|
2024-01-29 18:38:08 +00:00
|
|
|
ColumnFamilyOptions SanitizeOptions(const ImmutableDBOptions& db_options,
|
|
|
|
const ColumnFamilyOptions& src);
|
2021-03-26 04:17:17 +00:00
|
|
|
// Wrap user defined table properties collector factories `from cf_options`
|
2024-02-02 22:14:43 +00:00
|
|
|
// into internal ones in internal_tbl_prop_coll_factories. Add a system internal
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 17:04:30 +00:00
|
|
|
// one too.
|
2024-02-02 22:14:43 +00:00
|
|
|
void GetInternalTblPropCollFactory(
|
2016-09-23 23:34:04 +00:00
|
|
|
const ImmutableCFOptions& ioptions,
|
2024-02-02 22:14:43 +00:00
|
|
|
InternalTblPropCollFactories* internal_tbl_prop_coll_factories);
|
2014-02-05 00:31:18 +00:00
|
|
|
|
2014-02-11 01:04:44 +00:00
|
|
|
class ColumnFamilySet;
|
|
|
|
|
2015-01-06 20:44:21 +00:00
|
|
|
// This class keeps all the data that a column family needs.
|
2014-03-11 21:52:17 +00:00
|
|
|
// Most methods require DB mutex held, unless otherwise noted
|
2014-01-29 21:28:50 +00:00
|
|
|
class ColumnFamilyData {
|
|
|
|
public:
|
2014-02-11 01:04:44 +00:00
|
|
|
~ColumnFamilyData();
|
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
// thread-safe
|
2014-01-29 21:28:50 +00:00
|
|
|
uint32_t GetID() const { return id_; }
|
2014-03-11 21:52:17 +00:00
|
|
|
// thread-safe
|
|
|
|
const std::string& GetName() const { return name_; }
|
2014-01-29 21:28:50 +00:00
|
|
|
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 23:59:07 +00:00
|
|
|
// Ref() can only be called from a context where the caller can guarantee
|
|
|
|
// that ColumnFamilyData is alive (while holding a non-zero ref already,
|
|
|
|
// holding a DB mutex, or as the leader in a write batch group).
|
Bump up memory order of ref counting of ColumnFamilyData (#5723)
Summary:
We see this TSAN warning:
WARNING: ThreadSanitizer: data race (pid=282806)
Write of size 8 at 0x7b6c00000e38 by thread T16 (mutexes: write M1023578822185846136):
#0 operator delete(void*) <null> (libtsan.so.0+0x0000000795f8)
https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2202 (db_flush_test+0x00000060b462)
https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2226 (db_flush_test+0x00000060cbd8)
https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BGWorkFlush(void*) db/db_impl/db_impl_compaction_flush.cc:2073 (db_flush_test+0x00000060d5ac)
......
Previous atomic write of size 4 at 0x7b6c00000e38 by main thread:
#0 __tsan_atomic32_fetch_sub <null> (libtsan.so.0+0x00000006d721)
https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<int>::fetch_sub(int, std::memory_order) /mnt/gvfs/third-party2/libgcc/c67031f0f739ac61575a061518d6ef5038f99f90/7.x/platform007/5620abc/include/c++/7.3.0/bits/atomic_base.h:524 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/2 rocksdb::ColumnFamilyData::Unref() db/column_family.h:286 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&, rocksdb::FlushReason, bool) db/db_impl/db_impl_compaction_flush.cc:1624 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::TEST_FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&) db/db_impl/db_impl_debug.cc:127 (db_flush_test+0x00000061ace9)
https://github.com/facebook/rocksdb/issues/5 rocksdb::DBFlushTest_CFDropRaceWithWaitForFlushMemTables_Test::TestBody() db/db_flush_test.cc:320 (db_flush_test+0x0000004b44e5)
https://github.com/facebook/rocksdb/issues/6 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.7.0/fused-src/gtest/gtest-all.cc:3824 (db_flush_test+0x000000be2988)
......
It's still very clear the cause of the warning is because that TSAN treats results from relaxed atomic::fetch_sub() as non-atomic with the operation itself. We can make it more explicit by bumping up the order to CS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5723
Test Plan: Run all existing test.
Differential Revision: D16908250
fbshipit-source-id: bf17d39ed19058372bdf97f6440a743f88153021
2019-08-20 17:33:03 +00:00
|
|
|
void Ref() { refs_.fetch_add(1); }
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 23:59:07 +00:00
|
|
|
|
2019-12-13 03:02:51 +00:00
|
|
|
// UnrefAndTryDelete() decreases the reference count and do free if needed,
|
|
|
|
// return true if this is freed else false, UnrefAndTryDelete() can only
|
|
|
|
// be called while holding a DB mutex, or during single-threaded recovery.
|
Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605)
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
https://github.com/facebook/rocksdb/issues/8440 ,
https://github.com/facebook/rocksdb/issues/8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.
Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.
The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605
Test Plan: Ran `make check` and stress tests locally.
Reviewed By: pdillinger
Differential Revision: D30051035
Pulled By: ltamasi
fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
2021-08-03 01:10:57 +00:00
|
|
|
bool UnrefAndTryDelete();
|
2019-12-13 03:02:51 +00:00
|
|
|
|
2015-01-06 20:44:21 +00:00
|
|
|
// SetDropped() can only be called under following conditions:
|
|
|
|
// 1) Holding a DB mutex,
|
|
|
|
// 2) from single-threaded write thread, AND
|
|
|
|
// 3) from single-threaded VersionSet::LogAndApply()
|
2014-03-11 21:52:17 +00:00
|
|
|
// After dropping column family no other operation on that column family
|
|
|
|
// will be executed. All the files and memory will be, however, kept around
|
|
|
|
// until client drops the column family handle. That way, client can still
|
|
|
|
// access data from dropped column family.
|
|
|
|
// Column family can be dropped and still alive. In that state:
|
|
|
|
// *) Compaction and flush is not executed on the dropped column family.
|
2015-01-06 20:44:21 +00:00
|
|
|
// *) Client can continue reading from column family. Writes will fail unless
|
|
|
|
// WriteOptions::ignore_missing_column_families is true
|
2014-03-11 21:52:17 +00:00
|
|
|
// When the dropped column family is unreferenced, then we:
|
2015-03-20 00:04:29 +00:00
|
|
|
// *) Remove column family from the linked list maintained by ColumnFamilySet
|
2014-03-11 21:52:17 +00:00
|
|
|
// *) delete all memory associated with that column family
|
|
|
|
// *) delete all the files associated with that column family
|
2015-01-06 20:44:21 +00:00
|
|
|
void SetDropped();
|
2017-12-11 21:46:57 +00:00
|
|
|
bool IsDropped() const { return dropped_.load(std::memory_order_relaxed); }
|
2014-02-11 01:04:44 +00:00
|
|
|
|
2024-06-13 20:18:10 +00:00
|
|
|
void SetFlushSkipReschedule();
|
|
|
|
|
|
|
|
bool GetAndClearFlushSkipReschedule();
|
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
// thread-safe
|
2014-10-23 22:34:21 +00:00
|
|
|
int NumberLevels() const { return ioptions_.num_levels; }
|
2014-01-31 23:30:27 +00:00
|
|
|
|
2014-01-29 21:28:50 +00:00
|
|
|
void SetLogNumber(uint64_t log_number) { log_number_ = log_number; }
|
|
|
|
uint64_t GetLogNumber() const { return log_number_; }
|
|
|
|
|
2014-11-18 18:20:10 +00:00
|
|
|
// thread-safe
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 22:47:08 +00:00
|
|
|
const FileOptions* soptions() const;
|
2021-05-05 20:59:21 +00:00
|
|
|
const ImmutableOptions* ioptions() const { return &ioptions_; }
|
2014-09-17 19:49:13 +00:00
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
// This returns the MutableCFOptions used by current SuperVersion
|
2016-04-30 00:00:50 +00:00
|
|
|
// You should use this API to reference MutableCFOptions most of the time.
|
2014-11-06 19:14:28 +00:00
|
|
|
const MutableCFOptions* GetCurrentMutableCFOptions() const {
|
2014-09-17 19:49:13 +00:00
|
|
|
return &(super_version_->mutable_cf_options);
|
|
|
|
}
|
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
// This returns the latest MutableCFOptions, which may be not in effect yet.
|
|
|
|
const MutableCFOptions* GetLatestMutableCFOptions() const {
|
|
|
|
return &mutable_cf_options_;
|
|
|
|
}
|
2016-09-15 05:10:28 +00:00
|
|
|
|
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
// Build ColumnFamiliesOptions with immutable options and latest mutable
|
|
|
|
// options.
|
|
|
|
ColumnFamilyOptions GetLatestCFOptions() const;
|
|
|
|
|
2016-11-15 23:18:56 +00:00
|
|
|
bool is_delete_range_supported() { return is_delete_range_supported_; }
|
|
|
|
|
2019-06-04 02:47:02 +00:00
|
|
|
// Validate CF options against DB options
|
|
|
|
static Status ValidateOptions(const DBOptions& db_options,
|
|
|
|
const ColumnFamilyOptions& cf_options);
|
2014-09-17 19:49:13 +00:00
|
|
|
// REQUIRES: DB mutex held
|
2014-11-05 00:23:05 +00:00
|
|
|
Status SetOptions(
|
2019-06-04 02:47:02 +00:00
|
|
|
const DBOptions& db_options,
|
2014-09-17 19:49:13 +00:00
|
|
|
const std::unordered_map<std::string, std::string>& options_map);
|
2014-03-11 21:52:17 +00:00
|
|
|
|
|
|
|
InternalStats* internal_stats() { return internal_stats_.get(); }
|
2014-01-29 21:28:50 +00:00
|
|
|
|
|
|
|
MemTableList* imm() { return &imm_; }
|
|
|
|
MemTable* mem() { return mem_; }
|
Track WAL obsoletion when updating empty CF's log number (#7781)
Summary:
In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk.
For example, consider there are 3 column families: 0, 1, 2:
1. initially, all the column families' log number is 1;
2. write some data to cf0, and flush cf0, but the flush is pending;
3. now a new WAL 2 is created;
4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number);
5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk;
6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2;
7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3;
8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, and the `MinLogNumberToKeep` is 1, the second is for flushing cf0, and the `MinLogNumberToKeep` is 2. So without this PR, if the DB crashes at this point and try to recover, `WalSet` will still expect WAL 2 to exist.
When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption.
This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781
Test Plan:
watch existing tests and stress tests to pass.
`make -j48 blackbox_crash_test` on devserver
Reviewed By: ltamasi
Differential Revision: D25631695
Pulled By: cheng-chang
fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec
2020-12-19 05:33:20 +00:00
|
|
|
|
|
|
|
bool IsEmpty() {
|
|
|
|
return mem()->GetFirstSequenceNumber() == 0 && imm()->NumNotFlushed() == 0;
|
|
|
|
}
|
|
|
|
|
2014-01-29 21:28:50 +00:00
|
|
|
Version* current() { return current_; }
|
|
|
|
Version* dummy_versions() { return dummy_versions_; }
|
2015-12-28 17:50:49 +00:00
|
|
|
void SetCurrent(Version* _current);
|
2022-11-02 21:34:24 +00:00
|
|
|
uint64_t GetNumLiveVersions() const; // REQUIRE: DB mutex held
|
2015-08-20 18:47:19 +00:00
|
|
|
uint64_t GetTotalSstFilesSize() const; // REQUIRE: DB mutex held
|
2018-03-02 01:50:54 +00:00
|
|
|
uint64_t GetLiveSstFilesSize() const; // REQUIRE: DB mutex held
|
2021-09-08 19:19:01 +00:00
|
|
|
uint64_t GetTotalBlobFileSize() const; // REQUIRE: DB mutex held
|
2018-01-19 01:32:50 +00:00
|
|
|
void SetMemtable(MemTable* new_mem) {
|
|
|
|
uint64_t memtable_id = last_memtable_id_.fetch_add(1) + 1;
|
|
|
|
new_mem->SetID(memtable_id);
|
|
|
|
mem_ = new_mem;
|
|
|
|
}
|
2015-05-29 21:36:35 +00:00
|
|
|
|
2017-01-19 23:21:07 +00:00
|
|
|
// calculate the oldest log needed for the durability of this column family
|
|
|
|
uint64_t OldestLogToKeep();
|
|
|
|
|
2015-05-29 21:36:35 +00:00
|
|
|
// See Memtable constructor for explanation of earliest_seq param.
|
|
|
|
MemTable* ConstructNewMemtable(const MutableCFOptions& mutable_cf_options,
|
2021-07-23 01:26:47 +00:00
|
|
|
SequenceNumber earliest_seq);
|
2015-05-29 21:36:35 +00:00
|
|
|
void CreateNewMemtable(const MutableCFOptions& mutable_cf_options,
|
2021-07-23 01:26:47 +00:00
|
|
|
SequenceNumber earliest_seq);
|
2014-01-29 21:28:50 +00:00
|
|
|
|
2014-05-30 21:31:55 +00:00
|
|
|
TableCache* table_cache() const { return table_cache_.get(); }
|
2022-06-21 03:58:11 +00:00
|
|
|
BlobSource* blob_source() const { return blob_source_.get(); }
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 17:07:55 +00:00
|
|
|
|
2014-01-31 23:30:27 +00:00
|
|
|
// See documentation in compaction_picker.h
|
2014-10-01 23:19:16 +00:00
|
|
|
// REQUIRES: DB mutex held
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 19:38:12 +00:00
|
|
|
bool NeedsCompaction() const;
|
|
|
|
// REQUIRES: DB mutex held
|
2014-10-01 23:19:16 +00:00
|
|
|
Compaction* PickCompaction(const MutableCFOptions& mutable_options,
|
2020-07-23 01:31:25 +00:00
|
|
|
const MutableDBOptions& mutable_db_options,
|
2014-10-01 23:19:16 +00:00
|
|
|
LogBuffer* log_buffer);
|
2016-10-13 17:49:06 +00:00
|
|
|
|
|
|
|
// Check if the passed range overlap with any running compactions.
|
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
bool RangeOverlapWithCompaction(const Slice& smallest_user_key,
|
|
|
|
const Slice& largest_user_key,
|
|
|
|
int level) const;
|
|
|
|
|
2018-02-28 01:08:34 +00:00
|
|
|
// Check if the passed ranges overlap with any unflushed memtables
|
|
|
|
// (immutable or mutable).
|
|
|
|
//
|
|
|
|
// @param super_version A referenced SuperVersion that will be held for the
|
|
|
|
// duration of this function.
|
|
|
|
//
|
|
|
|
// Thread-safe
|
2024-02-07 02:35:36 +00:00
|
|
|
Status RangesOverlapWithMemtables(const autovector<UserKeyRange>& ranges,
|
2020-10-28 17:11:13 +00:00
|
|
|
SuperVersion* super_version,
|
|
|
|
bool allow_data_in_errors, bool* overlap);
|
2018-02-28 01:08:34 +00:00
|
|
|
|
2015-04-22 23:55:22 +00:00
|
|
|
// A flag to tell a manual compaction is to compact all levels together
|
2017-10-18 19:18:43 +00:00
|
|
|
// instead of a specific level.
|
2015-04-22 23:55:22 +00:00
|
|
|
static const int kCompactAllLevels;
|
2015-04-15 04:45:20 +00:00
|
|
|
// A flag to tell a manual compaction's output is base level.
|
|
|
|
static const int kCompactToBaseLevel;
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 19:38:12 +00:00
|
|
|
// REQUIRES: DB mutex held
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 19:20:34 +00:00
|
|
|
Compaction* CompactRange(const MutableCFOptions& mutable_cf_options,
|
2020-07-23 01:31:25 +00:00
|
|
|
const MutableDBOptions& mutable_db_options,
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 19:20:34 +00:00
|
|
|
int input_level, int output_level,
|
2019-04-17 06:29:32 +00:00
|
|
|
const CompactRangeOptions& compact_range_options,
|
2018-04-27 18:48:21 +00:00
|
|
|
const InternalKey* begin, const InternalKey* end,
|
2019-04-17 06:29:32 +00:00
|
|
|
InternalKey** compaction_end, bool* manual_conflict,
|
2022-03-12 00:13:23 +00:00
|
|
|
uint64_t max_file_num_to_ignore,
|
|
|
|
const std::string& trim_ts);
|
2014-01-31 23:30:27 +00:00
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
CompactionPicker* compaction_picker() { return compaction_picker_.get(); }
|
|
|
|
// thread-safe
|
2014-02-05 00:31:18 +00:00
|
|
|
const Comparator* user_comparator() const {
|
|
|
|
return internal_comparator_.user_comparator();
|
|
|
|
}
|
2014-03-11 21:52:17 +00:00
|
|
|
// thread-safe
|
2014-02-05 00:31:18 +00:00
|
|
|
const InternalKeyComparator& internal_comparator() const {
|
|
|
|
return internal_comparator_;
|
|
|
|
}
|
2014-01-31 23:30:27 +00:00
|
|
|
|
2024-02-02 22:14:43 +00:00
|
|
|
const InternalTblPropCollFactories* internal_tbl_prop_coll_factories() const {
|
|
|
|
return &internal_tbl_prop_coll_factories_;
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 17:04:30 +00:00
|
|
|
}
|
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
SuperVersion* GetSuperVersion() { return super_version_; }
|
|
|
|
// thread-safe
|
2014-04-14 16:34:59 +00:00
|
|
|
// Return a already referenced SuperVersion to be used safely.
|
2019-12-17 21:20:42 +00:00
|
|
|
SuperVersion* GetReferencedSuperVersion(DBImpl* db);
|
2014-04-14 16:34:59 +00:00
|
|
|
// thread-safe
|
|
|
|
// Get SuperVersion stored in thread local storage. If it does not exist,
|
|
|
|
// get a reference from a current SuperVersion.
|
2019-12-17 21:20:42 +00:00
|
|
|
SuperVersion* GetThreadLocalSuperVersion(DBImpl* db);
|
2021-03-26 04:17:17 +00:00
|
|
|
// Try to return SuperVersion back to thread local storage. Return true on
|
2014-04-14 16:34:59 +00:00
|
|
|
// success and false on failure. It fails when the thread local storage
|
|
|
|
// contains anything other than SuperVersion::kSVInUse flag.
|
|
|
|
bool ReturnThreadLocalSuperVersion(SuperVersion* sv);
|
2014-03-11 21:52:17 +00:00
|
|
|
// thread-safe
|
2014-01-29 21:28:50 +00:00
|
|
|
uint64_t GetSuperVersionNumber() const {
|
|
|
|
return super_version_number_.load();
|
|
|
|
}
|
|
|
|
// will return a pointer to SuperVersion* if previous SuperVersion
|
|
|
|
// if its reference count is zero and needs deletion or nullptr if not
|
|
|
|
// As argument takes a pointer to allocated SuperVersion to enable
|
|
|
|
// the clients to allocate SuperVersion outside of mutex.
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 19:38:12 +00:00
|
|
|
// IMPORTANT: Only call this from DBImpl::InstallSuperVersion()
|
2017-10-06 01:00:38 +00:00
|
|
|
void InstallSuperVersion(SuperVersionContext* sv_context,
|
Make mempurge a background process (equivalent to in-memory compaction). (#8505)
Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
2021-07-10 00:16:00 +00:00
|
|
|
const MutableCFOptions& mutable_cf_options);
|
2017-10-06 01:00:38 +00:00
|
|
|
void InstallSuperVersion(SuperVersionContext* sv_context,
|
|
|
|
InstrumentedMutex* db_mutex);
|
2014-03-04 01:54:04 +00:00
|
|
|
|
|
|
|
void ResetThreadLocalSuperVersions();
|
2014-01-29 21:28:50 +00:00
|
|
|
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 19:38:12 +00:00
|
|
|
// Protected by DB mutex
|
2018-04-27 04:09:53 +00:00
|
|
|
void set_queued_for_flush(bool value) { queued_for_flush_ = value; }
|
2018-04-27 18:11:12 +00:00
|
|
|
void set_queued_for_compaction(bool value) { queued_for_compaction_ = value; }
|
2018-04-27 04:09:53 +00:00
|
|
|
bool queued_for_flush() { return queued_for_flush_; }
|
2018-04-27 18:11:12 +00:00
|
|
|
bool queued_for_compaction() { return queued_for_compaction_; }
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 19:38:12 +00:00
|
|
|
|
2018-02-12 23:34:39 +00:00
|
|
|
static std::pair<WriteStallCondition, WriteStallCause>
|
2021-03-09 10:19:28 +00:00
|
|
|
GetWriteStallConditionAndCause(
|
|
|
|
int num_unflushed_memtables, int num_l0_files,
|
|
|
|
uint64_t num_compaction_needed_bytes,
|
|
|
|
const MutableCFOptions& mutable_cf_options,
|
|
|
|
const ImmutableCFOptions& immutable_cf_options);
|
2018-02-12 23:34:39 +00:00
|
|
|
|
2022-06-22 22:45:21 +00:00
|
|
|
// Recalculate some stall conditions, which are changed only during
|
|
|
|
// compaction, adding new memtable and/or recalculation of compaction score.
|
2017-10-06 01:00:38 +00:00
|
|
|
WriteStallCondition RecalculateWriteStallConditions(
|
When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
2015-12-18 01:07:44 +00:00
|
|
|
const MutableCFOptions& mutable_cf_options);
|
|
|
|
|
2017-06-22 22:45:42 +00:00
|
|
|
void set_initialized() { initialized_.store(true); }
|
|
|
|
|
|
|
|
bool initialized() const { return initialized_.load(); }
|
|
|
|
|
2018-03-21 00:07:28 +00:00
|
|
|
const ColumnFamilyOptions& initial_cf_options() {
|
|
|
|
return initial_cf_options_;
|
|
|
|
}
|
|
|
|
|
2020-02-03 22:11:50 +00:00
|
|
|
// created_dirs remembers directory created, so that we don't need to call
|
|
|
|
// the same data creation operation again.
|
|
|
|
Status AddDirectories(
|
2020-03-03 00:14:00 +00:00
|
|
|
std::map<std::string, std::shared_ptr<FSDirectory>>* created_dirs);
|
2018-04-06 02:49:06 +00:00
|
|
|
|
2020-03-03 00:14:00 +00:00
|
|
|
FSDirectory* GetDataDir(size_t path_id) const;
|
2018-04-06 02:49:06 +00:00
|
|
|
|
2020-12-05 22:17:11 +00:00
|
|
|
// full_history_ts_low_ can only increase.
|
|
|
|
void SetFullHistoryTsLow(std::string ts_low) {
|
|
|
|
assert(!ts_low.empty());
|
|
|
|
const Comparator* ucmp = user_comparator();
|
|
|
|
assert(ucmp);
|
|
|
|
if (full_history_ts_low_.empty() ||
|
|
|
|
ucmp->CompareTimestamp(ts_low, full_history_ts_low_) > 0) {
|
|
|
|
full_history_ts_low_ = std::move(ts_low);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
const std::string& GetFullHistoryTsLow() const {
|
|
|
|
return full_history_ts_low_;
|
|
|
|
}
|
|
|
|
|
2023-07-26 23:25:06 +00:00
|
|
|
// REQUIRES: DB mutex held.
|
|
|
|
// Return true if flushing up to MemTables with ID `max_memtable_id`
|
|
|
|
// should be postponed to retain user-defined timestamps according to the
|
|
|
|
// user's setting. Called by background flush job.
|
|
|
|
bool ShouldPostponeFlushToRetainUDT(uint64_t max_memtable_id);
|
|
|
|
|
2019-01-02 19:40:12 +00:00
|
|
|
ThreadLocalPtr* TEST_GetLocalSV() { return local_sv_.get(); }
|
Make mempurge a background process (equivalent to in-memory compaction). (#8505)
Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
2021-07-10 00:16:00 +00:00
|
|
|
WriteBufferManager* write_buffer_mgr() { return write_buffer_manager_; }
|
Account memory of FileMetaData in global memory limit (#9924)
Summary:
**Context/Summary:**
As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924
Test Plan:
- Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released
- New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
- db bench (CPU cost of `charge_file_metadata` in write and compact)
- **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
- **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`
table 1 - write
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**
table 2 - compact
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**
- stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal
Reviewed By: ajkr
Differential Revision: D36055583
Pulled By: hx235
fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2022-06-14 20:06:40 +00:00
|
|
|
std::shared_ptr<CacheReservationManager>
|
|
|
|
GetFileMetadataCacheReservationManager() {
|
|
|
|
return file_metadata_cache_res_mgr_;
|
|
|
|
}
|
2019-01-02 19:40:12 +00:00
|
|
|
|
2022-03-24 20:05:17 +00:00
|
|
|
static const uint32_t kDummyColumnFamilyDataId;
|
|
|
|
|
2022-06-23 16:42:18 +00:00
|
|
|
// Keep track of whether the mempurge feature was ever used.
|
|
|
|
void SetMempurgeUsed() { mempurge_used_ = true; }
|
|
|
|
bool GetMempurgeUsed() { return mempurge_used_; }
|
|
|
|
|
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: ajkr
Differential Revision: D41063187
Pulled By: hx235
fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
|
|
|
// Allocate and return a new epoch number
|
|
|
|
uint64_t NewEpochNumber() { return next_epoch_number_.fetch_add(1); }
|
|
|
|
|
|
|
|
// Get the next epoch number to be assigned
|
|
|
|
uint64_t GetNextEpochNumber() const { return next_epoch_number_.load(); }
|
|
|
|
|
|
|
|
// Set the next epoch number to be assigned
|
|
|
|
void SetNextEpochNumber(uint64_t next_epoch_number) {
|
|
|
|
next_epoch_number_.store(next_epoch_number);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Reset the next epoch number to be assigned
|
|
|
|
void ResetNextEpochNumber() { next_epoch_number_.store(1); }
|
|
|
|
|
|
|
|
// Recover the next epoch number of this CF and epoch number
|
|
|
|
// of its files (if missing)
|
|
|
|
void RecoverEpochNumbers();
|
|
|
|
|
2024-06-14 20:37:37 +00:00
|
|
|
int GetUnflushedMemTableCountForWriteStallCheck() const {
|
|
|
|
return (mem_->IsEmpty() ? 0 : 1) + imm_.NumNotFlushed();
|
|
|
|
}
|
|
|
|
|
2014-01-29 21:28:50 +00:00
|
|
|
private:
|
2014-01-31 00:49:46 +00:00
|
|
|
friend class ColumnFamilySet;
|
2014-07-30 20:53:08 +00:00
|
|
|
ColumnFamilyData(uint32_t id, const std::string& name,
|
|
|
|
Version* dummy_versions, Cache* table_cache,
|
2016-06-21 01:01:03 +00:00
|
|
|
WriteBufferManager* write_buffer_manager,
|
2014-07-30 20:53:08 +00:00
|
|
|
const ColumnFamilyOptions& options,
|
2016-09-23 23:34:04 +00:00
|
|
|
const ImmutableDBOptions& db_options,
|
Fix use-after-free on implicit temporary FileOptions (#8571)
Summary:
FileOptions has an implicit conversion from EnvOptions and some
internal APIs take `const FileOptions&` and save the reference, which is
counter to Google C++ guidelines,
> Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements.
This is at least a problem for repair.cc, which passes an EnvOptions to
TableCache(), which would save a reference to the temporary copy as
FileOptions. This was unfortunately only caught as a side effect of
changes in https://github.com/facebook/rocksdb/issues/8544.
This change fixes the repair.cc case and updates the involved internal
APIs that save a reference to use `const FileOptions*` instead.
Unfortunately, I don't know how to get any of our sanitizers to reliably
report bugs like this, so I can't rule out more existing in our
codebase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571
Test Plan:
Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on
AWS EC2)
Reviewed By: ajkr
Differential Revision: D29943890
Pulled By: pdillinger
fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9
2021-07-28 04:48:22 +00:00
|
|
|
const FileOptions* file_options,
|
2019-06-13 22:39:52 +00:00
|
|
|
ColumnFamilySet* column_family_set,
|
2020-08-27 18:20:08 +00:00
|
|
|
BlockCacheTracer* const block_cache_tracer,
|
2021-06-10 18:01:44 +00:00
|
|
|
const std::shared_ptr<IOTracer>& io_tracer,
|
2022-06-21 03:58:11 +00:00
|
|
|
const std::string& db_id, const std::string& db_session_id);
|
2014-01-31 00:49:46 +00:00
|
|
|
|
2020-03-12 01:36:43 +00:00
|
|
|
std::vector<std::string> GetDbPaths() const;
|
|
|
|
|
2014-01-29 21:28:50 +00:00
|
|
|
uint32_t id_;
|
|
|
|
const std::string name_;
|
|
|
|
Version* dummy_versions_; // Head of circular doubly-linked list of versions.
|
|
|
|
Version* current_; // == dummy_versions->prev_
|
|
|
|
|
2022-11-02 21:34:24 +00:00
|
|
|
std::atomic<int> refs_; // outstanding references to ColumnFamilyData
|
2017-06-22 22:45:42 +00:00
|
|
|
std::atomic<bool> initialized_;
|
2017-12-11 21:46:57 +00:00
|
|
|
std::atomic<bool> dropped_; // true if client dropped it
|
2014-02-11 01:04:44 +00:00
|
|
|
|
2024-06-13 20:18:10 +00:00
|
|
|
// When user-defined timestamps in memtable only feature is enabled, this
|
|
|
|
// flag indicates a successfully requested flush that should
|
|
|
|
// skip being rescheduled and haven't undergone the rescheduling check yet.
|
|
|
|
// This flag is cleared when a check skips rescheduling a FlushRequest.
|
|
|
|
// With this flag, automatic flushes in regular cases can continue to
|
|
|
|
// retain UDTs by getting rescheduled as usual while manual flushes and
|
|
|
|
// error recovery flushes will proceed without getting rescheduled.
|
|
|
|
std::atomic<bool> flush_skip_reschedule_;
|
|
|
|
|
2014-02-05 00:31:18 +00:00
|
|
|
const InternalKeyComparator internal_comparator_;
|
2024-02-02 22:14:43 +00:00
|
|
|
InternalTblPropCollFactories internal_tbl_prop_coll_factories_;
|
2014-02-05 00:31:18 +00:00
|
|
|
|
2016-09-23 23:34:04 +00:00
|
|
|
const ColumnFamilyOptions initial_cf_options_;
|
2021-05-05 20:59:21 +00:00
|
|
|
const ImmutableOptions ioptions_;
|
2014-09-17 19:49:13 +00:00
|
|
|
MutableCFOptions mutable_cf_options_;
|
2014-01-31 23:30:27 +00:00
|
|
|
|
2016-11-15 23:18:56 +00:00
|
|
|
const bool is_delete_range_supported_;
|
|
|
|
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 17:07:55 +00:00
|
|
|
std::unique_ptr<TableCache> table_cache_;
|
2020-10-15 20:02:44 +00:00
|
|
|
std::unique_ptr<BlobFileCache> blob_file_cache_;
|
2022-06-21 03:58:11 +00:00
|
|
|
std::unique_ptr<BlobSource> blob_source_;
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 17:07:55 +00:00
|
|
|
|
2014-02-05 01:45:19 +00:00
|
|
|
std::unique_ptr<InternalStats> internal_stats_;
|
|
|
|
|
2016-06-21 01:01:03 +00:00
|
|
|
WriteBufferManager* write_buffer_manager_;
|
2014-12-02 20:09:20 +00:00
|
|
|
|
2014-01-29 21:28:50 +00:00
|
|
|
MemTable* mem_;
|
|
|
|
MemTableList imm_;
|
|
|
|
SuperVersion* super_version_;
|
|
|
|
|
|
|
|
// An ordinal representing the current SuperVersion. Updated by
|
|
|
|
// InstallSuperVersion(), i.e. incremented every time super_version_
|
|
|
|
// changes.
|
|
|
|
std::atomic<uint64_t> super_version_number_;
|
|
|
|
|
2014-03-04 01:54:04 +00:00
|
|
|
// Thread's local copy of SuperVersion pointer
|
|
|
|
// This needs to be destructed before mutex_
|
2014-03-04 17:03:56 +00:00
|
|
|
std::unique_ptr<ThreadLocalPtr> local_sv_;
|
2014-03-04 01:54:04 +00:00
|
|
|
|
2015-03-20 00:04:29 +00:00
|
|
|
// pointers for a circular linked list. we use it to support iterations over
|
|
|
|
// all column families that are alive (note: dropped column families can also
|
|
|
|
// be alive as long as client holds a reference)
|
2014-02-11 01:04:44 +00:00
|
|
|
ColumnFamilyData* next_;
|
|
|
|
ColumnFamilyData* prev_;
|
2014-01-31 00:49:46 +00:00
|
|
|
|
2014-01-29 21:28:50 +00:00
|
|
|
// This is the earliest log file number that contains data from this
|
|
|
|
// Column Family. All earlier log files must be ignored and not
|
|
|
|
// recovered from
|
|
|
|
uint64_t log_number_;
|
2014-01-30 23:23:13 +00:00
|
|
|
|
2014-01-31 23:30:27 +00:00
|
|
|
// An object that keeps all the compaction stats
|
|
|
|
// and picks the next compaction
|
|
|
|
std::unique_ptr<CompactionPicker> compaction_picker_;
|
2014-02-11 01:04:44 +00:00
|
|
|
|
|
|
|
ColumnFamilySet* column_family_set_;
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 18:20:25 +00:00
|
|
|
|
|
|
|
std::unique_ptr<WriteControllerToken> write_controller_token_;
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 19:38:12 +00:00
|
|
|
|
|
|
|
// If true --> this ColumnFamily is currently present in DBImpl::flush_queue_
|
2018-04-27 04:09:53 +00:00
|
|
|
bool queued_for_flush_;
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 19:38:12 +00:00
|
|
|
|
|
|
|
// If true --> this ColumnFamily is currently present in
|
|
|
|
// DBImpl::compaction_queue_
|
2018-04-27 18:11:12 +00:00
|
|
|
bool queued_for_compaction_;
|
When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
2015-12-18 01:07:44 +00:00
|
|
|
|
|
|
|
uint64_t prev_compaction_needed_bytes_;
|
2017-01-19 23:21:07 +00:00
|
|
|
|
|
|
|
// if the database was opened with 2pc enabled
|
|
|
|
bool allow_2pc_;
|
2018-01-19 01:32:50 +00:00
|
|
|
|
|
|
|
// Memtable id to track flush.
|
|
|
|
std::atomic<uint64_t> last_memtable_id_;
|
2018-04-06 02:49:06 +00:00
|
|
|
|
|
|
|
// Directories corresponding to cf_paths.
|
2020-03-03 00:14:00 +00:00
|
|
|
std::vector<std::shared_ptr<FSDirectory>> data_dirs_;
|
2020-03-12 01:36:43 +00:00
|
|
|
|
|
|
|
bool db_paths_registered_;
|
2020-12-05 22:17:11 +00:00
|
|
|
|
|
|
|
std::string full_history_ts_low_;
|
Account memory of FileMetaData in global memory limit (#9924)
Summary:
**Context/Summary:**
As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924
Test Plan:
- Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released
- New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
- db bench (CPU cost of `charge_file_metadata` in write and compact)
- **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
- **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`
table 1 - write
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**
table 2 - compact
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**
- stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal
Reviewed By: ajkr
Differential Revision: D36055583
Pulled By: hx235
fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2022-06-14 20:06:40 +00:00
|
|
|
|
|
|
|
// For charging memory usage of file metadata created for newly added files to
|
|
|
|
// a Version associated with this CFD
|
|
|
|
std::shared_ptr<CacheReservationManager> file_metadata_cache_res_mgr_;
|
2022-06-23 16:42:18 +00:00
|
|
|
bool mempurge_used_;
|
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: ajkr
Differential Revision: D41063187
Pulled By: hx235
fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
|
|
|
|
|
|
|
std::atomic<uint64_t> next_epoch_number_;
|
2014-01-22 19:44:53 +00:00
|
|
|
};
|
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
// ColumnFamilySet has interesting thread-safety requirements
|
2015-01-06 20:44:21 +00:00
|
|
|
// * CreateColumnFamily() or RemoveColumnFamily() -- need to be protected by DB
|
|
|
|
// mutex AND executed in the write thread.
|
|
|
|
// CreateColumnFamily() should ONLY be called from VersionSet::LogAndApply() AND
|
|
|
|
// single-threaded write thread. It is also called during Recovery and in
|
|
|
|
// DumpManifest().
|
|
|
|
// RemoveColumnFamily() is only called from SetDropped(). DB mutex needs to be
|
|
|
|
// held and it needs to be executed from the write thread. SetDropped() also
|
|
|
|
// guarantees that it will be called only from single-threaded LogAndApply(),
|
|
|
|
// but this condition is not that important.
|
2022-03-24 20:05:17 +00:00
|
|
|
// * Iteration -- hold DB mutex. If you want to release the DB mutex in the
|
|
|
|
// body of the iteration, wrap in a RefedColumnFamilySet.
|
2014-03-11 21:52:17 +00:00
|
|
|
// * GetDefault() -- thread safe
|
2015-01-06 20:44:21 +00:00
|
|
|
// * GetColumnFamily() -- either inside of DB mutex or from a write thread
|
2014-06-02 22:33:54 +00:00
|
|
|
// * GetNextColumnFamilyID(), GetMaxColumnFamily(), UpdateMaxColumnFamily(),
|
|
|
|
// NumberOfColumnFamilies -- inside of DB mutex
|
2014-01-22 19:44:53 +00:00
|
|
|
class ColumnFamilySet {
|
|
|
|
public:
|
2014-03-11 21:52:17 +00:00
|
|
|
// ColumnFamilySet supports iteration
|
2014-01-24 22:30:28 +00:00
|
|
|
class iterator {
|
|
|
|
public:
|
2022-11-02 21:34:24 +00:00
|
|
|
explicit iterator(ColumnFamilyData* cfd) : current_(cfd) {}
|
2022-03-24 20:05:17 +00:00
|
|
|
// NOTE: minimum operators for for-loop iteration
|
2014-01-24 22:30:28 +00:00
|
|
|
iterator& operator++() {
|
2022-03-24 20:05:17 +00:00
|
|
|
current_ = current_->next_;
|
2014-01-24 22:30:28 +00:00
|
|
|
return *this;
|
|
|
|
}
|
2022-03-24 20:05:17 +00:00
|
|
|
bool operator!=(const iterator& other) const {
|
2014-01-31 00:49:46 +00:00
|
|
|
return this->current_ != other.current_;
|
|
|
|
}
|
|
|
|
ColumnFamilyData* operator*() { return current_; }
|
2014-01-24 22:30:28 +00:00
|
|
|
|
|
|
|
private:
|
2014-01-31 00:49:46 +00:00
|
|
|
ColumnFamilyData* current_;
|
2014-01-24 22:30:28 +00:00
|
|
|
};
|
2014-01-22 19:44:53 +00:00
|
|
|
|
2016-09-23 23:34:04 +00:00
|
|
|
ColumnFamilySet(const std::string& dbname,
|
|
|
|
const ImmutableDBOptions* db_options,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 22:47:08 +00:00
|
|
|
const FileOptions& file_options, Cache* table_cache,
|
2020-03-21 02:17:54 +00:00
|
|
|
WriteBufferManager* _write_buffer_manager,
|
|
|
|
WriteController* _write_controller,
|
2020-08-27 18:20:08 +00:00
|
|
|
BlockCacheTracer* const block_cache_tracer,
|
2021-06-10 18:01:44 +00:00
|
|
|
const std::shared_ptr<IOTracer>& io_tracer,
|
2022-06-21 03:58:11 +00:00
|
|
|
const std::string& db_id, const std::string& db_session_id);
|
2014-01-22 19:44:53 +00:00
|
|
|
~ColumnFamilySet();
|
|
|
|
|
|
|
|
ColumnFamilyData* GetDefault() const;
|
|
|
|
// GetColumnFamily() calls return nullptr if column family is not found
|
|
|
|
ColumnFamilyData* GetColumnFamily(uint32_t id) const;
|
2014-02-28 22:05:11 +00:00
|
|
|
ColumnFamilyData* GetColumnFamily(const std::string& name) const;
|
2014-01-22 19:44:53 +00:00
|
|
|
// this call will return the next available column family ID. it guarantees
|
|
|
|
// that there is no column family with id greater than or equal to the
|
2014-03-05 20:13:44 +00:00
|
|
|
// returned value in the current running instance or anytime in RocksDB
|
|
|
|
// instance history.
|
2014-01-22 19:44:53 +00:00
|
|
|
uint32_t GetNextColumnFamilyID();
|
2014-03-05 20:13:44 +00:00
|
|
|
uint32_t GetMaxColumnFamily();
|
|
|
|
void UpdateMaxColumnFamily(uint32_t new_max_column_family);
|
2014-06-02 22:33:54 +00:00
|
|
|
size_t NumberOfColumnFamilies() const;
|
2014-01-22 19:44:53 +00:00
|
|
|
|
|
|
|
ColumnFamilyData* CreateColumnFamily(const std::string& name, uint32_t id,
|
|
|
|
Version* dummy_version,
|
|
|
|
const ColumnFamilyOptions& options);
|
|
|
|
|
2023-06-05 20:36:26 +00:00
|
|
|
const UnorderedMap<uint32_t, size_t>& GetRunningColumnFamiliesTimestampSize()
|
|
|
|
const {
|
2023-05-31 02:32:00 +00:00
|
|
|
return running_ts_sz_;
|
|
|
|
}
|
|
|
|
|
2023-06-05 20:36:26 +00:00
|
|
|
const UnorderedMap<uint32_t, size_t>&
|
2023-05-31 02:32:00 +00:00
|
|
|
GetColumnFamiliesTimestampSizeForRecord() const {
|
|
|
|
return ts_sz_for_record_;
|
|
|
|
}
|
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
iterator begin() { return iterator(dummy_cfd_->next_); }
|
2014-01-31 00:49:46 +00:00
|
|
|
iterator end() { return iterator(dummy_cfd_); }
|
2014-01-22 19:44:53 +00:00
|
|
|
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 17:42:39 +00:00
|
|
|
Cache* get_table_cache() { return table_cache_; }
|
|
|
|
|
2020-03-21 02:17:54 +00:00
|
|
|
WriteBufferManager* write_buffer_manager() { return write_buffer_manager_; }
|
|
|
|
|
|
|
|
WriteController* write_controller() { return write_controller_; }
|
|
|
|
|
2014-01-22 19:44:53 +00:00
|
|
|
private:
|
2014-03-11 21:52:17 +00:00
|
|
|
friend class ColumnFamilyData;
|
|
|
|
// helper function that gets called from cfd destructor
|
|
|
|
// REQUIRES: DB mutex held
|
|
|
|
void RemoveColumnFamily(ColumnFamilyData* cfd);
|
|
|
|
|
|
|
|
// column_families_ and column_family_data_ need to be protected:
|
2015-01-06 20:44:21 +00:00
|
|
|
// * when mutating both conditions have to be satisfied:
|
|
|
|
// 1. DB mutex locked
|
|
|
|
// 2. thread currently in single-threaded write thread
|
|
|
|
// * when reading, at least one condition needs to be satisfied:
|
|
|
|
// 1. DB mutex locked
|
|
|
|
// 2. accessed from a single-threaded write thread
|
Meta-internal folly integration with F14FastMap (#9546)
Summary:
Especially after updating to C++17, I don't see a compelling case for
*requiring* any folly components in RocksDB. I was able to purge the existing
hard dependencies, and it can be quite difficult to strip out non-trivial components
from folly for use in RocksDB. (The prospect of doing that on F14 has changed
my mind on the best approach here.)
But this change creates an optional integration where we can plug in
components from folly at compile time, starting here with F14FastMap to replace
std::unordered_map when possible (probably no public APIs for example). I have
replaced the biggest CPU users of std::unordered_map with compile-time
pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
that is in the Makefile for public CI testing. A full folly build is not needed, but
checking out the full folly repo is much simpler for getting the dependency,
and anything else we might want to optionally integrate in the future.
Some picky details:
* I don't think the distributed mutex stuff is actually used, so it was easy to remove.
* I implemented an alternative to `folly::constexpr_log2` (which is much easier
in C++17 than C++11) so that I could pull out the hard dependencies on
`ConstexprMath.h`
* I had to add noexcept move constructors/operators to some types to make
F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
macro to make that easier in some common cases.
* Updated Meta-internal buck build to use folly F14Map (always)
No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
production integration for open source users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
Test Plan:
CircleCI tests updated so that a couple of them use folly.
Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
(Note: they should probably use buck but they currently use Makefile.)
Example performance improvement: when filter partitions are pinned in cache,
they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
a test that exercises that heavily. Build DB with
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
```
and test with (simultaneous runs with & without folly, ~20 times each to see
convergence)
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
```
Average ops/s no folly: 26229.2
Average ops/s with folly: 26853.3 (+2.4%)
Reviewed By: ajkr
Differential Revision: D34181736
Pulled By: pdillinger
fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
2022-04-13 14:34:01 +00:00
|
|
|
UnorderedMap<std::string, uint32_t> column_families_;
|
|
|
|
UnorderedMap<uint32_t, ColumnFamilyData*> column_family_data_;
|
2014-03-11 21:52:17 +00:00
|
|
|
|
2023-05-31 02:32:00 +00:00
|
|
|
// Mutating / reading `running_ts_sz_` and `ts_sz_for_record_` follow
|
|
|
|
// the same requirements as `column_families_` and `column_family_data_`.
|
|
|
|
// Mapping from column family id to user-defined timestamp size for all
|
|
|
|
// running column families.
|
2023-06-05 20:36:26 +00:00
|
|
|
UnorderedMap<uint32_t, size_t> running_ts_sz_;
|
2023-05-31 02:32:00 +00:00
|
|
|
// Mapping from column family id to user-defined timestamp size for
|
|
|
|
// column families with non-zero user-defined timestamp size.
|
2023-06-05 20:36:26 +00:00
|
|
|
UnorderedMap<uint32_t, size_t> ts_sz_for_record_;
|
2023-05-31 02:32:00 +00:00
|
|
|
|
2014-01-22 19:44:53 +00:00
|
|
|
uint32_t max_column_family_;
|
Fix use-after-free on implicit temporary FileOptions (#8571)
Summary:
FileOptions has an implicit conversion from EnvOptions and some
internal APIs take `const FileOptions&` and save the reference, which is
counter to Google C++ guidelines,
> Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements.
This is at least a problem for repair.cc, which passes an EnvOptions to
TableCache(), which would save a reference to the temporary copy as
FileOptions. This was unfortunately only caught as a side effect of
changes in https://github.com/facebook/rocksdb/issues/8544.
This change fixes the repair.cc case and updates the involved internal
APIs that save a reference to use `const FileOptions*` instead.
Unfortunately, I don't know how to get any of our sanitizers to reliably
report bugs like this, so I can't rule out more existing in our
codebase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571
Test Plan:
Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on
AWS EC2)
Reviewed By: ajkr
Differential Revision: D29943890
Pulled By: pdillinger
fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9
2021-07-28 04:48:22 +00:00
|
|
|
const FileOptions file_options_;
|
|
|
|
|
2014-01-31 00:49:46 +00:00
|
|
|
ColumnFamilyData* dummy_cfd_;
|
2014-03-11 21:52:17 +00:00
|
|
|
// We don't hold the refcount here, since default column family always exists
|
|
|
|
// We are also not responsible for cleaning up default_cfd_cache_. This is
|
|
|
|
// just a cache that makes common case (accessing default column family)
|
|
|
|
// faster
|
|
|
|
ColumnFamilyData* default_cfd_cache_;
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 17:07:55 +00:00
|
|
|
|
|
|
|
const std::string db_name_;
|
2016-09-23 23:34:04 +00:00
|
|
|
const ImmutableDBOptions* const db_options_;
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 17:07:55 +00:00
|
|
|
Cache* table_cache_;
|
2016-06-21 01:01:03 +00:00
|
|
|
WriteBufferManager* write_buffer_manager_;
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 18:20:25 +00:00
|
|
|
WriteController* write_controller_;
|
2019-06-13 22:39:52 +00:00
|
|
|
BlockCacheTracer* const block_cache_tracer_;
|
2020-08-27 18:20:08 +00:00
|
|
|
std::shared_ptr<IOTracer> io_tracer_;
|
2022-06-21 03:58:11 +00:00
|
|
|
const std::string& db_id_;
|
2021-06-10 18:01:44 +00:00
|
|
|
std::string db_session_id_;
|
2014-01-22 19:44:53 +00:00
|
|
|
};
|
|
|
|
|
2022-03-24 20:05:17 +00:00
|
|
|
// A wrapper for ColumnFamilySet that supports releasing DB mutex during each
|
|
|
|
// iteration over the iterator, because the cfd is Refed and Unrefed during
|
|
|
|
// each iteration to prevent concurrent CF drop from destroying it (until
|
|
|
|
// Unref).
|
|
|
|
class RefedColumnFamilySet {
|
|
|
|
public:
|
|
|
|
explicit RefedColumnFamilySet(ColumnFamilySet* cfs) : wrapped_(cfs) {}
|
|
|
|
|
|
|
|
class iterator {
|
|
|
|
public:
|
|
|
|
explicit iterator(ColumnFamilySet::iterator wrapped) : wrapped_(wrapped) {
|
|
|
|
MaybeRef(*wrapped_);
|
|
|
|
}
|
|
|
|
~iterator() { MaybeUnref(*wrapped_); }
|
|
|
|
inline void MaybeRef(ColumnFamilyData* cfd) {
|
|
|
|
if (cfd->GetID() != ColumnFamilyData::kDummyColumnFamilyDataId) {
|
|
|
|
cfd->Ref();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
inline void MaybeUnref(ColumnFamilyData* cfd) {
|
|
|
|
if (cfd->GetID() != ColumnFamilyData::kDummyColumnFamilyDataId) {
|
|
|
|
cfd->UnrefAndTryDelete();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
// NOTE: minimum operators for for-loop iteration
|
|
|
|
inline iterator& operator++() {
|
|
|
|
ColumnFamilyData* old = *wrapped_;
|
|
|
|
++wrapped_;
|
|
|
|
// Can only unref & potentially free cfd after accessing its next_
|
|
|
|
MaybeUnref(old);
|
|
|
|
MaybeRef(*wrapped_);
|
|
|
|
return *this;
|
|
|
|
}
|
|
|
|
inline bool operator!=(const iterator& other) const {
|
|
|
|
return this->wrapped_ != other.wrapped_;
|
|
|
|
}
|
|
|
|
inline ColumnFamilyData* operator*() { return *wrapped_; }
|
|
|
|
|
|
|
|
private:
|
|
|
|
ColumnFamilySet::iterator wrapped_;
|
|
|
|
};
|
|
|
|
|
|
|
|
iterator begin() { return iterator(wrapped_->begin()); }
|
|
|
|
iterator end() { return iterator(wrapped_->end()); }
|
|
|
|
|
|
|
|
private:
|
|
|
|
ColumnFamilySet* wrapped_;
|
|
|
|
};
|
|
|
|
|
2014-03-11 21:52:17 +00:00
|
|
|
// We use ColumnFamilyMemTablesImpl to provide WriteBatch a way to access
|
|
|
|
// memtables of different column families (specified by ID in the write batch)
|
2014-01-28 19:05:04 +00:00
|
|
|
class ColumnFamilyMemTablesImpl : public ColumnFamilyMemTables {
|
|
|
|
public:
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 23:59:07 +00:00
|
|
|
explicit ColumnFamilyMemTablesImpl(ColumnFamilySet* column_family_set)
|
|
|
|
: column_family_set_(column_family_set), current_(nullptr) {}
|
|
|
|
|
|
|
|
// Constructs a ColumnFamilyMemTablesImpl equivalent to one constructed
|
|
|
|
// with the arguments used to construct *orig.
|
|
|
|
explicit ColumnFamilyMemTablesImpl(ColumnFamilyMemTablesImpl* orig)
|
|
|
|
: column_family_set_(orig->column_family_set_), current_(nullptr) {}
|
2014-01-28 19:05:04 +00:00
|
|
|
|
2014-02-06 00:02:48 +00:00
|
|
|
// sets current_ to ColumnFamilyData with column_family_id
|
|
|
|
// returns false if column family doesn't exist
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 23:59:07 +00:00
|
|
|
// REQUIRES: use this function of DBImpl::column_family_memtables_ should be
|
|
|
|
// under a DB mutex OR from a write thread
|
2014-02-06 00:02:48 +00:00
|
|
|
bool Seek(uint32_t column_family_id) override;
|
|
|
|
|
|
|
|
// Returns log number of the selected column family
|
2015-01-06 20:44:21 +00:00
|
|
|
// REQUIRES: under a DB mutex OR from a write thread
|
2014-02-06 00:02:48 +00:00
|
|
|
uint64_t GetLogNumber() const override;
|
|
|
|
|
|
|
|
// REQUIRES: Seek() called first
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 23:59:07 +00:00
|
|
|
// REQUIRES: use this function of DBImpl::column_family_memtables_ should be
|
|
|
|
// under a DB mutex OR from a write thread
|
2024-01-31 21:14:42 +00:00
|
|
|
MemTable* GetMemTable() const override;
|
2014-01-28 19:05:04 +00:00
|
|
|
|
2014-02-06 00:02:48 +00:00
|
|
|
// Returns column family handle for the selected column family
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 23:59:07 +00:00
|
|
|
// REQUIRES: use this function of DBImpl::column_family_memtables_ should be
|
|
|
|
// under a DB mutex OR from a write thread
|
2024-01-31 21:14:42 +00:00
|
|
|
ColumnFamilyHandle* GetColumnFamilyHandle() override;
|
2014-01-28 19:05:04 +00:00
|
|
|
|
support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
2015-08-14 23:59:07 +00:00
|
|
|
// Cannot be called while another thread is calling Seek().
|
|
|
|
// REQUIRES: use this function of DBImpl::column_family_memtables_ should be
|
|
|
|
// under a DB mutex OR from a write thread
|
2024-01-31 21:14:42 +00:00
|
|
|
ColumnFamilyData* current() override { return current_; }
|
2014-09-11 01:46:09 +00:00
|
|
|
|
2014-01-28 19:05:04 +00:00
|
|
|
private:
|
|
|
|
ColumnFamilySet* column_family_set_;
|
2014-02-06 00:02:48 +00:00
|
|
|
ColumnFamilyData* current_;
|
2014-02-11 01:04:44 +00:00
|
|
|
ColumnFamilyHandleInternal handle_;
|
2014-01-28 19:05:04 +00:00
|
|
|
};
|
|
|
|
|
2024-01-29 18:38:08 +00:00
|
|
|
uint32_t GetColumnFamilyID(ColumnFamilyHandle* column_family);
|
2014-08-18 22:19:17 +00:00
|
|
|
|
2024-01-29 18:38:08 +00:00
|
|
|
const Comparator* GetColumnFamilyUserComparator(
|
2014-09-22 18:37:35 +00:00
|
|
|
ColumnFamilyHandle* column_family);
|
|
|
|
|
2024-01-29 18:38:08 +00:00
|
|
|
const ImmutableOptions& GetImmutableOptions(ColumnFamilyHandle* column_family);
|
2023-10-09 22:25:35 +00:00
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|