// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#include <atomic>
#include <cstdlib>
#include <functional>
#include <memory>

#include "db/db_test_util.h"
#include "db/read_callback.h"
#include "db/version_edit.h"
#include "options/options_helper.h"
#include "port/port.h"
#include "port/stack_trace.h"
#include "rocksdb/experimental.h"
#include "rocksdb/iostats_context.h"
#include "rocksdb/persistent_cache.h"
#include "rocksdb/trace_record.h"
#include "rocksdb/trace_record_result.h"
#include "rocksdb/utilities/replayer.h"
#include "rocksdb/wal_filter.h"
#include "test_util/testutil.h"
#include "util/random.h"
#include "utilities/fault_injection_env.h"

namespace ROCKSDB_NAMESPACE {
class DBTest2 : public DBTestBase {
 public:
  DBTest2() : DBTestBase("db_test2", /*env_do_fsync=*/true) {}

  // Returns the FileMetaData entries of all files at `level` in the current
  // version of column family `cf`.
  std::vector<FileMetaData*> GetLevelFileMetadatas(int level, int cf = 0) {
    VersionSet* const versions = dbfull()->GetVersionSet();
    assert(versions);
    ColumnFamilyData* const cfd =
        versions->GetColumnFamilySet()->GetColumnFamily(cf);
    assert(cfd);
    Version* const current = cfd->current();
    assert(current);
    VersionStorageInfo* const storage_info = current->storage_info();
    assert(storage_info);
    return storage_info->LevelFiles(level);
  }
};

TEST_F(DBTest2, OpenForReadOnly) {
  DB* db_ptr = nullptr;
  std::string dbname = test::PerThreadDBPath("db_readonly");
  Options options = CurrentOptions();
  options.create_if_missing = true;
  // OpenForReadOnly should fail but will create <dbname> in the file system
  ASSERT_NOK(DB::OpenForReadOnly(options, dbname, &db_ptr));
  // Since <dbname> is created, we should be able to delete the dir
  // We first get the list of files under <dbname>
  // There should not be any subdirectories -- this is not checked here
  std::vector<std::string> files;
  ASSERT_OK(env_->GetChildren(dbname, &files));
  for (auto& f : files) {
    ASSERT_OK(env_->DeleteFile(dbname + "/" + f));
  }
  // <dbname> should be empty now and we should be able to delete it
  ASSERT_OK(env_->DeleteDir(dbname));
  options.create_if_missing = false;
  // OpenForReadOnly should fail since <dbname> was successfully deleted
  ASSERT_NOK(DB::OpenForReadOnly(options, dbname, &db_ptr));
  // With create_if_missing false, there should not be a dir in the file system
  ASSERT_NOK(env_->FileExists(dbname));
}

TEST_F(DBTest2, OpenForReadOnlyWithColumnFamilies) {
  DB* db_ptr = nullptr;
  std::string dbname = test::PerThreadDBPath("db_readonly");
  Options options = CurrentOptions();
  options.create_if_missing = true;

  ColumnFamilyOptions cf_options(options);
  std::vector<ColumnFamilyDescriptor> column_families;
  column_families.emplace_back(kDefaultColumnFamilyName, cf_options);
  column_families.emplace_back("goku", cf_options);
  std::vector<ColumnFamilyHandle*> handles;
  // OpenForReadOnly should fail but will create <dbname> in the file system
  ASSERT_NOK(
      DB::OpenForReadOnly(options, dbname, column_families, &handles, &db_ptr));
  // Since <dbname> is created, we should be able to delete the dir
  // We first get the list of files under <dbname>
  // There should not be any subdirectories -- this is not checked here
  std::vector<std::string> files;
  ASSERT_OK(env_->GetChildren(dbname, &files));
  for (auto& f : files) {
    ASSERT_OK(env_->DeleteFile(dbname + "/" + f));
  }
  // <dbname> should be empty now and we should be able to delete it
  ASSERT_OK(env_->DeleteDir(dbname));
  options.create_if_missing = false;
  // OpenForReadOnly should fail since <dbname> was successfully deleted
  ASSERT_NOK(
      DB::OpenForReadOnly(options, dbname, column_families, &handles, &db_ptr));
  // With create_if_missing false, there should not be a dir in the file system
  ASSERT_NOK(env_->FileExists(dbname));
}

class PartitionedIndexTestListener : public EventListener {
 public:
  void OnFlushCompleted(DB* /*db*/, const FlushJobInfo& info) override {
    ASSERT_GT(info.table_properties.index_partitions, 1);
    ASSERT_EQ(info.table_properties.index_key_is_user_key, 0);
  }
};

TEST_F(DBTest2, PartitionedIndexUserToInternalKey) {
  const int kValueSize = 10500;
  const int kNumEntriesPerFile = 1000;
  const int kNumFiles = 3;
  const int kNumDistinctKeys = 30;

  BlockBasedTableOptions table_options;
  Options options = CurrentOptions();
  options.disable_auto_compactions = true;
  table_options.index_type = BlockBasedTableOptions::kTwoLevelIndexSearch;
  PartitionedIndexTestListener* listener = new PartitionedIndexTestListener();
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));
  options.listeners.emplace_back(listener);
  std::vector<const Snapshot*> snapshots;
  Reopen(options);
  Random rnd(301);

  for (int i = 0; i < kNumFiles; i++) {
    for (int j = 0; j < kNumEntriesPerFile; j++) {
      int key_id = (i * kNumEntriesPerFile + j) % kNumDistinctKeys;
      std::string value = rnd.RandomString(kValueSize);
      ASSERT_OK(Put("keykey_" + std::to_string(key_id), value));
      snapshots.push_back(db_->GetSnapshot());
    }
    ASSERT_OK(Flush());
  }

  for (auto s : snapshots) {
    db_->ReleaseSnapshot(s);
  }
}
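
// Note: with kTwoLevelIndexSearch the table builder emits a partitioned
// (two-level) index, and because the snapshots held above keep multiple
// sequence numbers of each user key alive, index entries need full internal
// keys rather than user keys. The listener asserts both effects on every
// flush (index_partitions > 1, index_key_is_user_key == 0).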

class PrefixFullBloomWithReverseComparator
    : public DBTestBase,
      public ::testing::WithParamInterface<bool> {
 public:
  PrefixFullBloomWithReverseComparator()
      : DBTestBase("prefix_bloom_reverse", /*env_do_fsync=*/true) {}
  void SetUp() override { if_cache_filter_ = GetParam(); }
  bool if_cache_filter_;
};

TEST_P(PrefixFullBloomWithReverseComparator,
       PrefixFullBloomWithReverseComparator) {
  Options options = last_options_;
  options.comparator = ReverseBytewiseComparator();
  options.prefix_extractor.reset(NewCappedPrefixTransform(3));
  options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
  BlockBasedTableOptions bbto;
  if (if_cache_filter_) {
    bbto.no_block_cache = false;
    bbto.cache_index_and_filter_blocks = true;
    bbto.block_cache = NewLRUCache(1);
  }
  bbto.filter_policy.reset(NewBloomFilterPolicy(10, false));
  bbto.whole_key_filtering = false;
  options.table_factory.reset(NewBlockBasedTableFactory(bbto));
  DestroyAndReopen(options);

  ASSERT_OK(dbfull()->Put(WriteOptions(), "bar123", "foo"));
  ASSERT_OK(dbfull()->Put(WriteOptions(), "bar234", "foo2"));
  ASSERT_OK(dbfull()->Put(WriteOptions(), "foo123", "foo3"));

  ASSERT_OK(dbfull()->Flush(FlushOptions()));

  if (bbto.block_cache) {
    bbto.block_cache->EraseUnRefEntries();
  }

  std::unique_ptr<Iterator> iter(db_->NewIterator(ReadOptions()));
  iter->Seek("bar345");
  ASSERT_OK(iter->status());
  ASSERT_TRUE(iter->Valid());
  ASSERT_EQ("bar234", iter->key().ToString());
  ASSERT_EQ("foo2", iter->value().ToString());
  iter->Next();
  ASSERT_TRUE(iter->Valid());
  ASSERT_EQ("bar123", iter->key().ToString());
  ASSERT_EQ("foo", iter->value().ToString());

  iter->Seek("foo234");
  ASSERT_OK(iter->status());
  ASSERT_TRUE(iter->Valid());
  ASSERT_EQ("foo123", iter->key().ToString());
  ASSERT_EQ("foo3", iter->value().ToString());

  iter->Seek("bar");
  ASSERT_OK(iter->status());
  ASSERT_TRUE(!iter->Valid());
}

INSTANTIATE_TEST_CASE_P(PrefixFullBloomWithReverseComparator,
                        PrefixFullBloomWithReverseComparator, testing::Bool());
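
// Note: testing::Bool() runs the parameterized test twice -- with
// if_cache_filter_ set, index and filter blocks go through a deliberately
// tiny block cache (NewLRUCache(1)); otherwise they are held by the table
// reader -- so both read paths are exercised under the reverse comparator.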

TEST_F(DBTest2, IteratorPropertyVersionNumber) {
  ASSERT_OK(Put("", ""));
  Iterator* iter1 = db_->NewIterator(ReadOptions());
  ASSERT_OK(iter1->status());
  std::string prop_value;
  ASSERT_OK(
      iter1->GetProperty("rocksdb.iterator.super-version-number", &prop_value));
  uint64_t version_number1 =
      static_cast<uint64_t>(std::atoi(prop_value.c_str()));

  ASSERT_OK(Put("", ""));
  ASSERT_OK(Flush());

  Iterator* iter2 = db_->NewIterator(ReadOptions());
  ASSERT_OK(iter2->status());
  ASSERT_OK(
      iter2->GetProperty("rocksdb.iterator.super-version-number", &prop_value));
  uint64_t version_number2 =
      static_cast<uint64_t>(std::atoi(prop_value.c_str()));

  ASSERT_GT(version_number2, version_number1);

  ASSERT_OK(Put("", ""));

  Iterator* iter3 = db_->NewIterator(ReadOptions());
  ASSERT_OK(iter3->status());
  ASSERT_OK(
      iter3->GetProperty("rocksdb.iterator.super-version-number", &prop_value));
  uint64_t version_number3 =
      static_cast<uint64_t>(std::atoi(prop_value.c_str()));

  ASSERT_EQ(version_number2, version_number3);

  iter1->SeekToFirst();
  ASSERT_OK(
      iter1->GetProperty("rocksdb.iterator.super-version-number", &prop_value));
  uint64_t version_number1_new =
      static_cast<uint64_t>(std::atoi(prop_value.c_str()));
  ASSERT_EQ(version_number1, version_number1_new);

  delete iter1;
  delete iter2;
  delete iter3;
}
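
// Note: as the assertions above show, "rocksdb.iterator.super-version-number"
// is captured at iterator creation time: a flush bumps the number seen by new
// iterators (iter2 > iter1), a plain Put alone does not (iter3 == iter2), and
// an existing iterator keeps reporting the value it was created with.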

TEST_F(DBTest2, CacheIndexAndFilterWithDBRestart) {
  Options options = CurrentOptions();
  options.create_if_missing = true;
  options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
  BlockBasedTableOptions table_options;
  table_options.cache_index_and_filter_blocks = true;
  table_options.filter_policy.reset(NewBloomFilterPolicy(20));
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));
  CreateAndReopenWithCF({"pikachu"}, options);

  ASSERT_OK(Put(1, "a", "begin"));
  ASSERT_OK(Put(1, "z", "end"));
  ASSERT_OK(Flush(1));
  ASSERT_OK(TryReopenWithColumnFamilies({"default", "pikachu"}, options));

  std::string value;
  value = Get(1, "a");
}

TEST_F(DBTest2, MaxSuccessiveMergesChangeWithDBRecovery) {
  Options options = CurrentOptions();
  options.create_if_missing = true;
  options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
  options.max_successive_merges = 3;
  options.merge_operator = MergeOperators::CreatePutOperator();
  options.disable_auto_compactions = true;
  DestroyAndReopen(options);
  ASSERT_OK(Put("poi", "Finch"));
  ASSERT_OK(db_->Merge(WriteOptions(), "poi", "Reese"));
  ASSERT_OK(db_->Merge(WriteOptions(), "poi", "Shaw"));
  ASSERT_OK(db_->Merge(WriteOptions(), "poi", "Root"));
  options.max_successive_merges = 2;
  Reopen(options);
}

class DBTestSharedWriteBufferAcrossCFs
    : public DBTestBase,
      public testing::WithParamInterface<std::tuple<bool, bool>> {
 public:
  DBTestSharedWriteBufferAcrossCFs()
      : DBTestBase("db_test_shared_write_buffer", /*env_do_fsync=*/true) {}
  void SetUp() override {
    use_old_interface_ = std::get<0>(GetParam());
    cost_cache_ = std::get<1>(GetParam());
  }
  bool use_old_interface_;
  bool cost_cache_;
};

TEST_P(DBTestSharedWriteBufferAcrossCFs, SharedWriteBufferAcrossCFs) {
  Options options = CurrentOptions();
  options.arena_block_size = 4096;
  auto flush_listener = std::make_shared<FlushCounterListener>();
  options.listeners.push_back(flush_listener);
  // Don't trip the listener at shutdown.
  options.avoid_flush_during_shutdown = true;

  // Avoid nondeterministic values from malloc_usable_size();
  // Force arena block size to 1
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "Arena::Arena:0", [&](void* arg) {
        size_t* block_size = static_cast<size_t*>(arg);
        *block_size = 1;
      });

  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "Arena::AllocateNewBlock:0", [&](void* arg) {
        std::pair<size_t*, size_t*>* pair =
            static_cast<std::pair<size_t*, size_t*>*>(arg);
        *std::get<0>(*pair) = *std::get<1>(*pair);
      });
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();

  // The total soft write buffer size is about 105000
  std::shared_ptr<Cache> cache = NewLRUCache(4 * 1024 * 1024, 2);
  ASSERT_LT(cache->GetUsage(), 256 * 1024);

  if (use_old_interface_) {
    options.db_write_buffer_size = 120000;  // this is the real limit
  } else if (!cost_cache_) {
    options.write_buffer_manager.reset(new WriteBufferManager(114285));
  } else {
    options.write_buffer_manager.reset(new WriteBufferManager(114285, cache));
  }
  options.write_buffer_size = 500000;  // this is never hit
  CreateAndReopenWithCF({"pikachu", "dobrynia", "nikitich"}, options);

  WriteOptions wo;
  wo.disableWAL = true;

  std::function<void()> wait_flush = [&]() {
    ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable(handles_[0]));
    ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable(handles_[1]));
    ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable(handles_[2]));
    ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable(handles_[3]));
    // Ensure background work is fully finished including listener callbacks
    // before accessing listener state.
    ASSERT_OK(dbfull()->TEST_WaitForBackgroundWork());
  };

  // Create some data and flush "default" and "nikitich" so that they
  // are the most recently flushed CFs.
  flush_listener->expected_flush_reason = FlushReason::kManualFlush;
  ASSERT_OK(Put(3, Key(1), DummyString(1), wo));
  ASSERT_OK(Flush(3));
  ASSERT_OK(Put(3, Key(1), DummyString(1), wo));
  ASSERT_OK(Put(0, Key(1), DummyString(1), wo));
  ASSERT_OK(Flush(0));
  ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default"),
            static_cast<uint64_t>(1));
  ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "nikitich"),
            static_cast<uint64_t>(1));

  flush_listener->expected_flush_reason = FlushReason::kWriteBufferManager;
  ASSERT_OK(Put(3, Key(1), DummyString(30000), wo));
  if (cost_cache_) {
    ASSERT_GE(cache->GetUsage(), 256 * 1024);
    ASSERT_LE(cache->GetUsage(), 2 * 256 * 1024);
  }
  wait_flush();
  ASSERT_OK(Put(0, Key(1), DummyString(60000), wo));
  if (cost_cache_) {
    ASSERT_GE(cache->GetUsage(), 256 * 1024);
    ASSERT_LE(cache->GetUsage(), 2 * 256 * 1024);
  }
  wait_flush();
  ASSERT_OK(Put(2, Key(1), DummyString(1), wo));
  // No flush should trigger
  wait_flush();
  {
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "dobrynia"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "nikitich"),
              static_cast<uint64_t>(1));
  }

  // Trigger a flush. Flushing "nikitich".
  ASSERT_OK(Put(3, Key(2), DummyString(30000), wo));
  wait_flush();
  ASSERT_OK(Put(0, Key(1), DummyString(1), wo));
  wait_flush();
  {
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "dobrynia"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "nikitich"),
              static_cast<uint64_t>(2));
  }

  // Without hitting the threshold, no flush should trigger.
  ASSERT_OK(Put(2, Key(1), DummyString(30000), wo));
  wait_flush();
  ASSERT_OK(Put(2, Key(1), DummyString(1), wo));
  wait_flush();
  ASSERT_OK(Put(2, Key(1), DummyString(1), wo));
  wait_flush();
  {
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "dobrynia"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "nikitich"),
              static_cast<uint64_t>(2));
  }

  // Hit the write buffer limit again. "default"
  // will have been flushed.
  ASSERT_OK(Put(2, Key(2), DummyString(10000), wo));
  wait_flush();
  ASSERT_OK(Put(3, Key(1), DummyString(1), wo));
  wait_flush();
  ASSERT_OK(Put(0, Key(1), DummyString(1), wo));
  wait_flush();
  ASSERT_OK(Put(0, Key(1), DummyString(1), wo));
  wait_flush();
  ASSERT_OK(Put(0, Key(1), DummyString(1), wo));
  wait_flush();
  {
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default"),
              static_cast<uint64_t>(2));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "dobrynia"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "nikitich"),
              static_cast<uint64_t>(2));
  }

  // Trigger another flush. This time "dobrynia". "pikachu" should not
  // be flushed, even though it has never been flushed.
  ASSERT_OK(Put(1, Key(1), DummyString(1), wo));
  wait_flush();
  ASSERT_OK(Put(2, Key(1), DummyString(80000), wo));
  wait_flush();
  ASSERT_OK(Put(1, Key(1), DummyString(1), wo));
  wait_flush();
  ASSERT_OK(Put(2, Key(1), DummyString(1), wo));
  wait_flush();

  {
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default"),
              static_cast<uint64_t>(2));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "dobrynia"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "nikitich"),
              static_cast<uint64_t>(2));
  }
  if (cost_cache_) {
    ASSERT_GE(cache->GetUsage(), 256 * 1024);
    Close();
    options.write_buffer_manager.reset();
    last_options_.write_buffer_manager.reset();
    ASSERT_LT(cache->GetUsage(), 256 * 1024);
  }
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
}

INSTANTIATE_TEST_CASE_P(DBTestSharedWriteBufferAcrossCFs,
                        DBTestSharedWriteBufferAcrossCFs,
                        ::testing::Values(std::make_tuple(true, false),
                                          std::make_tuple(false, false),
                                          std::make_tuple(false, true)));
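
// Note: the tuples above correspond to (use_old_interface_, cost_cache_):
// the legacy db_write_buffer_size limit, a WriteBufferManager without cache
// charging, and a WriteBufferManager that also charges memtable memory to
// the block cache.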

TEST_F(DBTest2, SharedWriteBufferLimitAcrossDB) {
  std::string dbname2 = test::PerThreadDBPath("db_shared_wb_db2");
  Options options = CurrentOptions();
  options.arena_block_size = 4096;
  auto flush_listener = std::make_shared<FlushCounterListener>();
  options.listeners.push_back(flush_listener);
  // Don't trip the listener at shutdown.
  options.avoid_flush_during_shutdown = true;
  // Avoid nondeterministic values from malloc_usable_size();
  // Force arena block size to 1
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "Arena::Arena:0", [&](void* arg) {
        size_t* block_size = static_cast<size_t*>(arg);
        *block_size = 1;
      });

  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "Arena::AllocateNewBlock:0", [&](void* arg) {
        std::pair<size_t*, size_t*>* pair =
            static_cast<std::pair<size_t*, size_t*>*>(arg);
        *std::get<0>(*pair) = *std::get<1>(*pair);
      });
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();

  options.write_buffer_size = 500000;  // this is never hit
  // Use a write buffer total size so that the soft limit is about
  // 105000.
  options.write_buffer_manager.reset(new WriteBufferManager(120000));
  CreateAndReopenWithCF({"cf1", "cf2"}, options);

  ASSERT_OK(DestroyDB(dbname2, options));
  DB* db2 = nullptr;
  ASSERT_OK(DB::Open(options, dbname2, &db2));

  WriteOptions wo;
  wo.disableWAL = true;

  std::function<void()> wait_flush = [&]() {
    ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable(handles_[0]));
    ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable(handles_[1]));
    ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable(handles_[2]));
    ASSERT_OK(static_cast<DBImpl*>(db2)->TEST_WaitForFlushMemTable());
    // Ensure background work is fully finished including listener callbacks
    // before accessing listener state.
    ASSERT_OK(dbfull()->TEST_WaitForBackgroundWork());
    ASSERT_OK(
        static_cast_with_check<DBImpl>(db2)->TEST_WaitForBackgroundWork());
  };

  // Trigger a flush on cf2
  flush_listener->expected_flush_reason = FlushReason::kWriteBufferManager;
  ASSERT_OK(Put(2, Key(1), DummyString(70000), wo));
  wait_flush();
  ASSERT_OK(Put(0, Key(1), DummyString(20000), wo));
  wait_flush();

  // Insert to DB2
  ASSERT_OK(db2->Put(wo, Key(2), DummyString(20000)));
  wait_flush();

  ASSERT_OK(Put(2, Key(1), DummyString(1), wo));
  wait_flush();
  ASSERT_OK(static_cast<DBImpl*>(db2)->TEST_WaitForFlushMemTable());
  {
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default") +
                  GetNumberOfSstFilesForColumnFamily(db_, "cf1") +
                  GetNumberOfSstFilesForColumnFamily(db_, "cf2"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db2, "default"),
              static_cast<uint64_t>(0));
  }

  // Trigger a flush of another CF in DB1.
  ASSERT_OK(db2->Put(wo, Key(2), DummyString(70000)));
  wait_flush();
  ASSERT_OK(Put(2, Key(1), DummyString(1), wo));
  wait_flush();
  {
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "cf1"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "cf2"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db2, "default"),
              static_cast<uint64_t>(0));
  }

  // Trigger a flush in DB2.
  ASSERT_OK(db2->Put(wo, Key(3), DummyString(40000)));
  wait_flush();
  ASSERT_OK(db2->Put(wo, Key(1), DummyString(1)));
  wait_flush();
  ASSERT_OK(static_cast<DBImpl*>(db2)->TEST_WaitForFlushMemTable());
  {
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "default"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "cf1"),
              static_cast<uint64_t>(0));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "cf2"),
              static_cast<uint64_t>(1));
    ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db2, "default"),
              static_cast<uint64_t>(1));
  }

  delete db2;
  ASSERT_OK(DestroyDB(dbname2, options));

  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
}

TEST_F(DBTest2, TestWriteBufferNoLimitWithCache) {
  Options options = CurrentOptions();
  options.arena_block_size = 4096;
  std::shared_ptr<Cache> cache = NewLRUCache(LRUCacheOptions(
      10000000 /* capacity */, 1 /* num_shard_bits */,
      false /* strict_capacity_limit */, 0.0 /* high_pri_pool_ratio */,
      nullptr /* memory_allocator */, kDefaultToAdaptiveMutex,
      kDontChargeCacheMetadata));

  options.write_buffer_size = 50000;  // this is never hit
  // A write buffer manager with buffer size 0 imposes no limit, but it still
  // charges memtable memory to the given cache.
  options.write_buffer_manager.reset(new WriteBufferManager(0, cache));
  Reopen(options);

  ASSERT_OK(Put("foo", "bar"));
  // One dummy entry is 256KB.
  ASSERT_GT(cache->GetUsage(), 128000);
}
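
// Note: the test above relies on the WriteBufferManager charging memtable
// memory to the supplied cache in coarse "dummy" reservations (256KB each,
// per the comment in the test), which is why a single small Put already
// pushes cache usage above 128000 even with no buffer limit configured.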

namespace {
void ValidateKeyExistence(DB* db, const std::vector<Slice>& keys_must_exist,
                          const std::vector<Slice>& keys_must_not_exist) {
  // Ensure that expected keys exist
  std::vector<std::string> values;
  if (keys_must_exist.size() > 0) {
    std::vector<Status> status_list =
        db->MultiGet(ReadOptions(), keys_must_exist, &values);
    for (size_t i = 0; i < keys_must_exist.size(); i++) {
      ASSERT_OK(status_list[i]);
    }
  }

  // Ensure that given keys don't exist
  if (keys_must_not_exist.size() > 0) {
    std::vector<Status> status_list =
        db->MultiGet(ReadOptions(), keys_must_not_exist, &values);
    for (size_t i = 0; i < keys_must_not_exist.size(); i++) {
      ASSERT_TRUE(status_list[i].IsNotFound());
    }
  }
}

}  // anonymous namespace
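
// A minimal, hypothetical usage sketch of the helper above: after reopening a
// DB whose WAL filter dropped the batch containing "key3"/"key4", a test could
// check the surviving state with
//   ValidateKeyExistence(db_, /*keys_must_exist=*/{"key1", "key2"},
//                        /*keys_must_not_exist=*/{"key3", "key4"});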

TEST_F(DBTest2, WalFilterTest) {
  class TestWalFilter : public WalFilter {
   private:
    // Processing option that is requested to be applied at the given index
    WalFilter::WalProcessingOption wal_processing_option_;
    // Index at which to apply wal_processing_option_
    // At other indexes the default WalProcessingOption::kContinueProcessing is
    // returned.
    size_t apply_option_at_record_index_;
    // Current record index, incremented with each record encountered.
    size_t current_record_index_;

   public:
    TestWalFilter(WalFilter::WalProcessingOption wal_processing_option,
                  size_t apply_option_for_record_index)
        : wal_processing_option_(wal_processing_option),
          apply_option_at_record_index_(apply_option_for_record_index),
          current_record_index_(0) {}

    WalProcessingOption LogRecord(const WriteBatch& /*batch*/,
                                  WriteBatch* /*new_batch*/,
                                  bool* /*batch_changed*/) const override {
      WalFilter::WalProcessingOption option_to_return;

      if (current_record_index_ == apply_option_at_record_index_) {
        option_to_return = wal_processing_option_;
      } else {
        option_to_return = WalProcessingOption::kContinueProcessing;
      }

      // Filter is passed as a const object for RocksDB to not modify the
      // object, however we modify it for our own purpose here and hence
      // cast the constness away.
      (const_cast<TestWalFilter*>(this)->current_record_index_)++;

      return option_to_return;
    }

    const char* Name() const override { return "TestWalFilter"; }
  };
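
  // Note: during WAL recovery RocksDB invokes LogRecord() once per WAL record
  // (each WriteBatch written below becomes one record), so
  // apply_option_for_record_index picks which batch receives the non-default
  // processing option while every other record continues normally.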
|
|
|
|
|
|
|
|
// Create 3 batches with two keys each
|
|
|
|
std::vector<std::vector<std::string>> batch_keys(3);
|
|
|
|
|
|
|
|
batch_keys[0].push_back("key1");
|
|
|
|
batch_keys[0].push_back("key2");
|
|
|
|
batch_keys[1].push_back("key3");
|
|
|
|
batch_keys[1].push_back("key4");
|
|
|
|
batch_keys[2].push_back("key5");
|
|
|
|
batch_keys[2].push_back("key6");
|
|
|
|
|
|
|
|
// Test with all WAL processing options
|
|
|
|
for (int option = 0;
|
2022-11-02 21:34:24 +00:00
|
|
|
option < static_cast<int>(
|
|
|
|
WalFilter::WalProcessingOption::kWalProcessingOptionMax);
|
|
|
|
option++) {
|
2016-03-22 19:07:15 +00:00
|
|
|
Options options = OptionsForLogIterTest();
|
|
|
|
DestroyAndReopen(options);
|
2022-11-02 21:34:24 +00:00
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Write given keys in given batches
|
|
|
|
for (size_t i = 0; i < batch_keys.size(); i++) {
|
|
|
|
WriteBatch batch;
|
|
|
|
for (size_t j = 0; j < batch_keys[i].size(); j++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(batch.Put(handles_[0], batch_keys[i][j], DummyString(1024)));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->Write(WriteOptions(), &batch));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
WalFilter::WalProcessingOption wal_processing_option =
|
2022-11-02 21:34:24 +00:00
|
|
|
static_cast<WalFilter::WalProcessingOption>(option);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Create a test filter that applies wal_processing_option at the record
|
|
|
|
// with the given index (here 1, i.e. the second record)
|
|
|
|
size_t apply_option_for_record_index = 1;
|
|
|
|
TestWalFilter test_wal_filter(wal_processing_option,
|
2022-11-02 21:34:24 +00:00
|
|
|
apply_option_for_record_index);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Reopen database with option to use WAL filter
|
|
|
|
options = OptionsForLogIterTest();
|
|
|
|
options.wal_filter = &test_wal_filter;
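// The filter is consulted once per WAL record while the WAL is replayed
// during the reopen below.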
|
|
|
|
Status status =
|
2022-11-02 21:34:24 +00:00
|
|
|
TryReopenWithColumnFamilies({"default", "pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
if (wal_processing_option ==
|
2022-11-02 21:34:24 +00:00
|
|
|
WalFilter::WalProcessingOption::kCorruptedRecord) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_NOK(status);
|
2016-03-22 19:07:15 +00:00
|
|
|
// In case of corruption we can turn off paranoid_checks to reopen
|
|
|
|
// the database
|
|
|
|
options.paranoid_checks = false;
|
2022-11-02 21:34:24 +00:00
|
|
|
ReopenWithColumnFamilies({"default", "pikachu"}, options);
|
|
|
|
} else {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(status);
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Compute which keys we expect to be found
|
|
|
|
// and which we expect not to be found after recovery.
|
|
|
|
std::vector<Slice> keys_must_exist;
|
|
|
|
std::vector<Slice> keys_must_not_exist;
|
|
|
|
switch (wal_processing_option) {
|
2022-11-02 21:34:24 +00:00
|
|
|
case WalFilter::WalProcessingOption::kCorruptedRecord:
|
|
|
|
case WalFilter::WalProcessingOption::kContinueProcessing: {
|
|
|
|
fprintf(stderr, "Testing with complete WAL processing\n");
|
|
|
|
// we expect all records to be processed
|
|
|
|
for (size_t i = 0; i < batch_keys.size(); i++) {
|
|
|
|
for (size_t j = 0; j < batch_keys[i].size(); j++) {
|
2024-01-05 19:53:57 +00:00
|
|
|
keys_must_exist.emplace_back(batch_keys[i][j]);
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
}
|
2022-11-02 21:34:24 +00:00
|
|
|
break;
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2022-11-02 21:34:24 +00:00
|
|
|
case WalFilter::WalProcessingOption::kIgnoreCurrentRecord: {
|
|
|
|
fprintf(stderr,
|
|
|
|
"Testing with ignoring record %" ROCKSDB_PRIszt " only\n",
|
|
|
|
apply_option_for_record_index);
|
|
|
|
// We expect the record at apply_option_for_record_index not to be
|
|
|
|
// found.
|
|
|
|
for (size_t i = 0; i < batch_keys.size(); i++) {
|
|
|
|
for (size_t j = 0; j < batch_keys[i].size(); j++) {
|
|
|
|
if (i == apply_option_for_record_index) {
|
2024-01-05 19:53:57 +00:00
|
|
|
keys_must_not_exist.emplace_back(batch_keys[i][j]);
|
2022-11-02 21:34:24 +00:00
|
|
|
} else {
|
2024-01-05 19:53:57 +00:00
|
|
|
keys_must_exist.emplace_back(batch_keys[i][j]);
|
2022-11-02 21:34:24 +00:00
|
|
|
}
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2022-11-02 21:34:24 +00:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case WalFilter::WalProcessingOption::kStopReplay: {
|
|
|
|
fprintf(stderr,
|
|
|
|
"Testing with stopping replay from record %" ROCKSDB_PRIszt
|
|
|
|
"\n",
|
|
|
|
apply_option_for_record_index);
|
|
|
|
// We expect records from apply_option_for_record_index onward not to be
|
|
|
|
// found.
|
|
|
|
for (size_t i = 0; i < batch_keys.size(); i++) {
|
|
|
|
for (size_t j = 0; j < batch_keys[i].size(); j++) {
|
|
|
|
if (i >= apply_option_for_record_index) {
|
2024-01-05 19:53:57 +00:00
|
|
|
keys_must_not_exist.emplace_back(batch_keys[i][j]);
|
2022-11-02 21:34:24 +00:00
|
|
|
} else {
|
2024-01-05 19:53:57 +00:00
|
|
|
keys_must_exist.emplace_back(batch_keys[i][j]);
|
2022-11-02 21:34:24 +00:00
|
|
|
}
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
}
|
2022-11-02 21:34:24 +00:00
|
|
|
break;
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2022-11-02 21:34:24 +00:00
|
|
|
default:
|
|
|
|
FAIL(); // unhandled case
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
bool checked_after_reopen = false;
|
|
|
|
|
|
|
|
while (true) {
|
|
|
|
// Ensure that expected keys exist
|
|
|
|
// and that unexpected keys don't exist after recovery
|
|
|
|
ValidateKeyExistence(db_, keys_must_exist, keys_must_not_exist);
|
|
|
|
|
|
|
|
if (checked_after_reopen) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
// reopen database again to make sure previous log(s) are not used
|
|
|
|
// (even if they were skipped)
|
|
|
|
// reopen the database with the option to use the WAL filter
|
|
|
|
options = OptionsForLogIterTest();
|
2022-11-02 21:34:24 +00:00
|
|
|
ReopenWithColumnFamilies({"default", "pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
checked_after_reopen = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
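// Verifies that a WAL filter can rewrite batches during recovery: starting
// from a given record index, each replayed batch is replaced by one that
// keeps only its first key.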
TEST_F(DBTest2, WalFilterTestWithChangeBatch) {
|
|
|
|
class ChangeBatchHandler : public WriteBatch::Handler {
|
2022-11-02 21:34:24 +00:00
|
|
|
private:
|
2016-03-22 19:07:15 +00:00
|
|
|
// Batch to insert keys in
|
|
|
|
WriteBatch* new_write_batch_;
|
|
|
|
// Number of keys to add in the new batch
|
|
|
|
size_t num_keys_to_add_in_new_batch_;
|
|
|
|
// Number of keys added to new batch
|
|
|
|
size_t num_keys_added_;
|
|
|
|
|
2022-11-02 21:34:24 +00:00
|
|
|
public:
|
2016-03-22 19:07:15 +00:00
|
|
|
ChangeBatchHandler(WriteBatch* new_write_batch,
|
2022-11-02 21:34:24 +00:00
|
|
|
size_t num_keys_to_add_in_new_batch)
|
|
|
|
: new_write_batch_(new_write_batch),
|
|
|
|
num_keys_to_add_in_new_batch_(num_keys_to_add_in_new_batch),
|
|
|
|
num_keys_added_(0) {}
|
2019-02-14 21:52:47 +00:00
|
|
|
void Put(const Slice& key, const Slice& value) override {
|
2016-03-22 19:07:15 +00:00
|
|
|
if (num_keys_added_ < num_keys_to_add_in_new_batch_) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(new_write_batch_->Put(key, value));
|
2016-03-22 19:07:15 +00:00
|
|
|
++num_keys_added_;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
class TestWalFilterWithChangeBatch : public WalFilter {
|
2022-11-02 21:34:24 +00:00
|
|
|
private:
|
2016-03-22 19:07:15 +00:00
|
|
|
// Index at which to start changing records
|
|
|
|
size_t change_records_from_index_;
|
|
|
|
// Number of keys to add in the new batch
|
|
|
|
size_t num_keys_to_add_in_new_batch_;
|
|
|
|
// Current record index, incremented with each record encountered.
|
|
|
|
size_t current_record_index_;
|
|
|
|
|
2022-11-02 21:34:24 +00:00
|
|
|
public:
|
2016-03-22 19:07:15 +00:00
|
|
|
TestWalFilterWithChangeBatch(size_t change_records_from_index,
|
2022-11-02 21:34:24 +00:00
|
|
|
size_t num_keys_to_add_in_new_batch)
|
|
|
|
: change_records_from_index_(change_records_from_index),
|
|
|
|
num_keys_to_add_in_new_batch_(num_keys_to_add_in_new_batch),
|
|
|
|
current_record_index_(0) {}
|
2016-03-22 19:07:15 +00:00
|
|
|
|
2019-02-14 21:52:47 +00:00
|
|
|
WalProcessingOption LogRecord(const WriteBatch& batch,
|
|
|
|
WriteBatch* new_batch,
|
|
|
|
bool* batch_changed) const override {
|
2016-03-22 19:07:15 +00:00
|
|
|
if (current_record_index_ >= change_records_from_index_) {
|
|
|
|
ChangeBatchHandler handler(new_batch, num_keys_to_add_in_new_batch_);
|
2021-08-16 15:09:46 +00:00
|
|
|
Status s = batch.Iterate(&handler);
|
|
|
|
if (s.ok()) {
|
|
|
|
*batch_changed = true;
|
|
|
|
} else {
|
|
|
|
assert(false);
|
|
|
|
}
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Filter is passed as a const object for RocksDB to not modify the
|
|
|
|
// object, however we modify it for our own purpose here and hence
|
|
|
|
// cast the constness away.
|
|
|
|
(const_cast<TestWalFilterWithChangeBatch*>(this)
|
2022-11-02 21:34:24 +00:00
|
|
|
->current_record_index_)++;
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
return WalProcessingOption::kContinueProcessing;
|
|
|
|
}
|
|
|
|
|
2019-02-14 21:52:47 +00:00
|
|
|
const char* Name() const override { return "TestWalFilterWithChangeBatch"; }
|
2016-03-22 19:07:15 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
std::vector<std::vector<std::string>> batch_keys(3);
|
|
|
|
|
|
|
|
batch_keys[0].push_back("key1");
|
|
|
|
batch_keys[0].push_back("key2");
|
|
|
|
batch_keys[1].push_back("key3");
|
|
|
|
batch_keys[1].push_back("key4");
|
|
|
|
batch_keys[2].push_back("key5");
|
|
|
|
batch_keys[2].push_back("key6");
|
|
|
|
|
|
|
|
Options options = OptionsForLogIterTest();
|
|
|
|
DestroyAndReopen(options);
|
2022-11-02 21:34:24 +00:00
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Write given keys in given batches
|
|
|
|
for (size_t i = 0; i < batch_keys.size(); i++) {
|
|
|
|
WriteBatch batch;
|
|
|
|
for (size_t j = 0; j < batch_keys[i].size(); j++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(batch.Put(handles_[0], batch_keys[i][j], DummyString(1024)));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->Write(WriteOptions(), &batch));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Create a test filter that changes batch records starting from the given
|
|
|
|
// record index, keeping only one key per changed batch
|
|
|
|
size_t change_records_from_index = 1;
|
|
|
|
size_t num_keys_to_add_in_new_batch = 1;
|
|
|
|
TestWalFilterWithChangeBatch test_wal_filter_with_change_batch(
|
2022-11-02 21:34:24 +00:00
|
|
|
change_records_from_index, num_keys_to_add_in_new_batch);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Reopen database with option to use WAL filter
|
|
|
|
options = OptionsForLogIterTest();
|
|
|
|
options.wal_filter = &test_wal_filter_with_change_batch;
|
2022-11-02 21:34:24 +00:00
|
|
|
ReopenWithColumnFamilies({"default", "pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Ensure that all keys exist before change_records_from_index_
|
|
|
|
// and after that index only a single key exists
|
|
|
|
// as our filter adds only a single key for each batch
|
|
|
|
std::vector<Slice> keys_must_exist;
|
|
|
|
std::vector<Slice> keys_must_not_exist;
|
|
|
|
|
|
|
|
for (size_t i = 0; i < batch_keys.size(); i++) {
|
|
|
|
for (size_t j = 0; j < batch_keys[i].size(); j++) {
|
2016-04-12 17:35:15 +00:00
|
|
|
if (i >= change_records_from_index && j >= num_keys_to_add_in_new_batch) {
|
2024-01-05 19:53:57 +00:00
|
|
|
keys_must_not_exist.emplace_back(batch_keys[i][j]);
|
2022-11-02 21:34:24 +00:00
|
|
|
} else {
|
2024-01-05 19:53:57 +00:00
|
|
|
keys_must_exist.emplace_back(batch_keys[i][j]);
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
bool checked_after_reopen = false;
|
|
|
|
|
|
|
|
while (true) {
|
|
|
|
// Ensure that expected keys exist
|
|
|
|
// and that unexpected keys don't exist after recovery
|
|
|
|
ValidateKeyExistence(db_, keys_must_exist, keys_must_not_exist);
|
|
|
|
|
|
|
|
if (checked_after_reopen) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
// reopen database again to make sure previous log(s) are not used
|
|
|
|
// (even if they were skipped)
|
|
|
|
// reopen the database with the option to use the WAL filter
|
|
|
|
options = OptionsForLogIterTest();
|
2022-11-02 21:34:24 +00:00
|
|
|
ReopenWithColumnFamilies({"default", "pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
checked_after_reopen = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
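// Verifies that recovery fails with Status::NotSupported when a WAL filter
// adds extra keys to a batch, and that the failed open attempt does not
// alter the database.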
TEST_F(DBTest2, WalFilterTestWithChangeBatchExtraKeys) {
|
|
|
|
class TestWalFilterWithChangeBatchAddExtraKeys : public WalFilter {
|
2022-11-02 21:34:24 +00:00
|
|
|
public:
|
|
|
|
WalProcessingOption LogRecord(const WriteBatch& batch,
|
|
|
|
WriteBatch* new_batch,
|
|
|
|
bool* batch_changed) const override {
|
|
|
|
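// Copy the original batch and append an extra key; adding keys in a WAL
// filter is not supported, so the reopen below is expected to fail.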
*new_batch = batch;
|
|
|
|
Status s = new_batch->Put("key_extra", "value_extra");
|
|
|
|
if (s.ok()) {
|
|
|
|
*batch_changed = true;
|
|
|
|
} else {
|
|
|
|
assert(false);
|
|
|
|
}
|
|
|
|
return WalProcessingOption::kContinueProcessing;
|
|
|
|
}
|
|
|
|
|
|
|
|
const char* Name() const override {
|
|
|
|
return "WalFilterTestWithChangeBatchExtraKeys";
|
|
|
|
}
|
2016-03-22 19:07:15 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
std::vector<std::vector<std::string>> batch_keys(3);
|
|
|
|
|
|
|
|
batch_keys[0].push_back("key1");
|
|
|
|
batch_keys[0].push_back("key2");
|
|
|
|
batch_keys[1].push_back("key3");
|
|
|
|
batch_keys[1].push_back("key4");
|
|
|
|
batch_keys[2].push_back("key5");
|
|
|
|
batch_keys[2].push_back("key6");
|
|
|
|
|
|
|
|
Options options = OptionsForLogIterTest();
|
|
|
|
DestroyAndReopen(options);
|
2022-11-02 21:34:24 +00:00
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Write given keys in given batches
|
|
|
|
for (size_t i = 0; i < batch_keys.size(); i++) {
|
|
|
|
WriteBatch batch;
|
|
|
|
for (size_t j = 0; j < batch_keys[i].size(); j++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(batch.Put(handles_[0], batch_keys[i][j], DummyString(1024)));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->Write(WriteOptions(), &batch));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Create a test filter that would add extra keys
|
|
|
|
TestWalFilterWithChangeBatchAddExtraKeys test_wal_filter_extra_keys;
|
|
|
|
|
|
|
|
// Reopen database with option to use WAL filter
|
|
|
|
options = OptionsForLogIterTest();
|
|
|
|
options.wal_filter = &test_wal_filter_extra_keys;
|
2016-04-12 17:35:15 +00:00
|
|
|
Status status = TryReopenWithColumnFamilies({"default", "pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
ASSERT_TRUE(status.IsNotSupported());
|
|
|
|
|
|
|
|
// Reopen without the filter; now the reopen should succeed, since the previous
|
|
|
|
// attempt to open must not have altered the db.
|
|
|
|
options = OptionsForLogIterTest();
|
2022-11-02 21:34:24 +00:00
|
|
|
ReopenWithColumnFamilies({"default", "pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
std::vector<Slice> keys_must_exist;
|
|
|
|
std::vector<Slice> keys_must_not_exist; // empty vector
|
|
|
|
|
|
|
|
for (size_t i = 0; i < batch_keys.size(); i++) {
|
|
|
|
for (size_t j = 0; j < batch_keys[i].size(); j++) {
|
2024-01-05 19:53:57 +00:00
|
|
|
keys_must_exist.emplace_back(batch_keys[i][j]);
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ValidateKeyExistence(db_, keys_must_exist, keys_must_not_exist);
|
|
|
|
}
|
|
|
|
|
|
|
|
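// Verifies that during recovery the WAL filter only sees records that have
// not yet been flushed for a given column family: after flushing the default
// CF, only post-flush keys are replayed for it, while the unflushed pikachu
// CF sees both the pre- and post-flush batches.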
TEST_F(DBTest2, WalFilterTestWithColumnFamilies) {
|
|
|
|
class TestWalFilterWithColumnFamilies : public WalFilter {
|
2022-11-02 21:34:24 +00:00
|
|
|
private:
|
2016-03-22 19:07:15 +00:00
|
|
|
// column_family_id -> log_number map (provided to WALFilter)
|
|
|
|
std::map<uint32_t, uint64_t> cf_log_number_map_;
|
|
|
|
// column_family_name -> column_family_id map (provided to WALFilter)
|
|
|
|
std::map<std::string, uint32_t> cf_name_id_map_;
|
|
|
|
// column_family_name -> keys_found_in_wal map
|
|
|
|
// We store keys that are applicable to the column_family
|
|
|
|
// during recovery (i.e. aren't already flushed to SST file(s))
|
|
|
|
// for verification against the keys we expect.
|
|
|
|
std::map<uint32_t, std::vector<std::string>> cf_wal_keys_;
|
2022-11-02 21:34:24 +00:00
|
|
|
|
|
|
|
public:
|
|
|
|
void ColumnFamilyLogNumberMap(
|
|
|
|
const std::map<uint32_t, uint64_t>& cf_lognumber_map,
|
|
|
|
const std::map<std::string, uint32_t>& cf_name_id_map) override {
|
|
|
|
cf_log_number_map_ = cf_lognumber_map;
|
|
|
|
cf_name_id_map_ = cf_name_id_map;
|
|
|
|
}
|
|
|
|
|
|
|
|
WalProcessingOption LogRecordFound(unsigned long long log_number,
|
|
|
|
const std::string& /*log_file_name*/,
|
|
|
|
const WriteBatch& batch,
|
|
|
|
WriteBatch* /*new_batch*/,
|
|
|
|
bool* /*batch_changed*/) override {
|
|
|
|
class LogRecordBatchHandler : public WriteBatch::Handler {
|
|
|
|
private:
|
|
|
|
const std::map<uint32_t, uint64_t>& cf_log_number_map_;
|
|
|
|
std::map<uint32_t, std::vector<std::string>>& cf_wal_keys_;
|
2016-03-22 19:07:15 +00:00
|
|
|
unsigned long long log_number_;
|
2022-11-02 21:34:24 +00:00
|
|
|
|
|
|
|
public:
|
|
|
|
LogRecordBatchHandler(
|
|
|
|
unsigned long long current_log_number,
|
|
|
|
const std::map<uint32_t, uint64_t>& cf_log_number_map,
|
|
|
|
std::map<uint32_t, std::vector<std::string>>& cf_wal_keys)
|
|
|
|
: cf_log_number_map_(cf_log_number_map),
|
|
|
|
cf_wal_keys_(cf_wal_keys),
|
|
|
|
log_number_(current_log_number) {}
|
2016-03-22 19:07:15 +00:00
|
|
|
|
2019-02-14 21:52:47 +00:00
|
|
|
Status PutCF(uint32_t column_family_id, const Slice& key,
|
|
|
|
const Slice& /*value*/) override {
|
2016-03-22 19:07:15 +00:00
|
|
|
auto it = cf_log_number_map_.find(column_family_id);
|
|
|
|
assert(it != cf_log_number_map_.end());
|
|
|
|
unsigned long long log_number_for_cf = it->second;
|
|
|
|
// If the current record is applicable for column_family_id
|
|
|
|
// (i.e. isn't flushed to SST file(s) for column_family_id)
|
|
|
|
// add it to the cf_wal_keys_ map for verification.
|
|
|
|
if (log_number_ >= log_number_for_cf) {
|
2022-11-02 21:34:24 +00:00
|
|
|
cf_wal_keys_[column_family_id].push_back(
|
|
|
|
std::string(key.data(), key.size()));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
} handler(log_number, cf_log_number_map_, cf_wal_keys_);
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
Status s = batch.Iterate(&handler);
|
|
|
|
if (!s.ok()) {
|
|
|
|
// TODO(AR) is this ok?
|
|
|
|
return WalProcessingOption::kCorruptedRecord;
|
|
|
|
}
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
return WalProcessingOption::kContinueProcessing;
|
2022-11-02 21:34:24 +00:00
|
|
|
}
|
2016-03-22 19:07:15 +00:00
|
|
|
|
2022-11-02 21:34:24 +00:00
|
|
|
const char* Name() const override {
|
|
|
|
return "WalFilterTestWithColumnFamilies";
|
|
|
|
}
|
2016-03-22 19:07:15 +00:00
|
|
|
|
2016-04-12 17:35:15 +00:00
|
|
|
const std::map<uint32_t, std::vector<std::string>>& GetColumnFamilyKeys() {
|
2016-03-22 19:07:15 +00:00
|
|
|
return cf_wal_keys_;
|
|
|
|
}
|
|
|
|
|
2022-11-02 21:34:24 +00:00
|
|
|
const std::map<std::string, uint32_t>& GetColumnFamilyNameIdMap() {
|
2016-03-22 19:07:15 +00:00
|
|
|
return cf_name_id_map_;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
std::vector<std::vector<std::string>> batch_keys_pre_flush(3);
|
|
|
|
|
|
|
|
batch_keys_pre_flush[0].push_back("key1");
|
|
|
|
batch_keys_pre_flush[0].push_back("key2");
|
|
|
|
batch_keys_pre_flush[1].push_back("key3");
|
|
|
|
batch_keys_pre_flush[1].push_back("key4");
|
|
|
|
batch_keys_pre_flush[2].push_back("key5");
|
|
|
|
batch_keys_pre_flush[2].push_back("key6");
|
|
|
|
|
|
|
|
Options options = OptionsForLogIterTest();
|
|
|
|
DestroyAndReopen(options);
|
2022-11-02 21:34:24 +00:00
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Write given keys in given batches
|
|
|
|
for (size_t i = 0; i < batch_keys_pre_flush.size(); i++) {
|
|
|
|
WriteBatch batch;
|
|
|
|
for (size_t j = 0; j < batch_keys_pre_flush[i].size(); j++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(batch.Put(handles_[0], batch_keys_pre_flush[i][j],
|
|
|
|
DummyString(1024)));
|
|
|
|
ASSERT_OK(batch.Put(handles_[1], batch_keys_pre_flush[i][j],
|
|
|
|
DummyString(1024)));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->Write(WriteOptions(), &batch));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
|
2022-11-02 21:34:24 +00:00
|
|
|
// Flush default column-family
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(db_->Flush(FlushOptions(), handles_[0]));
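// pikachu is intentionally left unflushed so its keys stay in the WAL.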
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
// Do some more writes
|
|
|
|
std::vector<std::vector<std::string>> batch_keys_post_flush(3);
|
|
|
|
|
|
|
|
batch_keys_post_flush[0].push_back("key7");
|
|
|
|
batch_keys_post_flush[0].push_back("key8");
|
|
|
|
batch_keys_post_flush[1].push_back("key9");
|
|
|
|
batch_keys_post_flush[1].push_back("key10");
|
|
|
|
batch_keys_post_flush[2].push_back("key11");
|
|
|
|
batch_keys_post_flush[2].push_back("key12");
|
|
|
|
|
|
|
|
// Write given keys in given batches
|
|
|
|
for (size_t i = 0; i < batch_keys_post_flush.size(); i++) {
|
|
|
|
WriteBatch batch;
|
|
|
|
for (size_t j = 0; j < batch_keys_post_flush[i].size(); j++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(batch.Put(handles_[0], batch_keys_post_flush[i][j],
|
|
|
|
DummyString(1024)));
|
|
|
|
ASSERT_OK(batch.Put(handles_[1], batch_keys_post_flush[i][j],
|
|
|
|
DummyString(1024)));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->Write(WriteOptions(), &batch));
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// On recovery we should only find the post-flush batches applicable to the
|
|
|
|
// default CF, but both pre- and post-flush batches applicable to the pikachu CF
|
|
|
|
|
|
|
|
// Create a test filter that would add extra keys
|
|
|
|
TestWalFilterWithColumnFamilies test_wal_filter_column_families;
|
|
|
|
|
|
|
|
// Reopen database with option to use WAL filter
|
|
|
|
options = OptionsForLogIterTest();
|
|
|
|
options.wal_filter = &test_wal_filter_column_families;
|
2022-11-02 21:34:24 +00:00
|
|
|
Status status = TryReopenWithColumnFamilies({"default", "pikachu"}, options);
|
2016-03-22 19:07:15 +00:00
|
|
|
ASSERT_TRUE(status.ok());
|
|
|
|
|
|
|
|
// verify that handles_[0] only has post_flush keys
|
|
|
|
// while handles_[1] has pre and post flush keys
|
|
|
|
auto cf_wal_keys = test_wal_filter_column_families.GetColumnFamilyKeys();
|
2016-04-12 17:35:15 +00:00
|
|
|
auto name_id_map = test_wal_filter_column_families.GetColumnFamilyNameIdMap();
|
2016-03-22 19:07:15 +00:00
|
|
|
size_t index = 0;
|
|
|
|
auto keys_cf = cf_wal_keys[name_id_map[kDefaultColumnFamilyName]];
|
2022-11-02 21:34:24 +00:00
|
|
|
// default column-family, only post_flush keys are expected
|
2016-03-22 19:07:15 +00:00
|
|
|
for (size_t i = 0; i < batch_keys_post_flush.size(); i++) {
|
|
|
|
for (size_t j = 0; j < batch_keys_post_flush[i].size(); j++) {
|
|
|
|
Slice key_from_the_log(keys_cf[index++]);
|
|
|
|
Slice batch_key(batch_keys_post_flush[i][j]);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_EQ(key_from_the_log.compare(batch_key), 0);
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_EQ(index, keys_cf.size());
|
2016-03-22 19:07:15 +00:00
|
|
|
|
|
|
|
index = 0;
|
|
|
|
keys_cf = cf_wal_keys[name_id_map["pikachu"]];
|
2022-11-02 21:34:24 +00:00
|
|
|
// pikachu column-family, all keys are expected
|
2016-03-22 19:07:15 +00:00
|
|
|
for (size_t i = 0; i < batch_keys_pre_flush.size(); i++) {
|
|
|
|
for (size_t j = 0; j < batch_keys_pre_flush[i].size(); j++) {
|
|
|
|
Slice key_from_the_log(keys_cf[index++]);
|
|
|
|
Slice batch_key(batch_keys_pre_flush[i][j]);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_EQ(key_from_the_log.compare(batch_key), 0);
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
for (size_t i = 0; i < batch_keys_post_flush.size(); i++) {
|
|
|
|
for (size_t j = 0; j < batch_keys_post_flush[i].size(); j++) {
|
|
|
|
Slice key_from_the_log(keys_cf[index++]);
|
|
|
|
Slice batch_key(batch_keys_post_flush[i][j]);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_EQ(key_from_the_log.compare(batch_key), 0);
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_EQ(index, keys_cf.size());
|
2016-03-22 19:07:15 +00:00
|
|
|
}
|
2016-04-28 22:11:28 +00:00
|
|
|
|
2019-05-30 23:07:57 +00:00
|
|
|
TEST_F(DBTest2, PresetCompressionDict) {
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-12 03:42:25 +00:00
|
|
|
// Verifies that compression ratio improves when dictionary is enabled, and
|
|
|
|
// improves even further when the dictionary is trained by ZSTD.
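// Dictionary compression is toggled per iteration below via
// CompressionOptions: compression_opts.max_dict_bytes enables a preset
// dictionary and compression_opts.zstd_max_train_bytes enables ZSTD
// training of that dictionary.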
|
2016-04-28 22:11:28 +00:00
|
|
|
const size_t kBlockSizeBytes = 4 << 10;
|
|
|
|
const size_t kL0FileBytes = 128 << 10;
|
|
|
|
const size_t kApproxPerBlockOverheadBytes = 50;
|
|
|
|
const int kNumL0Files = 5;
|
|
|
|
|
|
|
|
Options options;
|
2019-05-30 23:07:57 +00:00
|
|
|
// Make sure to use any custom env that the test is configured with.
|
|
|
|
options.env = CurrentOptions().env;
|
2016-11-16 17:24:52 +00:00
|
|
|
options.allow_concurrent_memtable_write = false;
|
2016-04-28 22:11:28 +00:00
|
|
|
options.arena_block_size = kBlockSizeBytes;
|
|
|
|
options.create_if_missing = true;
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.level0_file_num_compaction_trigger = kNumL0Files;
|
|
|
|
options.memtable_factory.reset(
|
2021-09-08 14:45:59 +00:00
|
|
|
test::NewSpecialSkipListFactory(kL0FileBytes / kBlockSizeBytes));
|
2016-04-28 22:11:28 +00:00
|
|
|
options.num_levels = 2;
|
|
|
|
options.target_file_size_base = kL0FileBytes;
|
|
|
|
options.target_file_size_multiplier = 2;
|
|
|
|
options.write_buffer_size = kL0FileBytes;
|
|
|
|
BlockBasedTableOptions table_options;
|
|
|
|
table_options.block_size = kBlockSizeBytes;
|
|
|
|
std::vector<CompressionType> compression_types;
|
|
|
|
if (Zlib_Supported()) {
|
|
|
|
compression_types.push_back(kZlibCompression);
|
|
|
|
}
|
|
|
|
#if LZ4_VERSION_NUMBER >= 10400 // r124+
|
|
|
|
compression_types.push_back(kLZ4Compression);
|
|
|
|
compression_types.push_back(kLZ4HCCompression);
|
2022-11-02 21:34:24 +00:00
|
|
|
#endif // LZ4_VERSION_NUMBER >= 10400
|
2016-09-01 22:28:40 +00:00
|
|
|
if (ZSTD_Supported()) {
|
|
|
|
compression_types.push_back(kZSTD);
|
|
|
|
}
|
2016-04-28 22:11:28 +00:00
|
|
|
|
2019-05-30 23:07:57 +00:00
|
|
|
enum DictionaryTypes : int {
|
|
|
|
kWithoutDict,
|
|
|
|
kWithDict,
|
Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857)
Summary:
An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API, ZDICT_finalizeDictionary(), can improve such a dictionary's effectiveness at low cost. This PR changes how dictionary is created by calling the ZSTD ZDICT_finalizeDictionary() API instead of creating raw content dictionary (when max_dict_buffer_bytes > 0), and pass in all buffered uncompressed data blocks as samples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
Test Plan:
#### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
Set up: change the parameter [here](https://github.com/facebook/rocksdb/blob/fb9a167a55e0970b1ef6f67c1600c8d9c4c6114f/tools/db_bench_tool.cc#L1766) to 16384 to make synthetic data more compressible.
```
# linked local ZSTD with version 1.5.2
# DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1 EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench
dict_bytes=16384
train_bytes=1048576
echo "========== No Dictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total
echo "========== Raw Content Dictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total
echo "========== FinalizeDictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total
echo "========== TrainDictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total
# Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
# before compression data size: 1.2GB
dict_bytes=16384
max_dict_buffer_bytes = 1048576
space cpu/memory
No Dictionary 468M 14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
Raw Dictionary 251M 15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
FinalizeDictionary 236M 11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
TrainDictionary 84M 7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
```
#### Benchmark on 10 sample SST files for spacing saving and CPU time on compression:
FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
```
dict_bytes=16384
train_bytes=1048576
for sst_file in `ls ../temp/myrock-sst/`
do
echo "********** $sst_file **********"
echo "========== No Dictionary =========="
./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD
echo "========== Raw Content Dictionary =========="
./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes
echo "========== FinalizeDictionary =========="
./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict
echo "========== TrainDictionary =========="
./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
done
010240.sst (Size/Time) 011029.sst 013184.sst 021552.sst 185054.sst 185137.sst 191666.sst 7560381.sst 7604174.sst 7635312.sst
No Dictionary 28165569 / 2614419 32899411 / 2976832 32977848 / 3055542 31966329 / 2004590 33614351 / 1755877 33429029 / 1717042 33611933 / 1776936 33634045 / 2771417 33789721 / 2205414 33592194 / 388254
Raw Content Dictionary 28019950 / 2697961 33748665 / 3572422 33896373 / 3534701 26418431 / 2259658 28560825 / 1839168 28455030 / 1846039 28494319 / 1861349 32391599 / 3095649 33772142 / 2407843 33592230 / 474523
FinalizeDictionary 27896012 / 2650029 33763886 / 3719427 33904283 / 3552793 26008225 / 2198033 28111872 / 1869530 28014374 / 1789771 28047706 / 1848300 32296254 / 3204027 33698698 / 2381468 33592344 / 517433
TrainDictionary 28046089 / 2740037 33706480 / 3679019 33885741 / 3629351 25087123 / 2204558 27194353 / 1970207 27234229 / 1896811 27166710 / 1903119 32011041 / 3322315 32730692 / 2406146 33608631 / 570593
```
#### Decompression/Read test:
With FinalizeDictionary/TrainDictionary, some data structure used for decompression are in stored in dictionary, so they are expected to be faster in terms of decompression/reads.
```
dict_bytes=16384
train_bytes=1048576
echo "No Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s
echo "Raw Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s
echo "FinalizeDict"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s
echo "Train Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s
No Dictionary
readrandom : 12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations; 9.1 MB/s (1000000 of 1000000 found)
Raw Dictionary
readrandom : 12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found)
FinalizeDict
readrandom : 9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations; 11.3 MB/s (1000000 of 1000000 found)
Train Dictionary
readrandom : 9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations; 11.4 MB/s (1000000 of 1000000 found)
```
Reviewed By: ajkr
Differential Revision: D35720026
Pulled By: cbi42
fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
2022-05-20 19:09:09 +00:00
|
|
|
kWithZSTDfinalizeDict,
|
2019-05-30 23:07:57 +00:00
|
|
|
kWithZSTDTrainedDict,
|
|
|
|
kDictEnd,
|
|
|
|
};
|
|
|
|
|
2016-04-28 22:11:28 +00:00
|
|
|
for (auto compression_type : compression_types) {
|
|
|
|
options.compression = compression_type;
|
2019-05-30 23:07:57 +00:00
|
|
|
size_t bytes_without_dict = 0;
|
|
|
|
size_t bytes_with_dict = 0;
|
2022-05-20 19:09:09 +00:00
|
|
|
size_t bytes_with_zstd_finalize_dict = 0;
|
2019-05-30 23:07:57 +00:00
|
|
|
size_t bytes_with_zstd_trained_dict = 0;
|
|
|
|
for (int i = kWithoutDict; i < kDictEnd; i++) {
|
2016-04-28 22:11:28 +00:00
|
|
|
// First iteration: compress without preset dictionary
|
|
|
|
// Second iteration: compress with preset dictionary
|
2017-11-03 05:46:13 +00:00
|
|
|
// Third iteration (zstd only): compress with a dictionary finalized by zstd
// Fourth iteration (zstd only): compress with a zstd-trained dictionary
|
|
|
|
//
|
|
|
|
// To make sure the compression dictionary has the intended effect, we
|
|
|
|
// verify the compressed size is smaller in successive iterations. Also in
|
|
|
|
// the non-first iterations, verify the data we get out is the same data
|
|
|
|
// we put in.
|
|
|
|
switch (i) {
|
2019-05-30 23:07:57 +00:00
|
|
|
case kWithoutDict:
|
2017-11-03 05:46:13 +00:00
|
|
|
options.compression_opts.max_dict_bytes = 0;
|
|
|
|
options.compression_opts.zstd_max_train_bytes = 0;
|
|
|
|
break;
|
2019-05-30 23:07:57 +00:00
|
|
|
case kWithDict:
|
|
|
|
options.compression_opts.max_dict_bytes = kBlockSizeBytes;
|
2017-11-03 05:46:13 +00:00
|
|
|
options.compression_opts.zstd_max_train_bytes = 0;
|
|
|
|
break;
|
2022-05-20 19:09:09 +00:00
|
|
|
case kWithZSTDfinalizeDict:
|
2022-05-24 22:44:49 +00:00
|
|
|
if (compression_type != kZSTD ||
|
|
|
|
!ZSTD_FinalizeDictionarySupported()) {
|
2022-05-20 19:09:09 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
options.compression_opts.max_dict_bytes = kBlockSizeBytes;
|
|
|
|
options.compression_opts.zstd_max_train_bytes = kL0FileBytes;
|
|
|
|
options.compression_opts.use_zstd_dict_trainer = false;
|
|
|
|
break;
|
2019-05-30 23:07:57 +00:00
|
|
|
case kWithZSTDTrainedDict:
|
2022-05-24 22:44:49 +00:00
|
|
|
if (compression_type != kZSTD || !ZSTD_TrainDictionarySupported()) {
|
2017-11-03 05:46:13 +00:00
|
|
|
continue;
|
|
|
|
}
|
2019-05-30 23:07:57 +00:00
|
|
|
options.compression_opts.max_dict_bytes = kBlockSizeBytes;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction: the first output SST trained the dictionary, which was then applied to the subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference in compression ratio.
So, this PR changes the compression dictionary to be scoped per-SST, accepting the tradeoff of using more memory and CPU during table building. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-12 03:42:25 +00:00
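To make the kBuffered/kUnbuffered flow above concrete, here is a heavily simplified sketch; only the state names and EnterUnbuffered() come from this summary, while the class and helper functions are placeholders rather than the actual BlockBasedTableBuilder code:
```
// Simplified sketch of the buffering flow; not the real BlockBasedTableBuilder.
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class BufferedBuilderSketch {
 public:
  explicit BufferedBuilderSketch(size_t buffer_limit) : limit_(buffer_limit) {}

  // Analogous to Add() while dictionary compression is enabled: buffer
  // uncompressed blocks in memory until the target size is reached.
  void Add(std::string block) {
    if (state_ == State::kBuffered) {
      buffered_bytes_ += block.size();
      blocks_.push_back(std::move(block));
      if (buffered_bytes_ >= limit_) EnterUnbuffered();
    } else {
      CompressAndWrite(block);  // dictionary exists; write blocks immediately
    }
  }

  // Finish() also forces the transition if the limit was never reached.
  void Finish() {
    if (state_ == State::kBuffered) EnterUnbuffered();
  }

 private:
  enum class State { kBuffered, kUnbuffered };

  void EnterUnbuffered() {
    // Sample whole buffered blocks, build the dictionary, then compress and
    // write out everything that was held in memory.
    TrainOrFinalizeDictionary(blocks_);
    for (const std::string& b : blocks_) CompressAndWrite(b);
    blocks_.clear();
    buffered_bytes_ = 0;
    state_ = State::kUnbuffered;
  }

  // Placeholders for the real compression machinery.
  void TrainOrFinalizeDictionary(const std::vector<std::string>&) {}
  void CompressAndWrite(const std::string&) {}

  State state_ = State::kBuffered;
  size_t buffered_bytes_ = 0;
  size_t limit_;
  std::vector<std::string> blocks_;
};
```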
|
|
|
options.compression_opts.zstd_max_train_bytes = kL0FileBytes;
|
2022-05-20 19:09:09 +00:00
|
|
|
options.compression_opts.use_zstd_dict_trainer = true;
|
2017-11-03 05:46:13 +00:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
assert(false);
|
2016-04-28 22:11:28 +00:00
|
|
|
}
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
|
2016-04-28 22:11:28 +00:00
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
|
|
|
Random rnd(301);
|
2019-02-12 03:42:25 +00:00
|
|
|
std::string seq_datas[10];
|
|
|
|
for (int j = 0; j < 10; ++j) {
|
|
|
|
seq_datas[j] =
|
2020-07-09 21:33:42 +00:00
|
|
|
rnd.RandomString(kBlockSizeBytes - kApproxPerBlockOverheadBytes);
|
2019-02-12 03:42:25 +00:00
|
|
|
}
|
2016-04-28 22:11:28 +00:00
|
|
|
|
|
|
|
ASSERT_EQ(0, NumTableFilesAtLevel(0, 1));
|
|
|
|
for (int j = 0; j < kNumL0Files; ++j) {
|
|
|
|
for (size_t k = 0; k < kL0FileBytes / kBlockSizeBytes + 1; ++k) {
|
2019-02-12 03:42:25 +00:00
|
|
|
auto key_num = j * (kL0FileBytes / kBlockSizeBytes) + k;
|
|
|
|
ASSERT_OK(Put(1, Key(static_cast<int>(key_num)),
|
|
|
|
seq_datas[(key_num / 10) % 10]));
|
2016-04-28 22:11:28 +00:00
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable(handles_[1]));
|
2016-04-28 22:11:28 +00:00
|
|
|
ASSERT_EQ(j + 1, NumTableFilesAtLevel(0, 1));
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_CompactRange(0, nullptr, nullptr, handles_[1],
|
|
|
|
true /* disallow_trivial_move */));
|
2016-04-28 22:11:28 +00:00
|
|
|
ASSERT_EQ(0, NumTableFilesAtLevel(0, 1));
|
|
|
|
ASSERT_GT(NumTableFilesAtLevel(1, 1), 0);
|
|
|
|
|
2019-05-30 23:07:57 +00:00
|
|
|
// Get the live sst files size
|
|
|
|
size_t total_sst_bytes = TotalSize(1);
|
|
|
|
if (i == kWithoutDict) {
|
|
|
|
bytes_without_dict = total_sst_bytes;
|
|
|
|
} else if (i == kWithDict) {
|
|
|
|
bytes_with_dict = total_sst_bytes;
|
2022-05-20 19:09:09 +00:00
|
|
|
} else if (i == kWithZSTDfinalizeDict) {
|
|
|
|
bytes_with_zstd_finalize_dict = total_sst_bytes;
|
2019-05-30 23:07:57 +00:00
|
|
|
} else if (i == kWithZSTDTrainedDict) {
|
|
|
|
bytes_with_zstd_trained_dict = total_sst_bytes;
|
2016-04-28 22:11:28 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
for (size_t j = 0; j < kNumL0Files * (kL0FileBytes / kBlockSizeBytes);
|
|
|
|
j++) {
|
2019-02-12 03:42:25 +00:00
|
|
|
ASSERT_EQ(seq_datas[(j / 10) % 10], Get(1, Key(static_cast<int>(j))));
|
2016-04-28 22:11:28 +00:00
|
|
|
}
|
2019-05-30 23:07:57 +00:00
|
|
|
if (i == kWithDict) {
|
|
|
|
ASSERT_GT(bytes_without_dict, bytes_with_dict);
|
2022-05-20 19:09:09 +00:00
|
|
|
} else if (i == kWithZSTDfinalizeDict) {
|
|
|
|
// In zstd compression, it is sometimes possible that using a finalized
|
|
|
|
// dictionary does not get as good a compression ratio as raw content
|
|
|
|
// dictionary. But using a dictionary should always get better
|
|
|
|
// compression ratio than not using one.
|
|
|
|
ASSERT_TRUE(bytes_with_dict > bytes_with_zstd_finalize_dict ||
|
|
|
|
bytes_without_dict > bytes_with_zstd_finalize_dict);
|
2019-05-30 23:07:57 +00:00
|
|
|
} else if (i == kWithZSTDTrainedDict) {
|
|
|
|
// In zstd compression, it is sometimes possible that using a trained
|
|
|
|
// dictionary does not get as good a compression ratio as without
|
|
|
|
// training.
|
|
|
|
// But using a dictionary (with or without training) should always get
|
|
|
|
// better compression ratio than not using one.
|
|
|
|
ASSERT_TRUE(bytes_with_dict > bytes_with_zstd_trained_dict ||
|
|
|
|
bytes_without_dict > bytes_with_zstd_trained_dict);
|
2016-04-28 22:11:28 +00:00
|
|
|
}
|
2019-05-30 23:07:57 +00:00
|
|
|
|
2016-04-28 22:11:28 +00:00
|
|
|
DestroyAndReopen(options);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block-based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when an L0 block-based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, pinning them, i.e. the LRU cache will never evict them. Subsequent accesses in the table reader then bypass the block cache entirely, avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 17:42:39 +00:00
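For reference, the feature boils down to two BlockBasedTableOptions flags. The following is only a minimal sketch assuming the public rocksdb/options.h and rocksdb/table.h headers, not code taken from this change:
```
// Sketch: cache index/filter blocks and pin them for L0 files.
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakePinnedL0Options() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.cache_index_and_filter_blocks = true;
  // Only meaningful when index/filter blocks go through the block cache.
  table_options.pin_l0_filter_and_index_blocks_in_cache = true;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```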
|
|
|
|
2019-02-12 03:42:25 +00:00
|
|
|
TEST_F(DBTest2, PresetCompressionDictLocality) {
|
|
|
|
if (!ZSTD_Supported()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
// Verifies that compression dictionary is generated from local data. The
|
|
|
|
// verification simply checks all output SSTs have different compression
|
|
|
|
// dictionaries. We do not verify effectiveness as that'd likely be flaky in
|
|
|
|
// the future.
|
|
|
|
const int kNumEntriesPerFile = 1 << 10;  // 1K entries
|
|
|
|
const int kNumBytesPerEntry = 1 << 10; // 1KB
|
|
|
|
const int kNumFiles = 4;
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.compression = kZSTD;
|
|
|
|
options.compression_opts.max_dict_bytes = 1 << 14; // 16KB
|
|
|
|
options.compression_opts.zstd_max_train_bytes = 1 << 18; // 256KB
|
2020-02-20 20:07:53 +00:00
|
|
|
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
|
2019-02-12 03:42:25 +00:00
|
|
|
options.target_file_size_base = kNumEntriesPerFile * kNumBytesPerEntry;
|
|
|
|
BlockBasedTableOptions table_options;
|
|
|
|
table_options.cache_index_and_filter_blocks = true;
|
Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 17:31:34 +00:00
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-12 03:42:25 +00:00
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
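  // With dictionary compression enabled, the table builder buffers whole
  // uncompressed data blocks for each output file (up to roughly
  // `target_file_size_base`, 1MB here), trains a ZSTD dictionary from those
  // blocks, and only then compresses and writes them out. Each SST therefore
  // gets a dictionary trained solely on its own data, which is what the
  // distinctness check at the end of this test relies on.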
  Random rnd(301);
  for (int i = 0; i < kNumFiles; ++i) {
    for (int j = 0; j < kNumEntriesPerFile; ++j) {
      ASSERT_OK(Put(Key(i * kNumEntriesPerFile + j),
                    rnd.RandomString(kNumBytesPerEntry)));
    }
    ASSERT_OK(Flush());
    MoveFilesToLevel(1);
    ASSERT_EQ(NumTableFilesAtLevel(1), i + 1);
  }

  // Store all the dictionaries generated during a full compaction.
  std::vector<std::string> compression_dicts;
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "BlockBasedTableBuilder::WriteCompressionDictBlock:RawDict",
      [&](void* arg) {
        compression_dicts.emplace_back(static_cast<Slice*>(arg)->ToString());
      });
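  // The "RawDict" sync point above passes the raw dictionary contents as a
  // `Slice*`, so the strings captured in `compression_dicts` can be compared
  // byte-for-byte across output files below.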
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
  CompactRangeOptions compact_range_opts;
  compact_range_opts.bottommost_level_compaction =
      BottommostLevelCompaction::kForceOptimized;
  ASSERT_OK(db_->CompactRange(compact_range_opts, nullptr, nullptr));

  // Dictionary compression should not be so good as to compress four totally
  // random files into one. If it does then there's probably something wrong
  // with the test.
  ASSERT_GT(NumTableFilesAtLevel(1), 1);

  // Furthermore, there should be one compression dictionary generated per file.
  // And they should all be different from each other.
  ASSERT_EQ(NumTableFilesAtLevel(1),
            static_cast<int>(compression_dicts.size()));
  for (size_t i = 1; i < compression_dicts.size(); ++i) {
    std::string& a = compression_dicts[i - 1];
    std::string& b = compression_dicts[i];
    size_t alen = a.size();
    size_t blen = b.size();
    ASSERT_TRUE(alen != blen || memcmp(a.data(), b.data(), alen) != 0);
  }
}
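
// For reference, an application enabling per-SST ZSTD dictionary compression
// would typically set options like the ones used in the test above, e.g.
//
//   Options opts;
//   opts.compression = kZSTD;
//   opts.compression_opts.max_dict_bytes = 16 << 10;         // dictionary cap
//   opts.compression_opts.zstd_max_train_bytes = 256 << 10;  // training budget
//
// The exact values are workload-dependent; those here simply mirror this test.
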
class PresetCompressionDictTest
    : public DBTestBase,
      public testing::WithParamInterface<std::tuple<CompressionType, bool>> {
 public:
  PresetCompressionDictTest()
      : DBTestBase("db_test2", false /* env_do_fsync */),
        compression_type_(std::get<0>(GetParam())),
        bottommost_(std::get<1>(GetParam())) {}

 protected:
  const CompressionType compression_type_;
  const bool bottommost_;
};

INSTANTIATE_TEST_CASE_P(
    DBTest2, PresetCompressionDictTest,
    ::testing::Combine(::testing::ValuesIn(GetSupportedDictCompressions()),
                       ::testing::Bool()));

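// Each TEST_P below runs for every dictionary-capable compression type
// returned by GetSupportedDictCompressions(), crossed with `bottommost_`:
// when true, dictionary settings are applied via `bottommost_compression_opts`;
// otherwise via `compression_opts`.
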
TEST_P(PresetCompressionDictTest, Flush) {
  // Verifies that dictionary is generated and written during flush only when
  // `ColumnFamilyOptions::compression` enables dictionary. Also verifies the
  // size of the dictionary is within expectations according to the limit on
  // buffering set by `CompressionOptions::max_dict_buffer_bytes`.
  const size_t kValueLen = 256;
  const size_t kKeysPerFile = 1 << 10;
  const size_t kDictLen = 16 << 10;
  const size_t kBlockLen = 4 << 10;

  Options options = CurrentOptions();
  if (bottommost_) {
    options.bottommost_compression = compression_type_;
    options.bottommost_compression_opts.enabled = true;
    options.bottommost_compression_opts.max_dict_bytes = kDictLen;
    options.bottommost_compression_opts.max_dict_buffer_bytes = kBlockLen;
  } else {
    options.compression = compression_type_;
    options.compression_opts.max_dict_bytes = kDictLen;
    options.compression_opts.max_dict_buffer_bytes = kBlockLen;
  }
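  // `max_dict_buffer_bytes` caps how much data is buffered before the builder
  // trains the dictionary and switches to writing blocks out directly. The cap
  // is only checked after each block is built (and keys are buffered in
  // addition to data blocks), so the assertions below tolerate a dictionary
  // covering up to two blocks rather than exactly one.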
  options.memtable_factory.reset(test::NewSpecialSkipListFactory(kKeysPerFile));
  options.statistics = CreateDBStatistics();
  BlockBasedTableOptions bbto;
  bbto.block_size = kBlockLen;
  bbto.cache_index_and_filter_blocks = true;
  options.table_factory.reset(NewBlockBasedTableFactory(bbto));
  Reopen(options);

  Random rnd(301);
  for (size_t i = 0; i <= kKeysPerFile; ++i) {
    ASSERT_OK(Put(Key(static_cast<int>(i)), rnd.RandomString(kValueLen)));
  }
  ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());

  // We can use `BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT` to detect whether a
  // compression dictionary exists since dictionaries would be preloaded when
  // the flush finishes.
  if (bottommost_) {
    // Flush is never considered bottommost. This should change in the future
    // since flushed files may have nothing underneath them, like the one in
    // this test case.
    ASSERT_EQ(
        TestGetTickerCount(options, BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT),
        0);
  } else {
    ASSERT_GT(
        TestGetTickerCount(options, BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT),
        0);
    // TODO(ajkr): fix the below assertion to work with ZSTD. The expectation on
    // number of bytes needs to be adjusted in case the cached block is in
    // ZSTD's digested dictionary format.
    if (compression_type_ != kZSTD &&
        compression_type_ != kZSTDNotFinalCompression) {
      // Although we limited buffering to `kBlockLen`, there may be up to two
      // blocks of data included in the dictionary since we only check limit
      // after each block is built.
      ASSERT_LE(TestGetTickerCount(options,
                                   BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT),
                2 * kBlockLen);
    }
  }
}

TEST_P(PresetCompressionDictTest, CompactNonBottommost) {
  // Verifies that dictionary is generated and written during compaction to
  // non-bottommost level only when `ColumnFamilyOptions::compression` enables
  // dictionary. Also verifies the size of the dictionary is within expectations
  // according to the limit on buffering set by
  // `CompressionOptions::max_dict_buffer_bytes`.
  const size_t kValueLen = 256;
  const size_t kKeysPerFile = 1 << 10;
  const size_t kDictLen = 16 << 10;
  const size_t kBlockLen = 4 << 10;

  Options options = CurrentOptions();
  if (bottommost_) {
    options.bottommost_compression = compression_type_;
    options.bottommost_compression_opts.enabled = true;
    options.bottommost_compression_opts.max_dict_bytes = kDictLen;
    options.bottommost_compression_opts.max_dict_buffer_bytes = kBlockLen;
  } else {
    options.compression = compression_type_;
    options.compression_opts.max_dict_bytes = kDictLen;
    options.compression_opts.max_dict_buffer_bytes = kBlockLen;
  }
  options.disable_auto_compactions = true;
  options.statistics = CreateDBStatistics();
  BlockBasedTableOptions bbto;
  bbto.block_size = kBlockLen;
  bbto.cache_index_and_filter_blocks = true;
  options.table_factory.reset(NewBlockBasedTableFactory(bbto));
  Reopen(options);

  Random rnd(301);
  for (size_t j = 0; j <= kKeysPerFile; ++j) {
    ASSERT_OK(Put(Key(static_cast<int>(j)), rnd.RandomString(kValueLen)));
  }
  ASSERT_OK(Flush());
  MoveFilesToLevel(2);

  for (int i = 0; i < 2; ++i) {
    for (size_t j = 0; j <= kKeysPerFile; ++j) {
      ASSERT_OK(Put(Key(static_cast<int>(j)), rnd.RandomString(kValueLen)));
    }
    ASSERT_OK(Flush());
  }
  ASSERT_EQ("2,0,1", FilesPerLevel(0));

  uint64_t prev_compression_dict_bytes_inserted =
      TestGetTickerCount(options, BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT);
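  // The snapshot above lets the checks below measure only the dictionary
  // bytes added by this compaction, independent of anything the preceding
  // flushes may already have inserted into the block cache.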
  // This L0->L1 compaction merges the two L0 files into L1. The produced L1
  // file is not bottommost due to the existing L2 file covering the same key-
  // range.
  ASSERT_OK(dbfull()->TEST_CompactRange(0, nullptr, nullptr));
  ASSERT_EQ("0,1,1", FilesPerLevel(0));
// We can use `BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT` to detect whether a
|
|
|
|
// compression dictionary exists since dictionaries would be preloaded when
|
|
|
|
// the compaction finishes.
|
2020-11-03 03:20:15 +00:00
|
|
|
if (bottommost_) {
|
Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
Related changes include:
- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
Reviewed By: pdillinger
Differential Revision: D26467994
Pulled By: ajkr
fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 22:06:59 +00:00
|
|
|
ASSERT_EQ(
|
|
|
|
TestGetTickerCount(options, BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT),
|
|
|
|
prev_compression_dict_bytes_inserted);
|
2020-11-03 03:20:15 +00:00
|
|
|
} else {
|
2021-02-19 22:06:59 +00:00
|
|
|
ASSERT_GT(
|
|
|
|
TestGetTickerCount(options, BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT),
|
|
|
|
prev_compression_dict_bytes_inserted);
|
|
|
|
// TODO(ajkr): fix the below assertion to work with ZSTD. The expectation on
|
|
|
|
// number of bytes needs to be adjusted in case the cached block is in
|
|
|
|
// ZSTD's digested dictionary format.
|
|
|
|
if (compression_type_ != kZSTD &&
|
|
|
|
compression_type_ != kZSTDNotFinalCompression) {
|
|
|
|
// Although we limited buffering to `kBlockLen`, there may be up to two
|
|
|
|
// blocks of data included in the dictionary since we only check the limit
|
|
|
|
// after each block is built.
|
|
|
|
ASSERT_LE(TestGetTickerCount(options,
|
|
|
|
BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT),
|
|
|
|
prev_compression_dict_bytes_inserted + 2 * kBlockLen);
|
|
|
|
}
|
2020-11-03 03:20:15 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_P(PresetCompressionDictTest, CompactBottommost) {
|
|
|
|
// Verifies that dictionary is generated and written during compaction to
|
|
|
|
// bottommost level only when either `ColumnFamilyOptions::compression` or
|
2021-02-19 22:06:59 +00:00
|
|
|
// `ColumnFamilyOptions::bottommost_compression` enables dictionary. Also
|
|
|
|
// verifies the size of the dictionary is within expectations according to the
|
|
|
|
// limit on buffering set by `CompressionOptions::max_dict_buffer_bytes`.
|
2020-11-03 03:20:15 +00:00
|
|
|
const size_t kValueLen = 256;
|
|
|
|
const size_t kKeysPerFile = 1 << 10;
|
2021-02-19 22:06:59 +00:00
|
|
|
const size_t kDictLen = 16 << 10;
|
|
|
|
const size_t kBlockLen = 4 << 10;
|
2020-11-03 03:20:15 +00:00
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
if (bottommost_) {
|
|
|
|
options.bottommost_compression = compression_type_;
|
|
|
|
options.bottommost_compression_opts.enabled = true;
|
|
|
|
options.bottommost_compression_opts.max_dict_bytes = kDictLen;
|
2021-02-19 22:06:59 +00:00
|
|
|
options.bottommost_compression_opts.max_dict_buffer_bytes = kBlockLen;
|
2020-11-03 03:20:15 +00:00
|
|
|
} else {
|
|
|
|
options.compression = compression_type_;
|
|
|
|
options.compression_opts.max_dict_bytes = kDictLen;
|
2021-02-19 22:06:59 +00:00
|
|
|
options.compression_opts.max_dict_buffer_bytes = kBlockLen;
|
2020-11-03 03:20:15 +00:00
|
|
|
}
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.statistics = CreateDBStatistics();
|
|
|
|
BlockBasedTableOptions bbto;
|
2021-02-19 22:06:59 +00:00
|
|
|
bbto.block_size = kBlockLen;
|
2020-11-03 03:20:15 +00:00
|
|
|
bbto.cache_index_and_filter_blocks = true;
|
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
for (int i = 0; i < 2; ++i) {
|
|
|
|
for (size_t j = 0; j <= kKeysPerFile; ++j) {
|
|
|
|
ASSERT_OK(Put(Key(static_cast<int>(j)), rnd.RandomString(kValueLen)));
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
|
|
|
ASSERT_EQ("2", FilesPerLevel(0));
|
|
|
|
|
2021-02-19 22:06:59 +00:00
|
|
|
uint64_t prev_compression_dict_bytes_inserted =
|
|
|
|
TestGetTickerCount(options, BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT);
|
2020-11-03 03:20:15 +00:00
|
|
|
CompactRangeOptions cro;
|
|
|
|
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
|
|
|
|
ASSERT_EQ("0,1", FilesPerLevel(0));
|
2021-02-19 22:06:59 +00:00
|
|
|
ASSERT_GT(
|
|
|
|
TestGetTickerCount(options, BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT),
|
|
|
|
prev_compression_dict_bytes_inserted);
|
|
|
|
// TODO(ajkr): fix the below assertion to work with ZSTD. The expectation on
|
|
|
|
// number of bytes needs to be adjusted in case the cached block is in ZSTD's
|
|
|
|
// digested dictionary format.
|
|
|
|
if (compression_type_ != kZSTD &&
|
|
|
|
compression_type_ != kZSTDNotFinalCompression) {
|
|
|
|
// Although we limited buffering to `kBlockLen`, there may be up to two
|
|
|
|
// blocks of data included in the dictionary since we only check the limit after
|
|
|
|
// each block is built.
|
|
|
|
ASSERT_LE(
|
|
|
|
TestGetTickerCount(options, BLOCK_CACHE_COMPRESSION_DICT_BYTES_INSERT),
|
|
|
|
prev_compression_dict_bytes_inserted + 2 * kBlockLen);
|
|
|
|
}
|
2020-11-03 03:20:15 +00:00
|
|
|
}
|
|
|
|
|
2016-05-09 22:57:19 +00:00
|
|
|
class CompactionCompressionListener : public EventListener {
|
|
|
|
public:
|
|
|
|
explicit CompactionCompressionListener(Options* db_options)
|
|
|
|
: db_options_(db_options) {}
|
|
|
|
|
|
|
|
void OnCompactionCompleted(DB* db, const CompactionJobInfo& ci) override {
|
|
|
|
// Figure out last level with files
|
|
|
|
int bottommost_level = 0;
|
|
|
|
for (int level = 0; level < db->NumberLevels(); level++) {
|
|
|
|
std::string files_at_level;
|
2022-05-06 20:03:58 +00:00
|
|
|
ASSERT_TRUE(
|
|
|
|
db->GetProperty("rocksdb.num-files-at-level" + std::to_string(level),
|
|
|
|
&files_at_level));
|
2016-05-09 22:57:19 +00:00
|
|
|
if (files_at_level != "0") {
|
|
|
|
bottommost_level = level;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (db_options_->bottommost_compression != kDisableCompressionOption &&
|
2017-11-10 01:33:01 +00:00
|
|
|
ci.output_level == bottommost_level) {
|
2016-05-09 22:57:19 +00:00
|
|
|
ASSERT_EQ(ci.compression, db_options_->bottommost_compression);
|
|
|
|
} else if (db_options_->compression_per_level.size() != 0) {
|
|
|
|
ASSERT_EQ(ci.compression,
|
|
|
|
db_options_->compression_per_level[ci.output_level]);
|
|
|
|
} else {
|
|
|
|
ASSERT_EQ(ci.compression, db_options_->compression);
|
|
|
|
}
|
|
|
|
max_level_checked = std::max(max_level_checked, ci.output_level);
|
|
|
|
}
|
|
|
|
|
|
|
|
int max_level_checked = 0;
|
|
|
|
const Options* db_options_;
|
|
|
|
};
|
|
|
|
|
2020-06-11 23:32:51 +00:00
|
|
|
enum CompressionFailureType {
|
|
|
|
kTestCompressionFail,
|
|
|
|
kTestDecompressionFail,
|
|
|
|
kTestDecompressionCorruption
|
|
|
|
};
|
|
|
|
|
|
|
|
class CompressionFailuresTest
|
|
|
|
: public DBTest2,
|
|
|
|
public testing::WithParamInterface<std::tuple<
|
|
|
|
CompressionFailureType, CompressionType, uint32_t, uint32_t>> {
|
|
|
|
public:
|
|
|
|
CompressionFailuresTest() {
|
|
|
|
std::tie(compression_failure_type_, compression_type_,
|
|
|
|
compression_max_dict_bytes_, compression_parallel_threads_) =
|
|
|
|
GetParam();
|
|
|
|
}
|
|
|
|
|
|
|
|
CompressionFailureType compression_failure_type_ = kTestCompressionFail;
|
|
|
|
CompressionType compression_type_ = kNoCompression;
|
|
|
|
uint32_t compression_max_dict_bytes_ = 0;
|
|
|
|
uint32_t compression_parallel_threads_ = 0;
|
|
|
|
};
|
|
|
|
|
|
|
|
INSTANTIATE_TEST_CASE_P(
|
|
|
|
DBTest2, CompressionFailuresTest,
|
|
|
|
::testing::Combine(::testing::Values(kTestCompressionFail,
|
|
|
|
kTestDecompressionFail,
|
|
|
|
kTestDecompressionCorruption),
|
|
|
|
::testing::ValuesIn(GetSupportedCompressions()),
|
|
|
|
::testing::Values(0, 10), ::testing::Values(1, 4)));
|
|
|
|
|
|
|
|
TEST_P(CompressionFailuresTest, CompressionFailures) {
|
|
|
|
if (compression_type_ == kNoCompression) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2020-05-12 16:25:21 +00:00
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.level0_file_num_compaction_trigger = 2;
|
|
|
|
options.max_bytes_for_level_base = 1024;
|
|
|
|
options.max_bytes_for_level_multiplier = 2;
|
|
|
|
options.num_levels = 7;
|
|
|
|
options.max_background_compactions = 1;
|
|
|
|
options.target_file_size_base = 512;
|
|
|
|
|
|
|
|
BlockBasedTableOptions table_options;
|
|
|
|
table_options.block_size = 512;
|
|
|
|
table_options.verify_compression = true;
|
Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 17:31:34 +00:00
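For context on the `ROCKSDB_GTEST_SKIP` macro mentioned above, a hedged sketch of how a test might opt out when running against a non-default Env could look like this (the macro comes from the test harness per the summary above; the guard condition and message are illustrative assumptions, not taken from any real test):

// Sketch: skip a test that cannot run under a mock/encrypted Env.
// ROCKSDB_GTEST_SKIP marks the test as skipped (or as a plain success
// until the bundled gtest supports skipping). Guard is illustrative.
if (env_ != Env::Default()) {
  ROCKSDB_GTEST_SKIP("Requires the default Env");
  return;
}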
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
2020-05-12 16:25:21 +00:00
|
|
|
|
2020-06-11 23:32:51 +00:00
|
|
|
options.compression = compression_type_;
|
|
|
|
options.compression_opts.parallel_threads = compression_parallel_threads_;
|
|
|
|
options.compression_opts.max_dict_bytes = compression_max_dict_bytes_;
|
|
|
|
options.bottommost_compression_opts.parallel_threads =
|
|
|
|
compression_parallel_threads_;
|
|
|
|
options.bottommost_compression_opts.max_dict_bytes =
|
|
|
|
compression_max_dict_bytes_;
|
|
|
|
|
|
|
|
if (compression_failure_type_ == kTestCompressionFail) {
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2020-08-13 01:24:27 +00:00
|
|
|
"CompressData:TamperWithReturnValue", [](void* arg) {
|
2020-06-11 23:32:51 +00:00
|
|
|
bool* ret = static_cast<bool*>(arg);
|
2020-05-12 16:25:21 +00:00
|
|
|
*ret = false;
|
2020-06-11 23:32:51 +00:00
|
|
|
});
|
|
|
|
} else if (compression_failure_type_ == kTestDecompressionFail) {
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
Refactor to avoid confusing "raw block" (#10408)
Summary:
We have a lot of confusing code because of mixed, sometimes
completely opposite uses of the term "raw block" or "raw contents",
sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
`raw_block_contents` and `raw_size` generally referred to uncompressed block
contents and size, while `WriteRawBlock` referred to writing a block that
is already compressed if it is going to be. Meanwhile, in
`BlockBasedTable`, `raw_block_contents` either referred to a (maybe
compressed) block with trailer, or a maybe compressed block maybe
without trailer. (Note: left as follow-up work to use C++ typing to
better sort out the various kinds of BlockContents.)
This change primarily tries to apply some consistent terminology around
the kinds of block representations, avoiding the unclear "raw". (Any
meaning of "raw" assumes some bias toward the storage layer or toward
the logical data layer.) Preferred terminology:
* **Serialized block** - bytes that go into storage. For block-based table
(usually the case) this includes the block trailer. WART: block `size` may or
may not include the trailer; need to be clear about whether it does or not.
* **Maybe compressed block** - like a serialized block, but without the
trailer (or no promise of including a trailer). Must be accompanied by a
CompressionType.
* **Uncompressed block** - "payload" bytes that are either stored with no
compression, used as input to compression function, or result of
decompression function.
* **Parsed block** - an in-memory form of a block in block cache, as it is
used by the table reader. Different C++ types are used depending on the
block type (see block_like_traits.h).
Other refactorings:
* Misc corrections/improvements of internal API comments
* Remove a few misleading / unhelpful / redundant comments.
* Use move semantics in some places to simplify contracts
* Use better parameter names to indicate which parameters are used for
outputs
* Remove some extraneous `extern`
* Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408
Test Plan: existing tests
Reviewed By: akankshamahajan15
Differential Revision: D38172617
Pulled By: pdillinger
fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
2022-09-22 18:25:32 +00:00
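To make the terminology above concrete, the size relationship it implies can be sketched as follows (a descriptive sketch only; the 5-byte trailer — one compression-type byte plus a 32-bit checksum — is assumed from the block-based table format described in the summary):

// Sketch: size relationship implied by the terminology above, assuming the
// block-based table trailer (1 compression-type byte + 4 checksum bytes).
inline size_t SerializedBlockSize(size_t maybe_compressed_size) {
  constexpr size_t kAssumedTrailerSize = 5;  // assumption: type byte + 32-bit checksum
  return maybe_compressed_size + kAssumedTrailerSize;
}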
|
|
|
"UncompressBlockData:TamperWithReturnValue", [](void* arg) {
|
2020-06-11 23:32:51 +00:00
|
|
|
Status* ret = static_cast<Status*>(arg);
|
|
|
|
ASSERT_OK(*ret);
|
2020-05-12 16:25:21 +00:00
|
|
|
*ret = Status::Corruption("kTestDecompressionFail");
|
2020-06-11 23:32:51 +00:00
|
|
|
});
|
|
|
|
} else if (compression_failure_type_ == kTestDecompressionCorruption) {
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2022-09-22 18:25:32 +00:00
|
|
|
"UncompressBlockData:"
|
2020-06-11 23:32:51 +00:00
|
|
|
"TamperWithDecompressionOutput",
|
|
|
|
[](void* arg) {
|
2020-05-12 16:25:21 +00:00
|
|
|
BlockContents* contents = static_cast<BlockContents*>(arg);
|
|
|
|
// Ensure uncompressed data != original data
|
2020-05-15 01:48:06 +00:00
|
|
|
const size_t len = contents->data.size() + 1;
|
|
|
|
std::unique_ptr<char[]> fake_data(new char[len]());
|
|
|
|
*contents = BlockContents(std::move(fake_data), len);
|
2020-06-11 23:32:51 +00:00
|
|
|
});
|
|
|
|
}
|
2020-05-12 16:25:21 +00:00
|
|
|
|
|
|
|
std::map<std::string, std::string> key_value_written;
|
|
|
|
|
|
|
|
const int kKeySize = 5;
|
|
|
|
const int kValUnitSize = 16;
|
|
|
|
const int kValSize = 256;
|
|
|
|
Random rnd(405);
|
|
|
|
|
|
|
|
Status s = Status::OK();
|
|
|
|
|
2020-06-11 23:32:51 +00:00
|
|
|
DestroyAndReopen(options);
|
|
|
|
// Write 10 random files
|
|
|
|
for (int i = 0; i < 10; i++) {
|
|
|
|
for (int j = 0; j < 5; j++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
std::string key = rnd.RandomString(kKeySize);
|
2020-06-11 23:32:51 +00:00
|
|
|
// Ensure good compression ratio
|
2020-07-09 21:33:42 +00:00
|
|
|
std::string valueUnit = rnd.RandomString(kValUnitSize);
|
2020-06-11 23:32:51 +00:00
|
|
|
std::string value;
|
|
|
|
for (int k = 0; k < kValSize; k += kValUnitSize) {
|
|
|
|
value += valueUnit;
|
2020-05-12 16:25:21 +00:00
|
|
|
}
|
2020-06-11 23:32:51 +00:00
|
|
|
s = Put(key, value);
|
|
|
|
if (compression_failure_type_ == kTestCompressionFail) {
|
|
|
|
key_value_written[key] = value;
|
|
|
|
ASSERT_OK(s);
|
2020-05-12 16:25:21 +00:00
|
|
|
}
|
|
|
|
}
|
2020-06-11 23:32:51 +00:00
|
|
|
s = Flush();
|
|
|
|
if (compression_failure_type_ == kTestCompressionFail) {
|
|
|
|
ASSERT_OK(s);
|
|
|
|
}
|
|
|
|
s = dbfull()->TEST_WaitForCompact();
|
|
|
|
if (compression_failure_type_ == kTestCompressionFail) {
|
|
|
|
ASSERT_OK(s);
|
|
|
|
}
|
|
|
|
if (i == 4) {
|
|
|
|
// Make compression fail in the middle of table building
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
|
|
|
|
if (compression_failure_type_ == kTestCompressionFail) {
|
|
|
|
// Should be kNoCompression, check content consistency
|
|
|
|
std::unique_ptr<Iterator> db_iter(db_->NewIterator(ReadOptions()));
|
|
|
|
for (db_iter->SeekToFirst(); db_iter->Valid(); db_iter->Next()) {
|
|
|
|
std::string key = db_iter->key().ToString();
|
|
|
|
std::string value = db_iter->value().ToString();
|
|
|
|
ASSERT_NE(key_value_written.find(key), key_value_written.end());
|
|
|
|
ASSERT_EQ(key_value_written[key], value);
|
|
|
|
key_value_written.erase(key);
|
|
|
|
}
|
2023-10-18 16:38:38 +00:00
|
|
|
ASSERT_OK(db_iter->status());
|
2020-06-11 23:32:51 +00:00
|
|
|
ASSERT_EQ(0, key_value_written.size());
|
|
|
|
} else if (compression_failure_type_ == kTestDecompressionFail) {
|
|
|
|
ASSERT_EQ(std::string(s.getState()),
|
|
|
|
"Could not decompress: kTestDecompressionFail");
|
|
|
|
} else if (compression_failure_type_ == kTestDecompressionCorruption) {
|
|
|
|
ASSERT_EQ(std::string(s.getState()),
|
2022-09-22 18:25:32 +00:00
|
|
|
"Decompressed block did not match pre-compression block");
|
2020-05-12 16:25:21 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-05-09 22:57:19 +00:00
|
|
|
TEST_F(DBTest2, CompressionOptions) {
|
|
|
|
if (!Zlib_Supported() || !Snappy_Supported()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.level0_file_num_compaction_trigger = 2;
|
|
|
|
options.max_bytes_for_level_base = 100;
|
|
|
|
options.max_bytes_for_level_multiplier = 2;
|
|
|
|
options.num_levels = 7;
|
|
|
|
options.max_background_compactions = 1;
|
|
|
|
|
|
|
|
CompactionCompressionListener* listener =
|
|
|
|
new CompactionCompressionListener(&options);
|
|
|
|
options.listeners.emplace_back(listener);
|
|
|
|
|
|
|
|
const int kKeySize = 5;
|
|
|
|
const int kValSize = 20;
|
|
|
|
Random rnd(301);
|
|
|
|
|
2020-04-01 23:37:54 +00:00
|
|
|
std::vector<uint32_t> compression_parallel_threads = {1, 4};
|
|
|
|
|
|
|
|
std::map<std::string, std::string> key_value_written;
|
|
|
|
|
2016-05-09 22:57:19 +00:00
|
|
|
for (int iter = 0; iter <= 2; iter++) {
|
|
|
|
listener->max_level_checked = 0;
|
|
|
|
|
|
|
|
if (iter == 0) {
|
|
|
|
// Use different compression algorithms for different levels but
|
|
|
|
// always use Zlib for bottommost level
|
|
|
|
options.compression_per_level = {kNoCompression, kNoCompression,
|
|
|
|
kNoCompression, kSnappyCompression,
|
|
|
|
kSnappyCompression, kSnappyCompression,
|
|
|
|
kZlibCompression};
|
|
|
|
options.compression = kNoCompression;
|
|
|
|
options.bottommost_compression = kZlibCompression;
|
|
|
|
} else if (iter == 1) {
|
|
|
|
// Use Snappy everywhere except the bottommost level, which uses Zlib
|
|
|
|
options.compression_per_level = {};
|
|
|
|
options.compression = kSnappyCompression;
|
|
|
|
options.bottommost_compression = kZlibCompression;
|
|
|
|
} else if (iter == 2) {
|
|
|
|
// Use Snappy everywhere
|
|
|
|
options.compression_per_level = {};
|
|
|
|
options.compression = kSnappyCompression;
|
|
|
|
options.bottommost_compression = kDisableCompressionOption;
|
|
|
|
}
|
|
|
|
|
2020-04-01 23:37:54 +00:00
|
|
|
for (auto num_threads : compression_parallel_threads) {
|
|
|
|
options.compression_opts.parallel_threads = num_threads;
|
|
|
|
options.bottommost_compression_opts.parallel_threads = num_threads;
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
// Write 10 random files
|
|
|
|
for (int i = 0; i < 10; i++) {
|
|
|
|
for (int j = 0; j < 5; j++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
std::string key = rnd.RandomString(kKeySize);
|
|
|
|
std::string value = rnd.RandomString(kValSize);
|
2020-04-01 23:37:54 +00:00
|
|
|
key_value_written[key] = value;
|
|
|
|
ASSERT_OK(Put(key, value));
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2016-05-09 22:57:19 +00:00
|
|
|
}
|
|
|
|
|
2020-04-01 23:37:54 +00:00
|
|
|
// Make sure that we wrote enough to check all 7 levels
|
|
|
|
ASSERT_EQ(listener->max_level_checked, 6);
|
|
|
|
|
|
|
|
// Make sure database content is the same as key_value_written
|
|
|
|
std::unique_ptr<Iterator> db_iter(db_->NewIterator(ReadOptions()));
|
|
|
|
for (db_iter->SeekToFirst(); db_iter->Valid(); db_iter->Next()) {
|
|
|
|
std::string key = db_iter->key().ToString();
|
|
|
|
std::string value = db_iter->value().ToString();
|
|
|
|
ASSERT_NE(key_value_written.find(key), key_value_written.end());
|
|
|
|
ASSERT_EQ(key_value_written[key], value);
|
|
|
|
key_value_written.erase(key);
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(db_iter->status());
|
2020-04-01 23:37:54 +00:00
|
|
|
ASSERT_EQ(0, key_value_written.size());
|
|
|
|
}
|
2016-05-09 22:57:19 +00:00
|
|
|
}
|
|
|
|
}
|
2016-05-18 21:56:30 +00:00
|
|
|
|
|
|
|
class CompactionStallTestListener : public EventListener {
|
|
|
|
public:
|
2022-11-02 21:34:24 +00:00
|
|
|
CompactionStallTestListener()
|
|
|
|
: compacting_files_cnt_(0), compacted_files_cnt_(0) {}
|
2018-10-11 00:30:22 +00:00
|
|
|
|
|
|
|
void OnCompactionBegin(DB* /*db*/, const CompactionJobInfo& ci) override {
|
|
|
|
ASSERT_EQ(ci.cf_name, "default");
|
|
|
|
ASSERT_EQ(ci.base_input_level, 0);
|
|
|
|
ASSERT_EQ(ci.compaction_reason, CompactionReason::kLevelL0FilesNum);
|
|
|
|
compacting_files_cnt_ += ci.input_files.size();
|
|
|
|
}
|
2016-05-18 21:56:30 +00:00
|
|
|
|
2018-03-05 21:08:17 +00:00
|
|
|
void OnCompactionCompleted(DB* /*db*/, const CompactionJobInfo& ci) override {
|
2016-05-18 21:56:30 +00:00
|
|
|
ASSERT_EQ(ci.cf_name, "default");
|
|
|
|
ASSERT_EQ(ci.base_input_level, 0);
|
|
|
|
ASSERT_EQ(ci.compaction_reason, CompactionReason::kLevelL0FilesNum);
|
|
|
|
compacted_files_cnt_ += ci.input_files.size();
|
|
|
|
}
|
2018-10-11 00:30:22 +00:00
|
|
|
|
|
|
|
std::atomic<size_t> compacting_files_cnt_;
|
2016-05-18 21:56:30 +00:00
|
|
|
std::atomic<size_t> compacted_files_cnt_;
|
|
|
|
};
|
|
|
|
|
|
|
|
TEST_F(DBTest2, CompactionStall) {
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
|
2016-05-18 21:56:30 +00:00
|
|
|
{{"DBImpl::BGWorkCompaction", "DBTest2::CompactionStall:0"},
|
|
|
|
{"DBImpl::BGWorkCompaction", "DBTest2::CompactionStall:1"},
|
|
|
|
{"DBTest2::CompactionStall:2",
|
2018-10-11 00:30:22 +00:00
|
|
|
"DBImpl::NotifyOnCompactionBegin::UnlockMutex"},
|
|
|
|
{"DBTest2::CompactionStall:3",
|
2016-05-18 21:56:30 +00:00
|
|
|
"DBImpl::NotifyOnCompactionCompleted::UnlockMutex"}});
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2016-05-18 21:56:30 +00:00
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.level0_file_num_compaction_trigger = 4;
|
|
|
|
options.max_background_compactions = 40;
|
|
|
|
CompactionStallTestListener* listener = new CompactionStallTestListener();
|
|
|
|
options.listeners.emplace_back(listener);
|
|
|
|
DestroyAndReopen(options);
|
2017-05-24 18:25:38 +00:00
|
|
|
// make sure all background compaction jobs can be scheduled
|
|
|
|
auto stop_token =
|
|
|
|
dbfull()->TEST_write_controler().GetCompactionPressureToken();
|
2016-05-18 21:56:30 +00:00
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
|
|
|
|
// 4 Files in L0
|
|
|
|
for (int i = 0; i < 4; i++) {
|
|
|
|
for (int j = 0; j < 10; j++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
ASSERT_OK(Put(rnd.RandomString(10), rnd.RandomString(10)));
|
2016-05-18 21:56:30 +00:00
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
|
|
|
|
|
|
|
// Wait for compaction to be triggered
|
|
|
|
TEST_SYNC_POINT("DBTest2::CompactionStall:0");
|
|
|
|
|
|
|
|
// Clear "DBImpl::BGWorkCompaction" SYNC_POINT since we want to hold it again
|
|
|
|
// at DBTest2::CompactionStall::1
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearTrace();
|
2016-05-18 21:56:30 +00:00
|
|
|
|
|
|
|
// Another 6 L0 files to trigger compaction again
|
|
|
|
for (int i = 0; i < 6; i++) {
|
|
|
|
for (int j = 0; j < 10; j++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
ASSERT_OK(Put(rnd.RandomString(10), rnd.RandomString(10)));
|
2016-05-18 21:56:30 +00:00
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
|
|
|
|
|
|
|
// Wait for another compaction to be triggered
|
|
|
|
TEST_SYNC_POINT("DBTest2::CompactionStall:1");
|
|
|
|
|
2018-10-11 00:30:22 +00:00
|
|
|
// Hold NotifyOnCompactionBegin in the unlock mutex section
|
2016-05-18 21:56:30 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::CompactionStall:2");
|
|
|
|
|
2018-10-11 00:30:22 +00:00
|
|
|
// Hold NotifyOnCompactionCompleted in the unlock mutex section
|
|
|
|
TEST_SYNC_POINT("DBTest2::CompactionStall:3");
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2016-05-18 21:56:30 +00:00
|
|
|
ASSERT_LT(NumTableFilesAtLevel(0),
|
|
|
|
options.level0_file_num_compaction_trigger);
|
|
|
|
ASSERT_GT(listener->compacted_files_cnt_.load(),
|
|
|
|
10 - options.level0_file_num_compaction_trigger);
|
2022-11-02 21:34:24 +00:00
|
|
|
ASSERT_EQ(listener->compacting_files_cnt_.load(),
|
|
|
|
listener->compacted_files_cnt_.load());
|
2016-05-18 21:56:30 +00:00
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2016-05-18 21:56:30 +00:00
|
|
|
}
|
|
|
|
|
2016-05-09 23:08:30 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, FirstSnapshotTest) {
|
|
|
|
Options options;
|
|
|
|
options.write_buffer_size = 100000; // Small write buffer
|
|
|
|
options = CurrentOptions(options);
|
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
|
|
|
|
|
|
|
// This snapshot will have sequence number 0, which is the expected behaviour.
|
|
|
|
const Snapshot* s1 = db_->GetSnapshot();
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put(1, "k1", std::string(100000, 'x'))); // Fill memtable
|
|
|
|
ASSERT_OK(Put(1, "k2", std::string(100000, 'y'))); // Trigger flush
|
2016-05-09 23:08:30 +00:00
|
|
|
|
|
|
|
db_->ReleaseSnapshot(s1);
|
|
|
|
}
|
2016-05-09 22:57:19 +00:00
|
|
|
|
2019-01-10 00:09:36 +00:00
|
|
|
TEST_F(DBTest2, DuplicateSnapshot) {
|
|
|
|
Options options;
|
|
|
|
options = CurrentOptions(options);
|
|
|
|
std::vector<const Snapshot*> snapshots;
|
2020-07-03 02:24:25 +00:00
|
|
|
DBImpl* dbi = static_cast_with_check<DBImpl>(db_);
|
2019-01-10 00:09:36 +00:00
|
|
|
SequenceNumber oldest_ww_snap, first_ww_snap;
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("k", "v")); // inc seq
|
2019-01-10 00:09:36 +00:00
|
|
|
snapshots.push_back(db_->GetSnapshot());
|
|
|
|
snapshots.push_back(db_->GetSnapshot());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("k", "v")); // inc seq
|
2019-01-10 00:09:36 +00:00
|
|
|
snapshots.push_back(db_->GetSnapshot());
|
|
|
|
snapshots.push_back(dbi->GetSnapshotForWriteConflictBoundary());
|
|
|
|
first_ww_snap = snapshots.back()->GetSequenceNumber();
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("k", "v")); // inc seq
|
2019-01-10 00:09:36 +00:00
|
|
|
snapshots.push_back(dbi->GetSnapshotForWriteConflictBoundary());
|
|
|
|
snapshots.push_back(db_->GetSnapshot());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("k", "v")); // inc seq
|
2019-01-10 00:09:36 +00:00
|
|
|
snapshots.push_back(db_->GetSnapshot());
|
|
|
|
|
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(dbi->mutex());
|
|
|
|
auto seqs = dbi->snapshots().GetAll(&oldest_ww_snap);
|
|
|
|
ASSERT_EQ(seqs.size(), 4); // duplicates are not counted
|
|
|
|
ASSERT_EQ(oldest_ww_snap, first_ww_snap);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (auto s : snapshots) {
|
|
|
|
db_->ReleaseSnapshot(s);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-01-08 20:44:56 +00:00
|
|
|
class PinL0IndexAndFilterBlocksTest
|
|
|
|
: public DBTestBase,
|
|
|
|
public testing::WithParamInterface<std::tuple<bool, bool>> {
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when an L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, i.e., the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 17:42:39 +00:00
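Condensing the setup that the tests below repeat, a minimal sketch of enabling the pinning feature described above is (the option names are the real ones used in this file; the variable names and bits-per-key value are illustrative):

// Sketch: pin L0 filter/index blocks in the block cache.
BlockBasedTableOptions bbt;
bbt.cache_index_and_filter_blocks = true;            // required for pinning to matter
bbt.pin_l0_filter_and_index_blocks_in_cache = true;  // keep L0 filter/index resident
bbt.filter_policy.reset(NewBloomFilterPolicy(10));
Options pin_opts = CurrentOptions();
pin_opts.table_factory.reset(NewBlockBasedTableFactory(bbt));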
|
|
|
public:
|
2020-08-18 01:41:20 +00:00
|
|
|
PinL0IndexAndFilterBlocksTest()
|
2021-07-23 15:37:27 +00:00
|
|
|
: DBTestBase("db_pin_l0_index_bloom_test", /*env_do_fsync=*/true) {}
|
2019-02-14 21:52:47 +00:00
|
|
|
void SetUp() override {
|
2019-01-08 20:44:56 +00:00
|
|
|
infinite_max_files_ = std::get<0>(GetParam());
|
|
|
|
disallow_preload_ = std::get<1>(GetParam());
|
|
|
|
}
|
2016-04-01 17:42:39 +00:00
|
|
|
|
2017-03-22 16:11:23 +00:00
|
|
|
void CreateTwoLevels(Options* options, bool close_afterwards) {
|
2016-07-20 18:23:31 +00:00
|
|
|
if (infinite_max_files_) {
|
|
|
|
options->max_open_files = -1;
|
|
|
|
}
|
|
|
|
options->create_if_missing = true;
|
2020-02-20 20:07:53 +00:00
|
|
|
options->statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
|
2016-07-20 18:23:31 +00:00
|
|
|
BlockBasedTableOptions table_options;
|
|
|
|
table_options.cache_index_and_filter_blocks = true;
|
|
|
|
table_options.pin_l0_filter_and_index_blocks_in_cache = true;
|
|
|
|
table_options.filter_policy.reset(NewBloomFilterPolicy(20));
|
2020-10-27 17:31:34 +00:00
|
|
|
options->table_factory.reset(NewBlockBasedTableFactory(table_options));
|
2016-07-20 18:23:31 +00:00
|
|
|
CreateAndReopenWithCF({"pikachu"}, *options);
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put(1, "a", "begin"));
|
|
|
|
ASSERT_OK(Put(1, "z", "end"));
|
2016-07-20 18:23:31 +00:00
|
|
|
ASSERT_OK(Flush(1));
|
|
|
|
// move this table to L1
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_CompactRange(0, nullptr, nullptr, handles_[1]));
|
2023-06-01 22:27:29 +00:00
|
|
|
ASSERT_EQ(1, NumTableFilesAtLevel(1, 1));
|
2016-07-20 18:23:31 +00:00
|
|
|
|
|
|
|
// reset block cache
|
|
|
|
table_options.block_cache = NewLRUCache(64 * 1024);
|
|
|
|
options->table_factory.reset(NewBlockBasedTableFactory(table_options));
|
2023-08-09 22:46:44 +00:00
|
|
|
ASSERT_OK(TryReopenWithColumnFamilies({"default", "pikachu"}, *options));
|
2016-07-20 18:23:31 +00:00
|
|
|
// create new table at L0
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put(1, "a2", "begin2"));
|
|
|
|
ASSERT_OK(Put(1, "z2", "end2"));
|
2016-07-20 18:23:31 +00:00
|
|
|
ASSERT_OK(Flush(1));
|
|
|
|
|
2017-03-22 16:11:23 +00:00
|
|
|
if (close_afterwards) {
|
|
|
|
Close(); // This ensures that there is no ref to block cache entries
|
|
|
|
}
|
2016-07-20 18:23:31 +00:00
|
|
|
table_options.block_cache->EraseUnRefEntries();
|
|
|
|
}
|
|
|
|
|
2016-04-01 17:42:39 +00:00
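As a quick illustration of the option described in this summary, here is a hedged sketch of how an application might enable it; the option and factory names come from the summary and the test below, while the cache size and bloom bits are arbitrary illustration values.

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Sketch: pin L0 filter/index blocks in the block cache. Option names come
// from the summary and the test below; sizes are arbitrary.
rocksdb::Options MakePinningOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.cache_index_and_filter_blocks = true;  // route blocks through the cache
  table_options.pin_l0_filter_and_index_blocks_in_cache = true;
  table_options.block_cache = rocksdb::NewLRUCache(64 << 20);  // 64 MiB
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```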
|
|
|
bool infinite_max_files_;
|
2019-01-08 20:44:56 +00:00
|
|
|
bool disallow_preload_;
|
2016-04-01 17:42:39 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
TEST_P(PinL0IndexAndFilterBlocksTest,
|
|
|
|
IndexAndFilterBlocksOfNewTableAddedToCacheWithPinning) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
if (infinite_max_files_) {
|
|
|
|
options.max_open_files = -1;
|
|
|
|
}
|
|
|
|
options.create_if_missing = true;
|
2020-02-20 20:07:53 +00:00
|
|
|
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
|
2016-04-01 17:42:39 +00:00
|
|
|
BlockBasedTableOptions table_options;
|
|
|
|
table_options.cache_index_and_filter_blocks = true;
|
|
|
|
table_options.pin_l0_filter_and_index_blocks_in_cache = true;
|
|
|
|
table_options.filter_policy.reset(NewBloomFilterPolicy(20));
|
2020-10-27 17:31:34 +00:00
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
2016-04-01 17:42:39 +00:00
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put(1, "key", "val"));
|
|
|
|
// Create a new table.
|
|
|
|
ASSERT_OK(Flush(1));
|
|
|
|
|
|
|
|
// index/filter blocks added to block cache right after table creation.
|
|
|
|
ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
|
|
|
|
// only index/filter were added
|
|
|
|
ASSERT_EQ(2, TestGetTickerCount(options, BLOCK_CACHE_ADD));
|
|
|
|
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_DATA_MISS));
|
|
|
|
|
|
|
|
std::string value;
|
|
|
|
// Miss and hit count should remain the same, they're all pinned.
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(db_->KeyMayExist(ReadOptions(), handles_[1], "key", &value));
|
2016-04-01 17:42:39 +00:00
|
|
|
ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
|
|
|
|
// Miss and hit count should remain the same, they're all pinned.
|
|
|
|
value = Get(1, "key");
|
|
|
|
ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_P(PinL0IndexAndFilterBlocksTest,
|
|
|
|
MultiLevelIndexAndFilterBlocksCachedWithPinning) {
|
|
|
|
Options options = CurrentOptions();
|
2017-03-22 16:11:23 +00:00
|
|
|
PinL0IndexAndFilterBlocksTest::CreateTwoLevels(&options, false);
|
2016-04-01 17:42:39 +00:00
|
|
|
// get base cache values
|
|
|
|
uint64_t fm = TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS);
|
|
|
|
uint64_t fh = TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT);
|
|
|
|
uint64_t im = TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS);
|
|
|
|
uint64_t ih = TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT);
|
|
|
|
|
|
|
|
std::string value;
|
|
|
|
// this should be read from L0
|
|
|
|
// so cache values don't change
|
|
|
|
value = Get(1, "a2");
|
|
|
|
ASSERT_EQ(fm, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(ih, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
|
|
|
|
// this should be read from L1
|
|
|
|
// the file is opened, prefetching results in a cache filter miss
|
|
|
|
// the block is loaded and added to the cache,
|
|
|
|
// then the Get() results in a cache hit for L1
|
2016-07-20 18:23:31 +00:00
|
|
|
// When we have infinite max_open_files, there is still a cache miss because we have
|
|
|
|
// reset the block cache
|
2016-04-01 17:42:39 +00:00
|
|
|
value = Get(1, "a");
|
|
|
|
ASSERT_EQ(fm + 1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(im + 1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
}
|
|
|
|
|
2016-07-20 18:23:31 +00:00
|
|
|
TEST_P(PinL0IndexAndFilterBlocksTest, DisablePrefetchingNonL0IndexAndFilter) {
|
|
|
|
Options options = CurrentOptions();
|
2017-03-22 16:11:23 +00:00
|
|
|
// This ensures that db does not ref anything in the block cache, so
|
|
|
|
// EraseUnRefEntries could clear them up.
|
|
|
|
bool close_afterwards = true;
|
|
|
|
PinL0IndexAndFilterBlocksTest::CreateTwoLevels(&options, close_afterwards);
|
2016-07-20 18:23:31 +00:00
|
|
|
|
|
|
|
// Get base cache values
|
|
|
|
uint64_t fm = TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS);
|
|
|
|
uint64_t fh = TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT);
|
|
|
|
uint64_t im = TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS);
|
|
|
|
uint64_t ih = TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT);
|
|
|
|
|
2019-01-08 20:44:56 +00:00
|
|
|
if (disallow_preload_) {
|
2018-12-29 02:00:00 +00:00
|
|
|
// Now we have two files. We narrow the max open files to allow 3 entries
|
|
|
|
// so that preloading SST files won't happen.
|
|
|
|
options.max_open_files = 13;
|
|
|
|
// RocksDB sanitizes max_open_files to at least 20. Modify it back.
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2018-12-29 02:00:00 +00:00
|
|
|
"SanitizeOptions::AfterChangeMaxOpenFiles", [&](void* arg) {
|
|
|
|
int* max_open_files = static_cast<int*>(arg);
|
|
|
|
*max_open_files = 13;
|
|
|
|
});
|
|
|
|
}
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2018-12-29 02:00:00 +00:00
|
|
|
|
2016-07-20 18:23:31 +00:00
|
|
|
// Reopen database. If max_open_files is set to -1, table readers will be
|
|
|
|
// preloaded. This will trigger a BlockBasedTable::Open() and prefetch
|
|
|
|
// L0 index and filter. Level 1's prefetching is disabled in DB::Open()
|
2023-08-09 22:46:44 +00:00
|
|
|
ASSERT_OK(TryReopenWithColumnFamilies({"default", "pikachu"}, options));
|
2016-07-20 18:23:31 +00:00
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2018-12-29 02:00:00 +00:00
|
|
|
|
2019-01-08 20:44:56 +00:00
|
|
|
if (!disallow_preload_) {
|
2016-07-20 18:23:31 +00:00
|
|
|
// After reopen, cache misses are increased by one because we read (and only
|
|
|
|
// read) filter and index on L0
|
|
|
|
ASSERT_EQ(fm + 1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im + 1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(ih, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
} else {
|
|
|
|
// If max_open_files is not -1, we do not preload table readers, so there is
|
|
|
|
// no change.
|
|
|
|
ASSERT_EQ(fm, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(ih, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
}
|
|
|
|
std::string value;
|
|
|
|
// this should be read from L0
|
|
|
|
value = Get(1, "a2");
|
|
|
|
// If max_open_files is -1, we have pinned index and filter in Rep, so there
|
|
|
|
// will not be changes in index and filter misses or hits. If max_open_files
|
|
|
|
// is not -1, Get() will open a TableReader and prefetch index and filter.
|
|
|
|
ASSERT_EQ(fm + 1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im + 1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(ih, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
|
|
|
|
// this should be read from L1
|
|
|
|
value = Get(1, "a");
|
2019-01-08 20:44:56 +00:00
|
|
|
if (!disallow_preload_) {
|
2023-06-01 22:27:29 +00:00
|
|
|
// In the infinite max_open_files case, there's a cache miss when executing Get()
|
2016-07-20 18:23:31 +00:00
|
|
|
// because index and filter are not prefetched before.
|
|
|
|
ASSERT_EQ(fm + 2, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im + 2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(ih, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
} else {
|
|
|
|
// In this case, cache miss will be increased by one in
|
|
|
|
// BlockBasedTable::Open() because this is not in the DB::Open() code path, so we
|
|
|
|
// will prefetch L1's index and filter. Cache hit will also be increased by
|
|
|
|
// one because Get() will read index and filter from the block cache
|
|
|
|
// prefetched in previous Open() call.
|
|
|
|
ASSERT_EQ(fm + 2, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh + 1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im + 2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
|
|
|
ASSERT_EQ(ih + 1, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
|
|
|
}
|
2019-01-08 20:44:56 +00:00
|
|
|
|
|
|
|
// Force a full compaction to one single file. There will be a block
|
|
|
|
// cache read for both the index and filter. If prefetch doesn't explicitly
|
|
|
|
// happen, it will happen when verifying the file.
|
|
|
|
Compact(1, "a", "zzzzz");
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2019-01-08 20:44:56 +00:00
|
|
|
|
|
|
|
if (!disallow_preload_) {
|
|
|
|
ASSERT_EQ(fm + 3, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im + 3, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
2023-06-01 22:27:29 +00:00
|
|
|
ASSERT_EQ(ih + 2, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
2019-01-08 20:44:56 +00:00
|
|
|
} else {
|
|
|
|
ASSERT_EQ(fm + 3, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh + 1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im + 3, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
2023-06-01 22:27:29 +00:00
|
|
|
ASSERT_EQ(ih + 3, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
2019-01-08 20:44:56 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Bloom filter and index hits will happen when a Get() happens.
|
|
|
|
value = Get(1, "a");
|
|
|
|
if (!disallow_preload_) {
|
|
|
|
ASSERT_EQ(fm + 3, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh + 1, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im + 3, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
2023-06-01 22:27:29 +00:00
|
|
|
ASSERT_EQ(ih + 3, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
2019-01-08 20:44:56 +00:00
|
|
|
} else {
|
|
|
|
ASSERT_EQ(fm + 3, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
|
|
|
|
ASSERT_EQ(fh + 2, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
|
|
|
|
ASSERT_EQ(im + 3, TestGetTickerCount(options, BLOCK_CACHE_INDEX_MISS));
|
2023-06-01 22:27:29 +00:00
|
|
|
ASSERT_EQ(ih + 4, TestGetTickerCount(options, BLOCK_CACHE_INDEX_HIT));
|
2019-01-08 20:44:56 +00:00
|
|
|
}
|
2016-07-20 18:23:31 +00:00
|
|
|
}
|
|
|
|
|
2020-06-03 22:53:09 +00:00
|
|
|
INSTANTIATE_TEST_CASE_P(PinL0IndexAndFilterBlocksTest,
|
|
|
|
PinL0IndexAndFilterBlocksTest,
|
|
|
|
::testing::Values(std::make_tuple(true, false),
|
|
|
|
std::make_tuple(false, false),
|
|
|
|
std::make_tuple(false, true)));
|
2016-06-16 23:02:52 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, MaxCompactionBytesTest) {
|
|
|
|
Options options = CurrentOptions();
|
2021-09-08 14:45:59 +00:00
|
|
|
options.memtable_factory.reset(test::NewSpecialSkipListFactory(
|
|
|
|
DBTestBase::kNumKeysByGenerateNewRandomFile));
|
2016-06-16 23:02:52 +00:00
|
|
|
options.compaction_style = kCompactionStyleLevel;
|
|
|
|
options.write_buffer_size = 200 << 10;
|
|
|
|
options.arena_block_size = 4 << 10;
|
|
|
|
options.level0_file_num_compaction_trigger = 4;
|
|
|
|
options.num_levels = 4;
|
|
|
|
options.compression = kNoCompression;
|
|
|
|
options.max_bytes_for_level_base = 450 << 10;
|
|
|
|
options.target_file_size_base = 100 << 10;
|
|
|
|
// Infinite for full compaction.
|
|
|
|
options.max_compaction_bytes = options.target_file_size_base * 100;
|
|
|
|
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
|
|
|
|
for (int num = 0; num < 8; num++) {
|
|
|
|
GenerateNewRandomFile(&rnd);
|
|
|
|
}
|
|
|
|
CompactRangeOptions cro;
|
2022-07-05 17:10:37 +00:00
|
|
|
cro.bottommost_level_compaction = BottommostLevelCompaction::kForce;
|
2016-06-16 23:02:52 +00:00
|
|
|
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
|
|
|
|
ASSERT_EQ("0,0,8", FilesPerLevel(0));
|
|
|
|
|
|
|
|
// When compacting from Ln -> Ln+1, cut a file if the file overlaps with
|
|
|
|
// more than three files in Ln+1.
|
|
|
|
options.max_compaction_bytes = options.target_file_size_base * 3;
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
GenerateNewRandomFile(&rnd);
|
|
|
|
// Add three more small files that overlap with the previous file
|
|
|
|
for (int i = 0; i < 3; i++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("a", "z"));
|
2016-06-16 23:02:52 +00:00
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2016-06-16 23:02:52 +00:00
|
|
|
|
Align compaction output file boundaries to the next level ones (#10655)
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.
In level compaction, there are "wasted" data at the beginning and end of the
output level files. Align the file boundary can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to its
next level ones. It may cut file when the file size is large enough (at least
50% of target_file_size) and not too large (2x target_file_size).
db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432
# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds
# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```
The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
baseline this_PR
count 1.656000e+03 1.705000e+03
mean 3.116062e+07 3.028076e+07
std 7.145242e+06 8.046139e+06
```
The feature is enabled by default, to revert to the old behavior disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`
Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for
use case like user adding 2 or more non-overlapping data range at the same
time, it can reduce the overlapping of 2 datasets in the lower levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655
Reviewed By: cbi42
Differential Revision: D39552321
Pulled By: jay-zhuang
fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
2022-09-30 02:43:55 +00:00
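A minimal sketch of opting out of this behavior, assuming only the option named in the summary above (availability depends on the RocksDB version in use):

```cpp
#include "rocksdb/options.h"

// Sketch: opt out of the boundary-aligned file cutting. The option name is
// taken from the summary above (AdvancedColumnFamilyOptions); everything else
// here is illustrative.
rocksdb::Options MakeLegacyFileCutOptions() {
  rocksdb::Options options;
  options.level_compaction_dynamic_file_size = false;
  return options;
}
```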
|
|
|
// Output files to L1 are cut to 4 pieces, according to
|
|
|
|
// options.max_compaction_bytes (300K)
|
|
|
|
// There are 8 files on L2 (grandparents level), each one is 100K. The first
|
|
|
|
// file overlaps with a and b, keeping the compaction under max_compaction_bytes (300K); the
|
|
|
|
// second one overlaps with d, e, which is also less than 300K. Including any
|
|
|
|
// extra grandparent file will make the future compaction larger than 300K.
|
|
|
|
// L1: [ 1 ] [ 2 ] [ 3 ] [ 4 ]
|
|
|
|
// L2: [a] [b] [c] [d] [e] [f] [g] [h]
|
|
|
|
ASSERT_EQ("0,4,8", FilesPerLevel(0));
|
2016-06-16 23:02:52 +00:00
|
|
|
}
|
|
|
|
|
2015-12-16 02:20:10 +00:00
|
|
|
static void UniqueIdCallback(void* arg) {
|
|
|
|
int* result = reinterpret_cast<int*>(arg);
|
|
|
|
if (*result == -1) {
|
|
|
|
*result = 0;
|
|
|
|
}
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearTrace();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2015-12-16 02:20:10 +00:00
|
|
|
"GetUniqueIdFromFile:FS_IOC_GETVERSION", UniqueIdCallback);
|
|
|
|
}
|
|
|
|
|
|
|
|
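// Minimal in-memory PersistentCache used by the tests below: pages are kept in
// a std::map keyed by page_key; when the accumulated size exceeds max_size_,
// one entry (the smallest key) is evicted before the new page is inserted.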
class MockPersistentCache : public PersistentCache {
|
|
|
|
public:
|
|
|
|
explicit MockPersistentCache(const bool is_compressed, const size_t max_size)
|
|
|
|
: is_compressed_(is_compressed), max_size_(max_size) {
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2015-12-16 02:20:10 +00:00
|
|
|
"GetUniqueIdFromFile:FS_IOC_GETVERSION", UniqueIdCallback);
|
|
|
|
}
|
|
|
|
|
2024-01-05 19:53:57 +00:00
|
|
|
~MockPersistentCache() override = default;
|
2015-12-16 02:20:10 +00:00
|
|
|
|
2016-11-22 01:22:01 +00:00
|
|
|
PersistentCache::StatsType Stats() override {
|
|
|
|
return PersistentCache::StatsType();
|
|
|
|
}
|
|
|
|
|
2020-06-13 20:26:03 +00:00
|
|
|
uint64_t NewId() override {
|
|
|
|
return last_id_.fetch_add(1, std::memory_order_relaxed);
|
|
|
|
}
|
|
|
|
|
2015-12-16 02:20:10 +00:00
|
|
|
Status Insert(const Slice& page_key, const char* data,
|
|
|
|
const size_t size) override {
|
|
|
|
MutexLock _(&lock_);
|
|
|
|
|
|
|
|
if (size_ > max_size_) {
|
|
|
|
size_ -= data_.begin()->second.size();
|
|
|
|
data_.erase(data_.begin());
|
|
|
|
}
|
|
|
|
|
|
|
|
data_.insert(std::make_pair(page_key.ToString(), std::string(data, size)));
|
|
|
|
size_ += size;
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Lookup(const Slice& page_key, std::unique_ptr<char[]>* data,
|
|
|
|
size_t* size) override {
|
|
|
|
MutexLock _(&lock_);
|
|
|
|
auto it = data_.find(page_key.ToString());
|
|
|
|
if (it == data_.end()) {
|
|
|
|
return Status::NotFound();
|
|
|
|
}
|
|
|
|
|
|
|
|
assert(page_key.ToString() == it->first);
|
|
|
|
data->reset(new char[it->second.size()]);
|
|
|
|
memcpy(data->get(), it->second.c_str(), it->second.size());
|
|
|
|
*size = it->second.size();
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool IsCompressed() override { return is_compressed_; }
|
|
|
|
|
2016-12-19 22:00:04 +00:00
|
|
|
std::string GetPrintableOptions() const override {
|
|
|
|
return "MockPersistentCache";
|
|
|
|
}
|
|
|
|
|
2015-12-16 02:20:10 +00:00
|
|
|
port::Mutex lock_;
|
|
|
|
std::map<std::string, std::string> data_;
|
|
|
|
const bool is_compressed_ = true;
|
|
|
|
size_t size_ = 0;
|
|
|
|
const size_t max_size_ = 10 * 1024; // 10KiB
|
2020-06-13 20:26:03 +00:00
|
|
|
std::atomic<uint64_t> last_id_{1};
|
2015-12-16 02:20:10 +00:00
|
|
|
};
|
|
|
|
|
2018-12-20 20:00:40 +00:00
|
|
|
#ifdef OS_LINUX
|
|
|
|
// Make sure that in CPU time perf context counters, Env::NowCPUNanos()
|
|
|
|
// is used, rather than Env::CPUNanos();
|
2019-03-26 23:20:52 +00:00
|
|
|
TEST_F(DBTest2, TestPerfContextGetCpuTime) {
|
2018-12-29 02:00:00 +00:00
|
|
|
// Force resizing the table cache so the table handle is not preloaded and
|
|
|
|
// we can measure find_table_nanos during Get().
|
|
|
|
dbfull()->TEST_table_cache()->SetCapacity(0);
|
2018-12-20 20:00:40 +00:00
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
env_->now_cpu_count_.store(0);
|
Fix+clean up handling of mock sleeps (#7101)
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.
This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.
More specifics:
Removed some unused test classes, and updated comments on the general
problem.
Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.
Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env
Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc
stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)
Intended follow-up:
Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)
With that change, we can simplify/consolidate the scattered work-arounds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
Test Plan: make check on Linux and MacOS
Reviewed By: zhichao-cao
Differential Revision: D23032815
Pulled By: pdillinger
fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
2020-08-11 19:39:49 +00:00
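For reference, a hedged sketch of the pattern this summary describes, written as it would appear in a DBTestBase-derived test; the SpecialEnv method names come from the summary and the code below, the rest is illustrative.

```cpp
// Sketch only: the mock-sleep pattern from the summary above, written as it
// would appear in a DBTestBase-derived test. SetMockSleep/MockSleepForSeconds
// are the SpecialEnv calls used below; the test body itself is illustrative.
TEST_F(DBTest2, MockSleepUsageSketch) {
  env_->SetMockSleep();            // required before any mock-sleep call
  ASSERT_OK(Put("key", "value"));
  env_->MockSleepForSeconds(100);  // advances mock time, no real sleeping
  ASSERT_EQ("value", Get("key"));
  // SetTimeElapseOnlySleepOnReopen (see summary) additionally stops counting
  // real wall-clock time; it only takes effect on the next DB (re)open.
}
```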
|
|
|
env_->SetMockSleep();
|
|
|
|
|
|
|
|
// NOTE: Presumed unnecessary and removed: resetting mock time in env
|
2018-12-20 20:00:40 +00:00
|
|
|
|
|
|
|
// CPU timing is not enabled with kEnableTimeExceptForMutex
|
|
|
|
SetPerfLevel(PerfLevel::kEnableTimeExceptForMutex);
|
|
|
|
ASSERT_EQ("bar", Get("foo"));
|
|
|
|
ASSERT_EQ(0, get_perf_context()->get_cpu_nanos);
|
|
|
|
ASSERT_EQ(0, env_->now_cpu_count_.load());
|
|
|
|
|
2020-08-11 19:39:49 +00:00
|
|
|
constexpr uint64_t kDummyAddonSeconds = uint64_t{1000000};
|
|
|
|
constexpr uint64_t kDummyAddonNanos = 1000000000U * kDummyAddonSeconds;
|
2018-12-20 20:00:40 +00:00
|
|
|
|
|
|
|
// Add time to NowNanos() reading.
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2018-12-20 20:00:40 +00:00
|
|
|
"TableCache::FindTable:0",
|
2020-08-11 19:39:49 +00:00
|
|
|
[&](void* /*arg*/) { env_->MockSleepForSeconds(kDummyAddonSeconds); });
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2018-12-20 20:00:40 +00:00
|
|
|
|
|
|
|
SetPerfLevel(PerfLevel::kEnableTimeAndCPUTimeExceptForMutex);
|
|
|
|
ASSERT_EQ("bar", Get("foo"));
|
2019-01-30 00:23:21 +00:00
|
|
|
ASSERT_GT(env_->now_cpu_count_.load(), 2);
|
2020-08-11 19:39:49 +00:00
|
|
|
ASSERT_LT(get_perf_context()->get_cpu_nanos, kDummyAddonNanos);
|
|
|
|
ASSERT_GT(get_perf_context()->find_table_nanos, kDummyAddonNanos);
|
2018-12-20 20:00:40 +00:00
|
|
|
|
|
|
|
SetPerfLevel(PerfLevel::kDisable);
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2018-12-20 20:00:40 +00:00
|
|
|
}
|
2019-03-26 23:20:52 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, TestPerfContextIterCpuTime) {
|
|
|
|
DestroyAndReopen(CurrentOptions());
|
|
|
|
// Force resizing the table cache so the table handle is not preloaded and
|
|
|
|
// we can measure find_table_nanos during iteration
|
|
|
|
dbfull()->TEST_table_cache()->SetCapacity(0);
|
|
|
|
|
|
|
|
const size_t kNumEntries = 10;
|
|
|
|
for (size_t i = 0; i < kNumEntries; ++i) {
|
2022-05-06 20:03:58 +00:00
|
|
|
ASSERT_OK(Put("k" + std::to_string(i), "v" + std::to_string(i)));
|
2019-03-26 23:20:52 +00:00
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
for (size_t i = 0; i < kNumEntries; ++i) {
|
2022-05-06 20:03:58 +00:00
|
|
|
ASSERT_EQ("v" + std::to_string(i), Get("k" + std::to_string(i)));
|
2019-03-26 23:20:52 +00:00
|
|
|
}
|
2022-05-06 20:03:58 +00:00
|
|
|
std::string last_key = "k" + std::to_string(kNumEntries - 1);
|
|
|
|
std::string last_value = "v" + std::to_string(kNumEntries - 1);
|
2019-03-26 23:20:52 +00:00
|
|
|
env_->now_cpu_count_.store(0);
|
2020-08-11 19:39:49 +00:00
|
|
|
env_->SetMockSleep();
|
|
|
|
|
|
|
|
// NOTE: Presumed unnecessary and removed: resetting mock time in env
|
2019-03-26 23:20:52 +00:00
|
|
|
|
|
|
|
// CPU timing is not enabled with kEnableTimeExceptForMutex
|
|
|
|
SetPerfLevel(PerfLevel::kEnableTimeExceptForMutex);
|
|
|
|
Iterator* iter = db_->NewIterator(ReadOptions());
|
|
|
|
iter->Seek("k0");
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("v0", iter->value().ToString());
|
|
|
|
iter->SeekForPrev(last_key);
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
iter->SeekToLast();
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ(last_value, iter->value().ToString());
|
|
|
|
iter->SeekToFirst();
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("v0", iter->value().ToString());
|
|
|
|
ASSERT_EQ(0, get_perf_context()->iter_seek_cpu_nanos);
|
|
|
|
iter->Next();
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("v1", iter->value().ToString());
|
|
|
|
ASSERT_EQ(0, get_perf_context()->iter_next_cpu_nanos);
|
|
|
|
iter->Prev();
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iter->status());
|
2019-03-26 23:20:52 +00:00
|
|
|
ASSERT_EQ("v0", iter->value().ToString());
|
|
|
|
ASSERT_EQ(0, get_perf_context()->iter_prev_cpu_nanos);
|
|
|
|
ASSERT_EQ(0, env_->now_cpu_count_.load());
|
|
|
|
delete iter;
|
|
|
|
|
2020-08-11 19:39:49 +00:00
|
|
|
constexpr uint64_t kDummyAddonSeconds = uint64_t{1000000};
|
|
|
|
constexpr uint64_t kDummyAddonNanos = 1000000000U * kDummyAddonSeconds;
|
2019-03-26 23:20:52 +00:00
|
|
|
|
|
|
|
// Add time to NowNanos() reading.
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2019-03-26 23:20:52 +00:00
|
|
|
"TableCache::FindTable:0",
|
2020-08-11 19:39:49 +00:00
|
|
|
[&](void* /*arg*/) { env_->MockSleepForSeconds(kDummyAddonSeconds); });
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2019-03-26 23:20:52 +00:00
|
|
|
|
|
|
|
SetPerfLevel(PerfLevel::kEnableTimeAndCPUTimeExceptForMutex);
|
|
|
|
iter = db_->NewIterator(ReadOptions());
|
|
|
|
iter->Seek("k0");
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("v0", iter->value().ToString());
|
|
|
|
iter->SeekForPrev(last_key);
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
iter->SeekToLast();
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ(last_value, iter->value().ToString());
|
|
|
|
iter->SeekToFirst();
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("v0", iter->value().ToString());
|
|
|
|
ASSERT_GT(get_perf_context()->iter_seek_cpu_nanos, 0);
|
2020-08-11 19:39:49 +00:00
|
|
|
ASSERT_LT(get_perf_context()->iter_seek_cpu_nanos, kDummyAddonNanos);
|
2019-03-26 23:20:52 +00:00
|
|
|
iter->Next();
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("v1", iter->value().ToString());
|
|
|
|
ASSERT_GT(get_perf_context()->iter_next_cpu_nanos, 0);
|
Fix+clean up handling of mock sleeps (#7101)
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.
This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.
More specifics:
Removed some unused test classes, and updated comments on the general
problem.
Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.
Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env
Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc
stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)
Intended follow-up:
Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)
With that change, we can simplify/consolidate the scattered work-arounds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
Test Plan: make check on Linux and MacOS
Reviewed By: zhichao-cao
Differential Revision: D23032815
Pulled By: pdillinger
fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
2020-08-11 19:39:49 +00:00
|
|
|
ASSERT_LT(get_perf_context()->iter_next_cpu_nanos, kDummyAddonNanos);
|
2019-03-26 23:20:52 +00:00
|
|
|
iter->Prev();
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iter->status());
|
2019-03-26 23:20:52 +00:00
|
|
|
ASSERT_EQ("v0", iter->value().ToString());
|
|
|
|
ASSERT_GT(get_perf_context()->iter_prev_cpu_nanos, 0);
|
Fix+clean up handling of mock sleeps (#7101)
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.
This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.
More specifics:
Removed some unused test classes, and updated comments on the general
problem.
Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.
Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env
Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc
stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)
Intended follow-up:
Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)
With that change, we can simplify/consolidate the scattered work-arounds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
Test Plan: make check on Linux and MacOS
Reviewed By: zhichao-cao
Differential Revision: D23032815
Pulled By: pdillinger
fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
2020-08-11 19:39:49 +00:00
|
|
|
ASSERT_LT(get_perf_context()->iter_prev_cpu_nanos, kDummyAddonNanos);
|
2019-03-26 23:20:52 +00:00
|
|
|
ASSERT_GE(env_->now_cpu_count_.load(), 12);
|
Fix+clean up handling of mock sleeps (#7101)
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.
This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.
More specifics:
Removed some unused test classes, and updated comments on the general
problem.
Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.
Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env
Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc
stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)
Intended follow-up:
Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)
With that change, we can simplify/consolidate the scattered work-arounds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
Test Plan: make check on Linux and MacOS
Reviewed By: zhichao-cao
Differential Revision: D23032815
Pulled By: pdillinger
fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
2020-08-11 19:39:49 +00:00
|
|
|
ASSERT_GT(get_perf_context()->find_table_nanos, kDummyAddonNanos);
|
2019-03-26 23:20:52 +00:00
|
|
|
|
|
|
|
SetPerfLevel(PerfLevel::kDisable);
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2019-03-26 23:20:52 +00:00
|
|
|
delete iter;
|
|
|
|
}

#endif  // OS_LINUX

#if !defined OS_SOLARIS
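// PersistentCache plugs a MockPersistentCache in as the block-level
// persistent cache, with and without an in-memory block cache and for both
// compressed and uncompressed entries, and verifies that the
// PERSISTENT_CACHE_HIT and PERSISTENT_CACHE_MISS tickers both advance.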
TEST_F(DBTest2, PersistentCache) {
  int num_iter = 80;

  Options options;
  options.write_buffer_size = 64 * 1024;  // small write buffer
  options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
  options = CurrentOptions(options);

  auto bsizes = {/*no block cache*/ 0, /*1M*/ 1 * 1024 * 1024};
  auto types = {/*compressed*/ 1, /*uncompressed*/ 0};
  for (auto bsize : bsizes) {
    for (auto type : types) {
      BlockBasedTableOptions table_options;
      table_options.persistent_cache.reset(
          new MockPersistentCache(type, 10 * 1024));
      table_options.no_block_cache = true;
      table_options.block_cache = bsize ? NewLRUCache(bsize) : nullptr;
      options.table_factory.reset(NewBlockBasedTableFactory(table_options));

      DestroyAndReopen(options);
      CreateAndReopenWithCF({"pikachu"}, options);
      // default column family doesn't have block cache
      Options no_block_cache_opts;
      no_block_cache_opts.statistics = options.statistics;
      no_block_cache_opts = CurrentOptions(no_block_cache_opts);
      BlockBasedTableOptions table_options_no_bc;
      table_options_no_bc.no_block_cache = true;
      no_block_cache_opts.table_factory.reset(
          NewBlockBasedTableFactory(table_options_no_bc));
      ReopenWithColumnFamilies(
          {"default", "pikachu"},
          std::vector<Options>({no_block_cache_opts, options}));

      Random rnd(301);

      // Write 80 keys, each with a 1000-byte value (a fresh random value is
      // generated every 4 keys, so many blocks compress well)
      ASSERT_EQ(NumTableFilesAtLevel(0, 1), 0);
      std::vector<std::string> values;
      std::string str;
      for (int i = 0; i < num_iter; i++) {
        if (i % 4 == 0) {  // high compression ratio
          str = rnd.RandomString(1000);
        }
        values.push_back(str);
        ASSERT_OK(Put(1, Key(i), values[i]));
      }

      // flush all data from memtable so that reads are from block cache
      ASSERT_OK(Flush(1));

      for (int i = 0; i < num_iter; i++) {
        ASSERT_EQ(Get(1, Key(i)), values[i]);
      }

      auto hit = options.statistics->getTickerCount(PERSISTENT_CACHE_HIT);
      auto miss = options.statistics->getTickerCount(PERSISTENT_CACHE_MISS);

      ASSERT_GT(hit, 0);
      ASSERT_GT(miss, 0);
    }
  }
}

#endif  // !defined OS_SOLARIS

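// Helper for SyncPointMarker below: fires the "DBTest2::MarkedPoint" sync
// point so the test can count how many times the marked point is reached.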
namespace {
void CountSyncPoint() {
  TEST_SYNC_POINT_CALLBACK("DBTest2::MarkedPoint", nullptr /* arg */);
}
}  // anonymous namespace

TEST_F(DBTest2, SyncPointMarker) {
  std::atomic<int> sync_point_called(0);
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "DBTest2::MarkedPoint",
      [&](void* /*arg*/) { sync_point_called.fetch_add(1); });

  // The first dependency enforces Marker can be loaded before MarkedPoint.
  // The second checks that thread 1's MarkedPoint should be disabled here.
  // Execution order:
  // |   Thread 1  |  Thread 2   |
  // |             |   Marker    |
  // | MarkedPoint |             |
  // | Thread1First|             |
  // |             | MarkedPoint |
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependencyAndMarkers(
      {{"DBTest2::SyncPointMarker:Thread1First", "DBTest2::MarkedPoint"}},
      {{"DBTest2::SyncPointMarker:Marker", "DBTest2::MarkedPoint"}});

  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();

  std::function<void()> func1 = [&]() {
    CountSyncPoint();
    TEST_SYNC_POINT("DBTest2::SyncPointMarker:Thread1First");
  };

  std::function<void()> func2 = [&]() {
    TEST_SYNC_POINT("DBTest2::SyncPointMarker:Marker");
    CountSyncPoint();
  };

  auto thread1 = port::Thread(func1);
  auto thread2 = port::Thread(func2);
  thread1.join();
  thread2.join();

  // Callback is only executed once
  ASSERT_EQ(sync_point_called.load(), 1);
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
}

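// Approximates the size of a block-encoded entry: three varint32 prefixes
// (shared key bytes = 0, key size, value size) plus the raw key and value
// bytes. Used by the ReadAmp* tests to compute the bytes a read truly needed.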
size_t GetEncodedEntrySize(size_t key_size, size_t value_size) {
  std::string buffer;

  PutVarint32(&buffer, static_cast<uint32_t>(0));
  PutVarint32(&buffer, static_cast<uint32_t>(key_size));
  PutVarint32(&buffer, static_cast<uint32_t>(value_size));

  return buffer.size() + key_size + value_size;
}

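// ReadAmpBitmap writes 10k small entries, reads them back in random order,
// and checks that the read-amplification estimate produced with
// read_amp_bytes_per_bit of 1 and 16 stays within ~2% of the exact byte
// count computed via GetEncodedEntrySize().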
TEST_F(DBTest2, ReadAmpBitmap) {
  Options options = CurrentOptions();
  BlockBasedTableOptions bbto;
  uint32_t bytes_per_bit[2] = {1, 16};
  for (size_t k = 0; k < 2; k++) {
    // Disable delta encoding to make it easier to calculate read amplification
    bbto.use_delta_encoding = false;
    // Huge block cache to make it easier to calculate read amplification
    bbto.block_cache = NewLRUCache(1024 * 1024 * 1024);
    bbto.read_amp_bytes_per_bit = bytes_per_bit[k];
    options.table_factory.reset(NewBlockBasedTableFactory(bbto));
    options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
    DestroyAndReopen(options);

    const size_t kNumEntries = 10000;

    Random rnd(301);
    for (size_t i = 0; i < kNumEntries; i++) {
      ASSERT_OK(Put(Key(static_cast<int>(i)), rnd.RandomString(100)));
    }
    ASSERT_OK(Flush());

    Close();
    Reopen(options);

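    // expected_read_amp below is the exact number of encoded bytes the test
    // has actually needed so far, while read_amp is the ratio reported by the
    // bitmap-based tickers (estimated useful bytes / total bytes loaded); the
    // two are compared on every iteration.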
    // Read keys/values randomly and verify that reported read amp error
    // is less than 2%
    uint64_t total_useful_bytes = 0;
    std::set<int> read_keys;
    std::string value;
    for (size_t i = 0; i < kNumEntries * 5; i++) {
      int key_idx = rnd.Next() % kNumEntries;
      std::string key = Key(key_idx);
      ASSERT_OK(db_->Get(ReadOptions(), key, &value));

      if (read_keys.find(key_idx) == read_keys.end()) {
        auto internal_key = InternalKey(key, 0, ValueType::kTypeValue);
        total_useful_bytes +=
            GetEncodedEntrySize(internal_key.size(), value.size());
        read_keys.insert(key_idx);
      }

      double expected_read_amp =
          static_cast<double>(total_useful_bytes) /
          options.statistics->getTickerCount(READ_AMP_TOTAL_READ_BYTES);

      double read_amp =
          static_cast<double>(options.statistics->getTickerCount(
              READ_AMP_ESTIMATE_USEFUL_BYTES)) /
          options.statistics->getTickerCount(READ_AMP_TOTAL_READ_BYTES);

      double error_pct = fabs(expected_read_amp - read_amp) * 100;
      // Error between reported read amp and real read amp should be less than
      // 2%
      EXPECT_LE(error_pct, 2);
    }

    // Make sure we read everything in the DB (which is smaller than our cache)
    Iterator* iter = db_->NewIterator(ReadOptions());
    for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
      ASSERT_EQ(iter->value().ToString(), Get(iter->key().ToString()));
    }
    ASSERT_OK(iter->status());
    delete iter;

    // Read amp is on average 100% since we read all that we loaded into memory
    if (k == 0) {
      ASSERT_EQ(
          options.statistics->getTickerCount(READ_AMP_ESTIMATE_USEFUL_BYTES),
          options.statistics->getTickerCount(READ_AMP_TOTAL_READ_BYTES));
    } else {
      ASSERT_NEAR(
          options.statistics->getTickerCount(READ_AMP_ESTIMATE_USEFUL_BYTES) *
              1.0f /
              options.statistics->getTickerCount(READ_AMP_TOTAL_READ_BYTES),
          1, .01);
    }
  }
}

#ifndef OS_SOLARIS  // GetUniqueIdFromFile is not implemented
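// Verifies that read-amp bitmaps attached to blocks which stay alive in the
// LRU block cache across Close()/Reopen() keep accounting correctly, even
// after the Statistics object they originally reported to is replaced.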
TEST_F(DBTest2, ReadAmpBitmapLiveInCacheAfterDBClose) {
  {
    const int kIdBufLen = 100;
    char id_buf[kIdBufLen];
    Status s = Status::NotSupported();
#ifndef OS_WIN
    // You can't open a directory on windows using random access file
    std::unique_ptr<RandomAccessFile> file;
    s = env_->NewRandomAccessFile(dbname_, &file, EnvOptions());
    if (s.ok()) {
      if (file->GetUniqueId(id_buf, kIdBufLen) == 0) {
        // fs holding db directory doesn't support getting a unique file id,
        // this means that running this test will fail because lru_cache will
        // load the blocks again regardless of them being already in the cache
        return;
      }
    }
#endif
    if (!s.ok()) {
      std::unique_ptr<Directory> dir;
      ASSERT_OK(env_->NewDirectory(dbname_, &dir));
      if (dir->GetUniqueId(id_buf, kIdBufLen) == 0) {
        // fs holding db directory doesn't support getting a unique file id,
        // this means that running this test will fail because lru_cache will
        // load the blocks again regardless of them being already in the cache
        return;
      }
    }
  }
  uint32_t bytes_per_bit[2] = {1, 16};
  for (size_t k = 0; k < 2; k++) {
    std::shared_ptr<Cache> lru_cache = NewLRUCache(1024 * 1024 * 1024);
    std::shared_ptr<Statistics> stats =
        ROCKSDB_NAMESPACE::CreateDBStatistics();

    Options options = CurrentOptions();
    BlockBasedTableOptions bbto;
    // Disable delta encoding to make it easier to calculate read amplification
    bbto.use_delta_encoding = false;
    // Huge block cache to make it easier to calculate read amplification
    bbto.block_cache = lru_cache;
    bbto.read_amp_bytes_per_bit = bytes_per_bit[k];
    options.table_factory.reset(NewBlockBasedTableFactory(bbto));
    options.statistics = stats;
    DestroyAndReopen(options);

    const int kNumEntries = 10000;

    Random rnd(301);
    for (int i = 0; i < kNumEntries; i++) {
      ASSERT_OK(Put(Key(i), rnd.RandomString(100)));
    }
    ASSERT_OK(Flush());

    Close();
    Reopen(options);

    std::set<int> read_keys;
    std::string value;
    // Iter1: Read half the DB, Read even keys
    // Key(0), Key(2), Key(4), Key(6), Key(8), ...
    for (int i = 0; i < kNumEntries; i += 2) {
      std::string key = Key(i);
      ASSERT_OK(db_->Get(ReadOptions(), key, &value));

      if (read_keys.find(i) == read_keys.end()) {
        auto internal_key = InternalKey(key, 0, ValueType::kTypeValue);
        read_keys.insert(i);
      }
    }

    size_t total_useful_bytes_iter1 =
        options.statistics->getTickerCount(READ_AMP_ESTIMATE_USEFUL_BYTES);
    size_t total_loaded_bytes_iter1 =
        options.statistics->getTickerCount(READ_AMP_TOTAL_READ_BYTES);

    Close();
    std::shared_ptr<Statistics> new_statistics =
        ROCKSDB_NAMESPACE::CreateDBStatistics();
    // Destroy old statistics obj that the blocks in lru_cache are pointing to
    options.statistics.reset();
    // Use the statistics object that we just created
    options.statistics = new_statistics;
    Reopen(options);

    // Iter2: Read half the DB, Read odd keys
    // Key(1), Key(3), Key(5), Key(7), Key(9), ...
    for (int i = 1; i < kNumEntries; i += 2) {
      std::string key = Key(i);
      ASSERT_OK(db_->Get(ReadOptions(), key, &value));

      if (read_keys.find(i) == read_keys.end()) {
        auto internal_key = InternalKey(key, 0, ValueType::kTypeValue);
        read_keys.insert(i);
      }
    }

    size_t total_useful_bytes_iter2 =
        options.statistics->getTickerCount(READ_AMP_ESTIMATE_USEFUL_BYTES);
    size_t total_loaded_bytes_iter2 =
        options.statistics->getTickerCount(READ_AMP_TOTAL_READ_BYTES);

    // Read amp is on average 100% since we read all that we loaded into memory
    if (k == 0) {
      ASSERT_EQ(total_useful_bytes_iter1 + total_useful_bytes_iter2,
                total_loaded_bytes_iter1 + total_loaded_bytes_iter2);
    } else {
      ASSERT_NEAR((total_useful_bytes_iter1 + total_useful_bytes_iter2) * 1.0f /
                      (total_loaded_bytes_iter1 + total_loaded_bytes_iter2),
                  1, .01);
    }
  }
}
#endif  // !OS_SOLARIS

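// While a manual compaction over the two L2 files is running, the sync point
// callback below flushes two more files and lowers the compaction trigger so
// automatic compaction moves them down into the same key range; the test also
// cross-checks FilesPerLevel() against the GetMapProperty() per-level stats.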
TEST_F(DBTest2, AutomaticCompactionOverlapManualCompaction) {
  Options options = CurrentOptions();
  options.num_levels = 3;
  options.IncreaseParallelism(20);
  DestroyAndReopen(options);

  ASSERT_OK(Put(Key(0), "a"));
  ASSERT_OK(Put(Key(5), "a"));
  ASSERT_OK(Flush());

  ASSERT_OK(Put(Key(10), "a"));
  ASSERT_OK(Put(Key(15), "a"));
  ASSERT_OK(Flush());

  CompactRangeOptions cro;
  cro.change_level = true;
  cro.target_level = 2;
  ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));

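  // get_stat extracts one per-level compaction statistic from the map
  // returned by GetMapProperty("rocksdb.cfstats"), returning 0 when the
  // property is missing.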
  auto get_stat = [](std::string level_str, LevelStatType type,
                     std::map<std::string, std::string> props) {
    auto prop_str =
        "compaction." + level_str + "." +
        InternalStats::compaction_level_stats.at(type).property_name.c_str();
    auto prop_item = props.find(prop_str);
    return prop_item == props.end() ? 0 : std::stod(prop_item->second);
  };

  // Trivial move 2 files to L2
  ASSERT_EQ("0,0,2", FilesPerLevel());
  // Also test that the stats GetMapProperty API reports the same result
  {
    std::map<std::string, std::string> prop;
    ASSERT_TRUE(dbfull()->GetMapProperty("rocksdb.cfstats", &prop));
    ASSERT_EQ(0, get_stat("L0", LevelStatType::NUM_FILES, prop));
    ASSERT_EQ(0, get_stat("L1", LevelStatType::NUM_FILES, prop));
    ASSERT_EQ(2, get_stat("L2", LevelStatType::NUM_FILES, prop));
    ASSERT_EQ(2, get_stat("Sum", LevelStatType::NUM_FILES, prop));
  }

  // While the compaction is running, we will create 2 new files that
  // can fit in L2, these 2 files will be moved to L2 and overlap with
  // the running compaction and break the LSM consistency.
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
      "CompactionJob::Run():Start", [&](void* /*arg*/) {
        ASSERT_OK(
            dbfull()->SetOptions({{"level0_file_num_compaction_trigger", "2"},
                                  {"max_bytes_for_level_base", "1"}}));
        ASSERT_OK(Put(Key(6), "a"));
        ASSERT_OK(Put(Key(7), "a"));
        ASSERT_OK(Flush());

        ASSERT_OK(Put(Key(8), "a"));
        ASSERT_OK(Put(Key(9), "a"));
        ASSERT_OK(Flush());
      });
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();

  // Run a manual compaction that will compact the 2 files in L2
  // into 1 file in L2
  cro.exclusive_manual_compaction = false;
  cro.bottommost_level_compaction = BottommostLevelCompaction::kForceOptimized;
  ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));

  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();

  // Test that the stats GetMapProperty API reports 1 file in L2
  {
    std::map<std::string, std::string> prop;
    ASSERT_TRUE(dbfull()->GetMapProperty("rocksdb.cfstats", &prop));
    ASSERT_EQ(1, get_stat("L2", LevelStatType::NUM_FILES, prop));
  }
}

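// Two overlapping non-exclusive manual compactions: a sync point callback
// flushes extra L0 files while the first CompactRange() is running and kicks
// off a second manual compaction from a background thread, which has to wait
// for the first one to finish.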
TEST_F(DBTest2, ManualCompactionOverlapManualCompaction) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.num_levels = 2;
|
|
|
|
options.IncreaseParallelism(20);
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put(Key(0), "a"));
|
|
|
|
ASSERT_OK(Put(Key(5), "a"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_OK(Put(Key(10), "a"));
|
|
|
|
ASSERT_OK(Put(Key(15), "a"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
|
|
|
|
|
|
|
|
// Trivial move 2 files to L1
|
|
|
|
ASSERT_EQ("0,2", FilesPerLevel());
|
|
|
|
|
|
|
|
std::function<void()> bg_manual_compact = [&]() {
|
|
|
|
std::string k1 = Key(6);
|
|
|
|
std::string k2 = Key(9);
|
|
|
|
Slice k1s(k1);
|
|
|
|
Slice k2s(k2);
|
|
|
|
CompactRangeOptions cro;
|
|
|
|
cro.exclusive_manual_compaction = false;
|
|
|
|
ASSERT_OK(db_->CompactRange(cro, &k1s, &k2s));
|
|
|
|
};
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::port::Thread bg_thread;
|
2016-10-13 17:49:06 +00:00
|
|
|
|
|
|
|
// While the compaction is running, we will create 2 new files that
|
|
|
|
// can fit in L1. These 2 files will be moved to L1 and overlap with
|
|
|
|
// the running compaction and break the LSM consistency.
|
|
|
|
std::atomic<bool> flag(false);
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2018-04-13 00:55:14 +00:00
|
|
|
"CompactionJob::Run():Start", [&](void* /*arg*/) {
|
2016-10-13 17:49:06 +00:00
|
|
|
if (flag.exchange(true)) {
|
|
|
|
// We want to make sure to call this callback only once
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
ASSERT_OK(Put(Key(6), "a"));
|
|
|
|
ASSERT_OK(Put(Key(7), "a"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_OK(Put(Key(8), "a"));
|
|
|
|
ASSERT_OK(Put(Key(9), "a"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
// Start a non-exclusive manual compaction in a bg thread
|
2017-02-06 22:43:55 +00:00
|
|
|
bg_thread = port::Thread(bg_manual_compact);
|
2016-10-13 17:49:06 +00:00
|
|
|
// This manual compaction conflicts with the other manual compaction,
|
|
|
|
// so it should wait until the first compaction finishes.
|
|
|
|
env_->SleepForMicroseconds(1000000);
|
|
|
|
});
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2016-10-13 17:49:06 +00:00
|
|
|
|
|
|
|
// Run a manual compaction that will compact the 2 files in L1
|
|
|
|
// into 1 file in L1
|
|
|
|
CompactRangeOptions cro;
|
|
|
|
cro.exclusive_manual_compaction = false;
|
2019-04-17 06:29:32 +00:00
|
|
|
cro.bottommost_level_compaction = BottommostLevelCompaction::kForceOptimized;
|
2016-10-13 17:49:06 +00:00
|
|
|
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
|
|
|
|
bg_thread.join();
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2016-10-13 17:49:06 +00:00
|
|
|
}
|
2017-01-20 18:43:59 +00:00
|
|
|
|
2019-09-17 04:00:13 +00:00
|
|
|
TEST_F(DBTest2, PausingManualCompaction1) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.num_levels = 7;
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
Random rnd(301);
|
|
|
|
// Generate a file containing 10 keys.
|
|
|
|
for (int i = 0; i < 10; i++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
ASSERT_OK(Put(Key(i), rnd.RandomString(50)));
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
// Generate another file containing the same keys
|
|
|
|
for (int i = 0; i < 10; i++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
ASSERT_OK(Put(Key(i), rnd.RandomString(50)));
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
int manual_compactions_paused = 0;
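// The callback below cancels the manual compaction as soon as
// CompactionJob::Run() starts, by flipping the canceled flag the sync point
// hands in as its argument.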
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2019-09-17 04:00:13 +00:00
|
|
|
"CompactionJob::Run():PausingManualCompaction:1", [&](void* arg) {
|
2022-06-07 01:32:26 +00:00
|
|
|
auto canceled = static_cast<std::atomic<bool>*>(arg);
|
|
|
|
// CompactRange triggers the manual compaction; cancel the compaction
|
|
|
|
// by setting *canceled to true
|
|
|
|
if (canceled != nullptr) {
|
|
|
|
canceled->store(true, std::memory_order_release);
|
|
|
|
}
|
|
|
|
manual_compactions_paused += 1;
|
|
|
|
});
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"TestCompactFiles:PausingManualCompaction:3", [&](void* arg) {
|
2020-08-14 18:28:12 +00:00
|
|
|
auto paused = static_cast<std::atomic<int>*>(arg);
|
2022-06-07 01:32:26 +00:00
|
|
|
// CompactFiles() relies on manual_compactions_paused to
|
|
|
|
// determine if this compaction should be paused or not
|
2020-08-14 18:28:12 +00:00
|
|
|
ASSERT_EQ(0, paused->load(std::memory_order_acquire));
|
|
|
|
paused->fetch_add(1, std::memory_order_release);
|
2019-09-17 04:00:13 +00:00
|
|
|
});
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2019-09-17 04:00:13 +00:00
|
|
|
|
|
|
|
std::vector<std::string> files_before_compact, files_after_compact;
|
|
|
|
// Remember file names before compaction is triggered
|
|
|
|
std::vector<LiveFileMetaData> files_meta;
|
|
|
|
dbfull()->GetLiveFilesMetaData(&files_meta);
|
2024-01-05 19:53:57 +00:00
|
|
|
for (const auto& file : files_meta) {
|
2019-09-17 04:00:13 +00:00
|
|
|
files_before_compact.push_back(file.name);
|
|
|
|
}
|
|
|
|
|
|
|
|
// OK, now trigger a manual compaction
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactRange(CompactRangeOptions(), nullptr, nullptr)
|
|
|
|
.IsManualCompactionPaused());
|
2019-09-17 04:00:13 +00:00
|
|
|
|
|
|
|
// Wait for compactions to get scheduled and stopped
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2019-09-17 04:00:13 +00:00
|
|
|
|
|
|
|
// Get file names after compaction is stopped
|
|
|
|
files_meta.clear();
|
|
|
|
dbfull()->GetLiveFilesMetaData(&files_meta);
|
2024-01-05 19:53:57 +00:00
|
|
|
for (const auto& file : files_meta) {
|
2019-09-17 04:00:13 +00:00
|
|
|
files_after_compact.push_back(file.name);
|
|
|
|
}
|
|
|
|
|
|
|
|
// As if nothing happened: the live file set is unchanged
|
|
|
|
ASSERT_EQ(files_before_compact, files_after_compact);
|
|
|
|
ASSERT_EQ(manual_compactions_paused, 1);
|
|
|
|
|
|
|
|
manual_compactions_paused = 0;
|
|
|
|
// Now make sure CompactFiles also does not run
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactFiles(ROCKSDB_NAMESPACE::CompactionOptions(),
|
|
|
|
files_before_compact, 0)
|
|
|
|
.IsManualCompactionPaused());
|
2019-09-17 04:00:13 +00:00
|
|
|
// Wait for manual compaction to get scheduled and finish
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2019-09-17 04:00:13 +00:00
|
|
|
|
|
|
|
files_meta.clear();
|
|
|
|
files_after_compact.clear();
|
|
|
|
dbfull()->GetLiveFilesMetaData(&files_meta);
|
2024-01-05 19:53:57 +00:00
|
|
|
for (const auto& file : files_meta) {
|
2019-09-17 04:00:13 +00:00
|
|
|
files_after_compact.push_back(file.name);
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT_EQ(files_before_compact, files_after_compact);
|
|
|
|
// CompactFiles returns at its entry point, so the pause callback never fires
|
|
|
|
ASSERT_EQ(manual_compactions_paused, 0);
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// PausingManualCompaction does not affect auto compaction
|
|
|
|
TEST_F(DBTest2, PausingManualCompaction2) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.level0_file_num_compaction_trigger = 2;
|
|
|
|
options.disable_auto_compactions = false;
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
dbfull()->DisableManualCompaction();
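// Manual compaction is disabled, but auto compaction remains enabled and
// should still merge the two L0 files generated below into one.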
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
for (int i = 0; i < 2; i++) {
|
2022-06-07 01:32:26 +00:00
|
|
|
// Generate a file containing 100 keys.
|
2019-09-17 04:00:13 +00:00
|
|
|
for (int j = 0; j < 100; j++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
ASSERT_OK(Put(Key(j), rnd.RandomString(50)));
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2019-09-17 04:00:13 +00:00
|
|
|
|
|
|
|
std::vector<LiveFileMetaData> files_meta;
|
|
|
|
dbfull()->GetLiveFilesMetaData(&files_meta);
|
|
|
|
ASSERT_EQ(files_meta.size(), 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_F(DBTest2, PausingManualCompaction3) {
|
|
|
|
CompactRangeOptions compact_options;
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.num_levels = 7;
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
auto generate_files = [&]() {
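// Fill every level: for level i, write (num_levels - i + 1) files of 1000
// keys each, then push them down with repeated MoveFilesToLevel() calls.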
|
|
|
|
for (int i = 0; i < options.num_levels; i++) {
|
2019-09-19 19:32:33 +00:00
|
|
|
for (int j = 0; j < options.num_levels - i + 1; j++) {
|
2019-09-17 04:00:13 +00:00
|
|
|
for (int k = 0; k < 1000; k++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
ASSERT_OK(Put(Key(k + j * 1000), rnd.RandomString(50)));
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
|
2019-09-19 19:32:33 +00:00
|
|
|
for (int l = 1; l < options.num_levels - i; l++) {
|
2019-09-17 04:00:13 +00:00
|
|
|
MoveFilesToLevel(l);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
generate_files();
|
|
|
|
ASSERT_EQ("2,3,4,5,6,7,8", FilesPerLevel());
|
|
|
|
int run_manual_compactions = 0;
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2019-09-19 19:32:33 +00:00
|
|
|
"CompactionJob::Run():PausingManualCompaction:1",
|
|
|
|
[&](void* /*arg*/) { run_manual_compactions++; });
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2019-09-17 04:00:13 +00:00
|
|
|
|
|
|
|
dbfull()->DisableManualCompaction();
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactRange(compact_options, nullptr, nullptr)
|
|
|
|
.IsManualCompactionPaused());
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2019-09-17 04:00:13 +00:00
|
|
|
// As manual compaction is disabled, the sync point is never reached
|
|
|
|
ASSERT_EQ(run_manual_compactions, 0);
|
|
|
|
ASSERT_EQ("2,3,4,5,6,7,8", FilesPerLevel());
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
2019-09-17 04:00:13 +00:00
|
|
|
"CompactionJob::Run():PausingManualCompaction:1");
|
|
|
|
dbfull()->EnableManualCompaction();
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->CompactRange(compact_options, nullptr, nullptr));
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2019-09-17 04:00:13 +00:00
|
|
|
ASSERT_EQ("0,0,0,0,0,0,2", FilesPerLevel());
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
TEST_F(DBTest2, PausingManualCompaction4) {
|
|
|
|
CompactRangeOptions compact_options;
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.num_levels = 7;
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
auto generate_files = [&]() {
|
|
|
|
for (int i = 0; i < options.num_levels; i++) {
|
2019-09-19 19:32:33 +00:00
|
|
|
for (int j = 0; j < options.num_levels - i + 1; j++) {
|
2019-09-17 04:00:13 +00:00
|
|
|
for (int k = 0; k < 1000; k++) {
|
2020-07-09 21:33:42 +00:00
|
|
|
ASSERT_OK(Put(Key(k + j * 1000), rnd.RandomString(50)));
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
|
2019-09-19 19:32:33 +00:00
|
|
|
for (int l = 1; l < options.num_levels - i; l++) {
|
2019-09-17 04:00:13 +00:00
|
|
|
MoveFilesToLevel(l);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
generate_files();
|
|
|
|
ASSERT_EQ("2,3,4,5,6,7,8", FilesPerLevel());
|
|
|
|
int run_manual_compactions = 0;
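// The PausingManualCompaction:2 callback below cancels the compaction the
// first time it fires, so it should be hit exactly once.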
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2019-09-17 04:00:13 +00:00
|
|
|
"CompactionJob::Run():PausingManualCompaction:2", [&](void* arg) {
|
2022-06-07 01:32:26 +00:00
|
|
|
auto canceled = static_cast<std::atomic<bool>*>(arg);
|
|
|
|
// CompactRange triggers the manual compaction; cancel the compaction
|
|
|
|
// by setting *canceled to true
|
|
|
|
if (canceled != nullptr) {
|
|
|
|
canceled->store(true, std::memory_order_release);
|
|
|
|
}
|
|
|
|
run_manual_compactions++;
|
|
|
|
});
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"TestCompactFiles:PausingManualCompaction:3", [&](void* arg) {
|
2020-08-14 18:28:12 +00:00
|
|
|
auto paused = static_cast<std::atomic<int>*>(arg);
|
2022-06-07 01:32:26 +00:00
|
|
|
// CompactFiles() relies on manual_compactions_paused to
|
|
|
|
// determine if this compaction should be paused or not
|
2020-08-14 18:28:12 +00:00
|
|
|
ASSERT_EQ(0, paused->load(std::memory_order_acquire));
|
|
|
|
paused->fetch_add(1, std::memory_order_release);
|
2019-09-17 04:00:13 +00:00
|
|
|
});
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2019-09-17 04:00:13 +00:00
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactRange(compact_options, nullptr, nullptr)
|
|
|
|
.IsManualCompactionPaused());
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2019-09-17 04:00:13 +00:00
|
|
|
ASSERT_EQ(run_manual_compactions, 1);
|
|
|
|
ASSERT_EQ("2,3,4,5,6,7,8", FilesPerLevel());
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
2019-09-17 04:00:13 +00:00
|
|
|
"CompactionJob::Run():PausingManualCompaction:2");
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->CompactRange(compact_options, nullptr, nullptr));
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2019-09-17 04:00:13 +00:00
|
|
|
ASSERT_EQ("0,0,0,0,0,0,2", FilesPerLevel());
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
|
2021-06-07 18:40:31 +00:00
|
|
|
TEST_F(DBTest2, CancelManualCompaction1) {
|
|
|
|
CompactRangeOptions compact_options;
|
|
|
|
auto canceledPtr =
|
|
|
|
std::unique_ptr<std::atomic<bool>>(new std::atomic<bool>{true});
|
|
|
|
compact_options.canceled = canceledPtr.get();
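// The canceled flag is owned by the test; the sync-point callbacks below
// flip it to cancel the manual compaction mid-run.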
|
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.num_levels = 7;
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
auto generate_files = [&]() {
|
|
|
|
for (int i = 0; i < options.num_levels; i++) {
|
|
|
|
for (int j = 0; j < options.num_levels - i + 1; j++) {
|
|
|
|
for (int k = 0; k < 1000; k++) {
|
|
|
|
ASSERT_OK(Put(Key(k + j * 1000), rnd.RandomString(50)));
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2021-06-07 18:40:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
for (int l = 1; l < options.num_levels - i; l++) {
|
|
|
|
MoveFilesToLevel(l);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
generate_files();
|
|
|
|
ASSERT_EQ("2,3,4,5,6,7,8", FilesPerLevel());
|
|
|
|
|
|
|
|
int run_manual_compactions = 0;
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"CompactionJob::Run():PausingManualCompaction:1",
|
|
|
|
[&](void* /*arg*/) { run_manual_compactions++; });
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
// Setup a callback to disable compactions after a couple of levels are
|
|
|
|
// compacted
|
|
|
|
int compactions_run = 0;
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"DBImpl::RunManualCompaction()::1",
|
|
|
|
[&](void* /*arg*/) { ++compactions_run; });
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactRange(compact_options, nullptr, nullptr)
|
|
|
|
.IsManualCompactionPaused());
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-06-07 18:40:31 +00:00
|
|
|
|
|
|
|
// Since the canceled flag is already set, we shouldn't start compacting.
|
|
|
|
// I.e. the RunManualCompaction()::1 sync point should not be hit at all.
|
|
|
|
ASSERT_EQ(compactions_run, 0);
|
|
|
|
ASSERT_EQ(run_manual_compactions, 0);
|
|
|
|
ASSERT_EQ("2,3,4,5,6,7,8", FilesPerLevel());
|
|
|
|
|
|
|
|
compactions_run = 0;
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
|
|
|
"DBImpl::RunManualCompaction()::1");
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"DBImpl::RunManualCompaction()::1", [&](void* /*arg*/) {
|
|
|
|
++compactions_run;
|
|
|
|
// After 3 compactions, cancel the rest of the manual compaction
|
|
|
|
if (compactions_run == 3) {
|
|
|
|
compact_options.canceled->store(true, std::memory_order_release);
|
|
|
|
}
|
|
|
|
});
|
|
|
|
|
|
|
|
compact_options.canceled->store(false, std::memory_order_release);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactRange(compact_options, nullptr, nullptr)
|
|
|
|
.IsManualCompactionPaused());
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-06-07 18:40:31 +00:00
|
|
|
|
|
|
|
ASSERT_EQ(compactions_run, 3);
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
|
|
|
"DBImpl::RunManualCompaction()::1");
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
|
|
|
"CompactionJob::Run():PausingManualCompaction:1");
|
|
|
|
|
|
|
|
// Compactions should work again if we re-enable them.
|
|
|
|
compact_options.canceled->store(false, std::memory_order_relaxed);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->CompactRange(compact_options, nullptr, nullptr));
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-06-07 18:40:31 +00:00
|
|
|
ASSERT_EQ("0,0,0,0,0,0,2", FilesPerLevel());
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_F(DBTest2, CancelManualCompaction2) {
|
|
|
|
CompactRangeOptions compact_options;
|
|
|
|
auto canceledPtr =
|
|
|
|
std::unique_ptr<std::atomic<bool>>(new std::atomic<bool>{true});
|
|
|
|
compact_options.canceled = canceledPtr.get();
|
|
|
|
compact_options.max_subcompactions = 1;
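// A single sub-compaction keeps the ProcessKV callback single-threaded, so
// the stop counters recorded below are deterministic (see the NOTE after the
// first CompactRange call).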
|
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.num_levels = 7;
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
auto generate_files = [&]() {
|
|
|
|
for (int i = 0; i < options.num_levels; i++) {
|
|
|
|
for (int j = 0; j < options.num_levels - i + 1; j++) {
|
|
|
|
for (int k = 0; k < 1000; k++) {
|
|
|
|
ASSERT_OK(Put(Key(k + j * 1000), rnd.RandomString(50)));
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2021-06-07 18:40:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
for (int l = 1; l < options.num_levels - i; l++) {
|
|
|
|
MoveFilesToLevel(l);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
generate_files();
|
|
|
|
ASSERT_EQ("2,3,4,5,6,7,8", FilesPerLevel());
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
int compactions_run = 0;
|
|
|
|
std::atomic<int> kv_compactions{0};
|
|
|
|
int compactions_stopped_at = 0;
|
|
|
|
int kv_compactions_stopped_at = 0;
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"DBImpl::RunManualCompaction()::1", [&](void* /*arg*/) {
|
|
|
|
++compactions_run;
|
|
|
|
// (Cancellation is driven from the CompactionIterator:ProcessKV callback below.)
|
|
|
|
});
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"CompactionIterator:ProcessKV", [&](void* /*arg*/) {
|
|
|
|
int kv_compactions_run =
|
|
|
|
kv_compactions.fetch_add(1, std::memory_order_release);
|
|
|
|
if (kv_compactions_run == 5) {
|
|
|
|
compact_options.canceled->store(true, std::memory_order_release);
|
|
|
|
kv_compactions_stopped_at = kv_compactions_run;
|
|
|
|
compactions_stopped_at = compactions_run;
|
|
|
|
}
|
|
|
|
});
|
|
|
|
|
|
|
|
compact_options.canceled->store(false, std::memory_order_release);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactRange(compact_options, nullptr, nullptr)
|
|
|
|
.IsManualCompactionPaused());
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-06-07 18:40:31 +00:00
|
|
|
|
|
|
|
// NOTE: as we set compact_options.max_subcompactions = 1, and store true to
|
|
|
|
// the canceled variable from the single compacting thread (via callback),
|
|
|
|
// this value is deterministically kv_compactions_stopped_at + 1.
|
|
|
|
ASSERT_EQ(kv_compactions, kv_compactions_stopped_at + 1);
|
|
|
|
ASSERT_EQ(compactions_run, compactions_stopped_at);
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
|
|
|
"CompactionIterator::ProcessKV");
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
|
|
|
"DBImpl::RunManualCompaction()::1");
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
|
|
|
"CompactionJob::Run():PausingManualCompaction:1");
|
|
|
|
|
|
|
|
// Compactions should work again if we re-enable them.
|
|
|
|
compact_options.canceled->store(false, std::memory_order_relaxed);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->CompactRange(compact_options, nullptr, nullptr));
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-06-07 18:40:31 +00:00
|
|
|
ASSERT_EQ("0,0,0,0,0,0,2", FilesPerLevel());
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
}
|
|
|
|
|
2021-07-02 02:17:21 +00:00
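// Listener that counts compaction begin/end notifications and verifies that
// a completed compaction reports the expected status code and subcode.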
|
|
|
class CancelCompactionListener : public EventListener {
|
|
|
|
public:
|
|
|
|
CancelCompactionListener()
|
|
|
|
: num_compaction_started_(0), num_compaction_ended_(0) {}
|
|
|
|
|
|
|
|
void OnCompactionBegin(DB* /*db*/, const CompactionJobInfo& ci) override {
|
|
|
|
ASSERT_EQ(ci.cf_name, "default");
|
|
|
|
ASSERT_EQ(ci.base_input_level, 0);
|
|
|
|
num_compaction_started_++;
|
|
|
|
}
|
|
|
|
|
|
|
|
void OnCompactionCompleted(DB* /*db*/, const CompactionJobInfo& ci) override {
|
|
|
|
ASSERT_EQ(ci.cf_name, "default");
|
|
|
|
ASSERT_EQ(ci.base_input_level, 0);
|
|
|
|
ASSERT_EQ(ci.status.code(), code_);
|
|
|
|
ASSERT_EQ(ci.status.subcode(), subcode_);
|
|
|
|
num_compaction_ended_++;
|
|
|
|
}
|
|
|
|
|
|
|
|
std::atomic<size_t> num_compaction_started_;
|
|
|
|
std::atomic<size_t> num_compaction_ended_;
|
|
|
|
Status::Code code_;
|
|
|
|
Status::SubCode subcode_;
|
|
|
|
};
|
|
|
|
|
|
|
|
TEST_F(DBTest2, CancelManualCompactionWithListener) {
|
|
|
|
CompactRangeOptions compact_options;
|
|
|
|
auto canceledPtr =
|
|
|
|
std::unique_ptr<std::atomic<bool>>(new std::atomic<bool>{true});
|
|
|
|
compact_options.canceled = canceledPtr.get();
|
|
|
|
compact_options.max_subcompactions = 1;
|
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
CancelCompactionListener* listener = new CancelCompactionListener();
|
|
|
|
options.listeners.emplace_back(listener);
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
for (int i = 0; i < 10; i++) {
|
|
|
|
for (int j = 0; j < 10; j++) {
|
|
|
|
ASSERT_OK(Put(Key(i + j * 10), rnd.RandomString(50)));
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2021-07-02 02:17:21 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"CompactionIterator:ProcessKV", [&](void* /*arg*/) {
|
|
|
|
compact_options.canceled->store(true, std::memory_order_release);
|
|
|
|
});
|
|
|
|
|
|
|
|
int running_compaction = 0;
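// FinishCompactionOutputFile1 is reached only when a compaction actually
// produces an output file, so running_compaction stays 0 whenever the manual
// compaction is canceled before doing any work.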
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"CompactionJob::FinishCompactionOutputFile1",
|
|
|
|
[&](void* /*arg*/) { running_compaction++; });
|
|
|
|
|
2022-06-07 01:32:26 +00:00
|
|
|
// Case I: 1) notify compaction begin, 2) set *canceled to true in the
|
|
|
|
// callback to cancel the manual compaction, 3) the compaction does not run,
|
|
|
|
// 4) notify compaction end.
|
2021-07-02 02:17:21 +00:00
|
|
|
listener->code_ = Status::kIncomplete;
|
|
|
|
listener->subcode_ = Status::SubCode::kManualCompactionPaused;
|
|
|
|
|
|
|
|
compact_options.canceled->store(false, std::memory_order_release);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactRange(compact_options, nullptr, nullptr)
|
|
|
|
.IsManualCompactionPaused());
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-07-02 02:17:21 +00:00
|
|
|
|
|
|
|
ASSERT_GT(listener->num_compaction_started_, 0);
|
|
|
|
ASSERT_EQ(listener->num_compaction_started_, listener->num_compaction_ended_);
|
|
|
|
ASSERT_EQ(running_compaction, 0);
|
|
|
|
|
|
|
|
listener->num_compaction_started_ = 0;
|
|
|
|
listener->num_compaction_ended_ = 0;
|
|
|
|
|
2022-06-07 01:32:26 +00:00
|
|
|
// Case II: 1) *canceled is still true when the manual compaction starts, so
|
|
|
|
// it is rejected up front, 2) the begin notification returns without
|
|
|
|
// notifying, 3) the end notification returns without notifying.
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_TRUE(dbfull()
|
|
|
|
->CompactRange(compact_options, nullptr, nullptr)
|
|
|
|
.IsManualCompactionPaused());
|
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-07-02 02:17:21 +00:00
|
|
|
|
|
|
|
ASSERT_EQ(listener->num_compaction_started_, 0);
|
|
|
|
ASSERT_EQ(listener->num_compaction_started_, listener->num_compaction_ended_);
|
|
|
|
ASSERT_EQ(running_compaction, 0);
|
|
|
|
|
|
|
|
// Case III: 1) notify compaction begin, 2) the compaction runs,
|
2022-06-07 01:32:26 +00:00
|
|
|
// 3) set *canceled to true in the callback (too late to cancel the
|
|
|
|
// compaction), 4) notify compaction end.
|
2021-07-02 02:17:21 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearCallBack(
|
|
|
|
"CompactionIterator:ProcessKV");
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"CompactionJob::Run:BeforeVerify", [&](void* /*arg*/) {
|
|
|
|
compact_options.canceled->store(true, std::memory_order_release);
|
|
|
|
});
|
|
|
|
|
|
|
|
listener->code_ = Status::kOk;
|
|
|
|
listener->subcode_ = Status::SubCode::kNone;
|
|
|
|
|
|
|
|
compact_options.canceled->store(false, std::memory_order_release);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->CompactRange(compact_options, nullptr, nullptr));
|
2023-05-18 01:13:50 +00:00
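As a hedged illustration of the public API described in this summary (only `WaitForCompactOptions` and its `abort_on_pause` field come from the summary; the exact method spelling, the other status checks, and the variable names here are assumptions), a caller that wants to drain compaction debt might look roughly like this:

WaitForCompactOptions wait_opts;
wait_opts.abort_on_pause = true;  // return Status::Aborted() instead of
                                  // waiting while background work is paused
Status wait_status = db_->WaitForCompact(wait_opts);
if (wait_status.IsAborted()) {
  // PauseBackgroundWork() is in effect; nothing will be scheduled until it
  // resumes, so give up instead of waiting indefinitely.
} else if (wait_status.IsShutdownInProgress()) {
  // The DB began closing while we were waiting.
} else {
  ASSERT_OK(wait_status);
}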
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-07-02 02:17:21 +00:00
|
|
|
|
|
|
|
ASSERT_GT(listener->num_compaction_started_, 0);
|
|
|
|
ASSERT_EQ(listener->num_compaction_started_, listener->num_compaction_ended_);
|
|
|
|
|
|
|
|
// Compaction job will succeed.
|
|
|
|
ASSERT_GT(running_compaction, 0);
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_F(DBTest2, CompactionOnBottomPriorityWithListener) {
|
|
|
|
int num_levels = 3;
|
|
|
|
const int kNumFilesTrigger = 4;
|
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
env_->SetBackgroundThreads(0, Env::Priority::HIGH);
|
|
|
|
env_->SetBackgroundThreads(0, Env::Priority::LOW);
|
|
|
|
env_->SetBackgroundThreads(1, Env::Priority::BOTTOM);
|
|
|
|
options.env = env_;
|
|
|
|
options.compaction_style = kCompactionStyleUniversal;
|
|
|
|
options.num_levels = num_levels;
|
|
|
|
options.write_buffer_size = 100 << 10; // 100KB
|
|
|
|
options.target_file_size_base = 32 << 10; // 32KB
|
|
|
|
options.level0_file_num_compaction_trigger = kNumFilesTrigger;
|
|
|
|
// Trigger compaction if size amplification exceeds 110%
|
|
|
|
options.compaction_options_universal.max_size_amplification_percent = 110;
|
|
|
|
|
|
|
|
CancelCompactionListener* listener = new CancelCompactionListener();
|
|
|
|
options.listeners.emplace_back(listener);
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
int num_bottom_thread_compaction_scheduled = 0;
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"DBImpl::BackgroundCompaction:ForwardToBottomPriPool",
|
|
|
|
[&](void* /*arg*/) { num_bottom_thread_compaction_scheduled++; });
|
|
|
|
|
|
|
|
int num_compaction_jobs = 0;
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"CompactionJob::Run():End",
|
|
|
|
[&](void* /*arg*/) { num_compaction_jobs++; });
|
|
|
|
|
|
|
|
listener->code_ = Status::kOk;
|
|
|
|
listener->subcode_ = Status::SubCode::kNone;
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
for (int i = 0; i < 1; ++i) {
|
|
|
|
for (int num = 0; num < kNumFilesTrigger; num++) {
|
|
|
|
int key_idx = 0;
|
|
|
|
GenerateNewFile(&rnd, &key_idx, true /* no_wait */);
|
|
|
|
// use no_wait above because GenerateNewFile() otherwise waits for flush
// and compaction. We
|
|
|
|
// don't want to wait for compaction because the full compaction is
|
|
|
|
// intentionally blocked while more files are flushed.
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
|
|
|
ASSERT_GT(num_bottom_thread_compaction_scheduled, 0);
|
|
|
|
ASSERT_EQ(num_compaction_jobs, 1);
|
|
|
|
ASSERT_GT(listener->num_compaction_started_, 0);
|
|
|
|
ASSERT_EQ(listener->num_compaction_started_, listener->num_compaction_ended_);
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
}
|
|
|
|
|
2017-01-20 18:43:59 +00:00
|
|
|
TEST_F(DBTest2, OptimizeForPointLookup) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
Close();
|
|
|
|
options.OptimizeForPointLookup(2);
|
|
|
|
ASSERT_OK(DB::Open(options, dbname_, &db_));
|
|
|
|
|
|
|
|
ASSERT_OK(Put("foo", "v1"));
|
|
|
|
ASSERT_EQ("v1", Get("foo"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2017-01-20 18:43:59 +00:00
|
|
|
ASSERT_EQ("v1", Get("foo"));
|
|
|
|
}
|
2017-01-27 00:25:19 +00:00
|
|
|
|
2019-04-11 17:22:07 +00:00
|
|
|
TEST_F(DBTest2, OptimizeForSmallDB) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
Close();
|
|
|
|
options.OptimizeForSmallDb();
|
|
|
|
|
|
|
|
// Find the cache object
|
2020-09-14 23:59:00 +00:00
|
|
|
ASSERT_TRUE(options.table_factory->IsInstanceOf(
|
|
|
|
TableFactory::kBlockBasedTableName()));
|
|
|
|
auto table_options =
|
|
|
|
options.table_factory->GetOptions<BlockBasedTableOptions>();
|
|
|
|
|
2019-04-11 17:22:07 +00:00
|
|
|
ASSERT_TRUE(table_options != nullptr);
|
|
|
|
std::shared_ptr<Cache> cache = table_options->block_cache;
|
|
|
|
|
|
|
|
ASSERT_EQ(0, cache->GetUsage());
|
|
|
|
ASSERT_OK(DB::Open(options, dbname_, &db_));
|
|
|
|
ASSERT_OK(Put("foo", "v1"));
|
|
|
|
|
|
|
|
// The memtable size is charged to the block cache
|
|
|
|
ASSERT_NE(0, cache->GetUsage());
|
|
|
|
|
|
|
|
ASSERT_EQ("v1", Get("foo"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2019-04-11 17:22:07 +00:00
|
|
|
|
|
|
|
size_t prev_size = cache->GetUsage();
|
|
|
|
// Remember block cache size, so that we can find that
|
|
|
|
// it is filled after Get().
|
|
|
|
// Use a pinnable slice so that it pins the block so that
|
|
|
|
// when we check the size it is not evicted.
|
|
|
|
PinnableSlice value;
|
|
|
|
ASSERT_OK(db_->Get(ReadOptions(), db_->DefaultColumnFamily(), "foo", &value));
|
|
|
|
ASSERT_GT(cache->GetUsage(), prev_size);
|
|
|
|
value.Reset();
|
|
|
|
}
|
|
|
|
|
2017-02-03 19:35:22 +00:00
|
|
|
|
2020-06-18 17:15:16 +00:00
|
|
|
TEST_F(DBTest2, IterRaceFlush1) {
|
|
|
|
ASSERT_OK(Put("foo", "v1"));
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
|
|
|
|
{{"DBImpl::NewIterator:1", "DBTest2::IterRaceFlush:1"},
|
|
|
|
{"DBTest2::IterRaceFlush:2", "DBImpl::NewIterator:2"}});
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::port::Thread t1([&] {
|
|
|
|
TEST_SYNC_POINT("DBTest2::IterRaceFlush:1");
|
|
|
|
ASSERT_OK(Put("foo", "v2"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2020-06-18 17:15:16 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::IterRaceFlush:2");
|
|
|
|
});
|
|
|
|
|
2021-10-02 00:21:39 +00:00
|
|
|
// iterator is created after the first Put(), and its snapshot sequence is
|
|
|
|
// assigned after second Put(), so it must see v2.
|
2020-06-18 17:15:16 +00:00
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> it(db_->NewIterator(ReadOptions()));
|
|
|
|
it->Seek("foo");
|
|
|
|
ASSERT_TRUE(it->Valid());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(it->status());
|
2020-06-18 17:15:16 +00:00
|
|
|
ASSERT_EQ("foo", it->key().ToString());
|
2021-10-02 00:21:39 +00:00
|
|
|
ASSERT_EQ("v2", it->value().ToString());
|
2020-06-18 17:15:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
t1.join();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_F(DBTest2, IterRaceFlush2) {
|
|
|
|
ASSERT_OK(Put("foo", "v1"));
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
|
|
|
|
{{"DBImpl::NewIterator:3", "DBTest2::IterRaceFlush2:1"},
|
|
|
|
{"DBTest2::IterRaceFlush2:2", "DBImpl::NewIterator:4"}});
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::port::Thread t1([&] {
|
|
|
|
TEST_SYNC_POINT("DBTest2::IterRaceFlush2:1");
|
|
|
|
ASSERT_OK(Put("foo", "v2"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2020-06-18 17:15:16 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::IterRaceFlush2:2");
|
|
|
|
});
|
|
|
|
|
2021-10-02 00:21:39 +00:00
|
|
|
// iterator is created after the first Put(), and its snapshot sequence is
|
|
|
|
// assigned before second Put(), thus it must see v1.
|
2020-06-18 17:15:16 +00:00
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> it(db_->NewIterator(ReadOptions()));
|
|
|
|
it->Seek("foo");
|
|
|
|
ASSERT_TRUE(it->Valid());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(it->status());
|
2020-06-18 17:15:16 +00:00
|
|
|
ASSERT_EQ("foo", it->key().ToString());
|
2021-10-02 00:21:39 +00:00
|
|
|
ASSERT_EQ("v1", it->value().ToString());
|
2020-06-18 17:15:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
t1.join();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_F(DBTest2, IterRefreshRaceFlush) {
|
|
|
|
ASSERT_OK(Put("foo", "v1"));
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
|
|
|
|
{{"ArenaWrappedDBIter::Refresh:1", "DBTest2::IterRefreshRaceFlush:1"},
|
|
|
|
{"DBTest2::IterRefreshRaceFlush:2", "ArenaWrappedDBIter::Refresh:2"}});
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
ROCKSDB_NAMESPACE::port::Thread t1([&] {
|
|
|
|
TEST_SYNC_POINT("DBTest2::IterRefreshRaceFlush:1");
|
|
|
|
ASSERT_OK(Put("foo", "v2"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2020-06-18 17:15:16 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::IterRefreshRaceFlush:2");
|
|
|
|
});
|
|
|
|
|
2021-10-02 00:21:39 +00:00
|
|
|
// iterator is refreshed after the first Put(), and its sequence number is
|
|
|
|
// assigned after second Put(), thus it must see v2.
|
2020-06-18 17:15:16 +00:00
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> it(db_->NewIterator(ReadOptions()));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(it->status());
|
|
|
|
ASSERT_OK(it->Refresh());
|
2020-06-18 17:15:16 +00:00
|
|
|
it->Seek("foo");
|
|
|
|
ASSERT_TRUE(it->Valid());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(it->status());
|
2020-06-18 17:15:16 +00:00
|
|
|
ASSERT_EQ("foo", it->key().ToString());
|
2021-10-02 00:21:39 +00:00
|
|
|
ASSERT_EQ("v2", it->value().ToString());
|
2020-06-18 17:15:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
t1.join();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
}
|
|
|
|
|
2017-02-03 19:35:22 +00:00
|
|
|
TEST_F(DBTest2, GetRaceFlush1) {
|
2017-01-27 00:25:19 +00:00
|
|
|
ASSERT_OK(Put("foo", "v1"));
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
|
2017-01-27 00:25:19 +00:00
|
|
|
{{"DBImpl::GetImpl:1", "DBTest2::GetRaceFlush:1"},
|
|
|
|
{"DBTest2::GetRaceFlush:2", "DBImpl::GetImpl:2"}});
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2017-01-27 00:25:19 +00:00
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::port::Thread t1([&] {
|
2017-01-27 00:25:19 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::GetRaceFlush:1");
|
|
|
|
ASSERT_OK(Put("foo", "v2"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2017-01-27 00:25:19 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::GetRaceFlush:2");
|
|
|
|
});
|
|
|
|
|
|
|
|
// Get() is issued after the first Put(), so it should see either
|
|
|
|
// "v1" or "v2".
|
|
|
|
ASSERT_NE("NOT_FOUND", Get("foo"));
|
2017-02-03 19:35:22 +00:00
|
|
|
t1.join();
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2017-01-27 00:25:19 +00:00
|
|
|
}
|
2016-10-13 17:49:06 +00:00
|
|
|
|
2017-02-03 19:35:22 +00:00
|
|
|
TEST_F(DBTest2, GetRaceFlush2) {
|
|
|
|
ASSERT_OK(Put("foo", "v1"));
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
|
2017-02-03 19:35:22 +00:00
|
|
|
{{"DBImpl::GetImpl:3", "DBTest2::GetRaceFlush:1"},
|
|
|
|
{"DBTest2::GetRaceFlush:2", "DBImpl::GetImpl:4"}});
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2017-02-03 19:35:22 +00:00
|
|
|
|
2017-02-06 22:43:55 +00:00
|
|
|
port::Thread t1([&] {
|
2017-02-03 19:35:22 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::GetRaceFlush:1");
|
|
|
|
ASSERT_OK(Put("foo", "v2"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2017-02-03 19:35:22 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::GetRaceFlush:2");
|
|
|
|
});
|
|
|
|
|
|
|
|
// Get() is issued after the first Put(), so it should see either
|
|
|
|
// "v1" or "v2".
|
|
|
|
ASSERT_NE("NOT_FOUND", Get("foo"));
|
|
|
|
t1.join();
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2017-02-03 19:35:22 +00:00
|
|
|
}
|
2017-02-22 18:00:25 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, DirectIO) {
|
|
|
|
if (!IsDirectIOSupported()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
Options options = CurrentOptions();
|
2017-04-13 20:07:33 +00:00
|
|
|
options.use_direct_reads = options.use_direct_io_for_flush_and_compaction =
|
|
|
|
true;
|
2017-02-22 18:00:25 +00:00
|
|
|
options.allow_mmap_reads = options.allow_mmap_writes = false;
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put(Key(0), "a"));
|
|
|
|
ASSERT_OK(Put(Key(5), "a"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_OK(Put(Key(10), "a"));
|
|
|
|
ASSERT_OK(Put(Key(15), "a"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
|
|
|
|
Reopen(options);
|
|
|
|
}
|
2017-03-07 19:50:02 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, MemtableOnlyIterator) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put(1, "foo", "first"));
|
|
|
|
ASSERT_OK(Put(1, "bar", "second"));
|
|
|
|
|
|
|
|
ReadOptions ropt;
|
|
|
|
ropt.read_tier = kMemtableTier;
|
|
|
|
std::string value;
|
|
|
|
Iterator* it = nullptr;
|
|
|
|
|
|
|
|
// Before flushing
|
|
|
|
// point lookups
|
|
|
|
ASSERT_OK(db_->Get(ropt, handles_[1], "foo", &value));
|
|
|
|
ASSERT_EQ("first", value);
|
|
|
|
ASSERT_OK(db_->Get(ropt, handles_[1], "bar", &value));
|
|
|
|
ASSERT_EQ("second", value);
|
|
|
|
|
|
|
|
// Memtable-only iterator (read_tier=kMemtableTier); data not flushed yet.
|
|
|
|
it = db_->NewIterator(ropt, handles_[1]);
|
|
|
|
int count = 0;
|
|
|
|
for (it->SeekToFirst(); it->Valid(); it->Next()) {
|
|
|
|
ASSERT_TRUE(it->Valid());
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
ASSERT_TRUE(!it->Valid());
|
2023-10-18 16:38:38 +00:00
|
|
|
ASSERT_OK(it->status());
|
2017-03-07 19:50:02 +00:00
|
|
|
ASSERT_EQ(2, count);
|
|
|
|
delete it;
|
|
|
|
|
2023-08-09 22:46:44 +00:00
|
|
|
ASSERT_OK(Flush(1));
|
2017-03-07 19:50:02 +00:00
|
|
|
|
|
|
|
// After flushing
|
|
|
|
// point lookups
|
|
|
|
ASSERT_OK(db_->Get(ropt, handles_[1], "foo", &value));
|
|
|
|
ASSERT_EQ("first", value);
|
|
|
|
ASSERT_OK(db_->Get(ropt, handles_[1], "bar", &value));
|
|
|
|
ASSERT_EQ("second", value);
|
|
|
|
// nothing should be returned using memtable-only iterator after flushing.
|
|
|
|
it = db_->NewIterator(ropt, handles_[1]);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(it->status());
|
2017-03-07 19:50:02 +00:00
|
|
|
count = 0;
|
|
|
|
for (it->SeekToFirst(); it->Valid(); it->Next()) {
|
|
|
|
ASSERT_TRUE(it->Valid());
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
ASSERT_TRUE(!it->Valid());
|
|
|
|
ASSERT_EQ(0, count);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(it->status());
|
2017-03-07 19:50:02 +00:00
|
|
|
delete it;
|
|
|
|
|
|
|
|
// Add a key to memtable
|
|
|
|
ASSERT_OK(Put(1, "foobar", "third"));
|
|
|
|
it = db_->NewIterator(ropt, handles_[1]);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(it->status());
|
2017-03-07 19:50:02 +00:00
|
|
|
count = 0;
|
|
|
|
for (it->SeekToFirst(); it->Valid(); it->Next()) {
|
|
|
|
ASSERT_TRUE(it->Valid());
|
|
|
|
ASSERT_EQ("foobar", it->key().ToString());
|
|
|
|
ASSERT_EQ("third", it->value().ToString());
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
ASSERT_TRUE(!it->Valid());
|
|
|
|
ASSERT_EQ(1, count);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(it->status());
|
2017-03-07 19:50:02 +00:00
|
|
|
delete it;
|
|
|
|
}
|
2017-06-05 21:42:34 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, LowPriWrite) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
// Compaction pressure should trigger since 6 L0 files exceed the trigger of 4
|
|
|
|
options.level0_file_num_compaction_trigger = 4;
|
|
|
|
options.level0_slowdown_writes_trigger = 12;
|
|
|
|
options.level0_stop_writes_trigger = 30;
|
|
|
|
options.delayed_write_rate = 8 * 1024 * 1024;
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
std::atomic<int> rate_limit_count(0);
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2017-06-05 21:42:34 +00:00
|
|
|
"GenericRateLimiter::Request:1", [&](void* arg) {
|
|
|
|
rate_limit_count.fetch_add(1);
|
|
|
|
int64_t* rate_bytes_per_sec = static_cast<int64_t*>(arg);
|
|
|
|
ASSERT_EQ(1024 * 1024, *rate_bytes_per_sec);
|
|
|
|
});
|
|
|
|
// Block compaction
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency({
|
2017-06-05 21:42:34 +00:00
|
|
|
{"DBTest.LowPriWrite:0", "DBImpl::BGWorkCompaction"},
|
|
|
|
});
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2017-06-05 21:42:34 +00:00
|
|
|
WriteOptions wo;
|
|
|
|
for (int i = 0; i < 6; i++) {
|
|
|
|
wo.low_pri = false;
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("", "", wo));
|
2017-06-05 21:42:34 +00:00
|
|
|
wo.low_pri = true;
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("", "", wo));
|
|
|
|
ASSERT_OK(Flush());
|
2017-06-05 21:42:34 +00:00
|
|
|
}
|
|
|
|
ASSERT_EQ(0, rate_limit_count.load());
|
|
|
|
wo.low_pri = true;
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("", "", wo));
|
2017-06-05 21:42:34 +00:00
|
|
|
ASSERT_EQ(1, rate_limit_count.load());
|
|
|
|
wo.low_pri = false;
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("", "", wo));
|
2017-06-05 21:42:34 +00:00
|
|
|
ASSERT_EQ(1, rate_limit_count.load());
|
|
|
|
|
2023-10-24 21:41:46 +00:00
|
|
|
wo.low_pri = true;
|
|
|
|
std::string big_value = std::string(1 * 1024 * 1024, 'x');
|
|
|
|
ASSERT_OK(Put("", big_value, wo));
|
|
|
|
ASSERT_LT(1, rate_limit_count.load());
|
|
|
|
// Reset
|
|
|
|
rate_limit_count = 0;
|
|
|
|
wo.low_pri = false;
|
|
|
|
ASSERT_OK(Put("", big_value, wo));
|
|
|
|
ASSERT_EQ(0, rate_limit_count.load());
|
|
|
|
|
2017-06-05 21:42:34 +00:00
|
|
|
TEST_SYNC_POINT("DBTest.LowPriWrite:0");
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2017-06-05 21:42:34 +00:00
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2017-06-05 21:42:34 +00:00
|
|
|
wo.low_pri = true;
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("", "", wo));
|
2023-10-24 21:41:46 +00:00
|
|
|
ASSERT_EQ(0, rate_limit_count.load());
|
2017-06-05 21:42:34 +00:00
|
|
|
wo.low_pri = false;
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("", "", wo));
|
2023-10-24 21:41:46 +00:00
|
|
|
ASSERT_EQ(0, rate_limit_count.load());
|
2017-06-05 21:42:34 +00:00
|
|
|
}
|
2017-06-13 21:51:22 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, RateLimitedCompactionReads) {
|
|
|
|
// compaction input has 512KB data
|
|
|
|
const int kNumKeysPerFile = 128;
|
|
|
|
const int kBytesPerKey = 1024;
|
|
|
|
const int kNumL0Files = 4;
|
|
|
|
|
2022-02-17 07:17:03 +00:00
|
|
|
for (int compaction_readahead_size : {0, 32 << 10}) {
|
|
|
|
for (auto use_direct_io : {false, true}) {
|
|
|
|
if (use_direct_io && !IsDirectIOSupported()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.compaction_readahead_size = compaction_readahead_size;
|
|
|
|
options.compression = kNoCompression;
|
|
|
|
options.level0_file_num_compaction_trigger = kNumL0Files;
|
|
|
|
options.memtable_factory.reset(
|
|
|
|
test::NewSpecialSkipListFactory(kNumKeysPerFile));
|
|
|
|
// The rate limit makes reading the compaction input take roughly one
// second, split into 100 x 10ms intervals. Each
|
|
|
|
// interval permits 5.12KB, which is smaller than the block size, so this
|
|
|
|
// test exercises the code for chunking reads.
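// (4 files * 128 keys * 1KB = 512KB total at 512KB/s; each 10ms refill
// permits 512KB / 100 = 5.12KB, vs. the 16KB block_size configured below.)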
|
|
|
|
options.rate_limiter.reset(NewGenericRateLimiter(
|
|
|
|
static_cast<int64_t>(kNumL0Files * kNumKeysPerFile *
|
|
|
|
kBytesPerKey) /* rate_bytes_per_sec */,
|
|
|
|
10 * 1000 /* refill_period_us */, 10 /* fairness */,
|
|
|
|
RateLimiter::Mode::kReadsOnly));
|
|
|
|
options.use_direct_reads =
|
|
|
|
options.use_direct_io_for_flush_and_compaction = use_direct_io;
|
|
|
|
BlockBasedTableOptions bbto;
|
|
|
|
bbto.block_size = 16384;
|
|
|
|
bbto.no_block_cache = true;
|
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
|
|
|
|
DestroyAndReopen(options);
|
2017-06-13 21:51:22 +00:00
|
|
|
|
2022-02-17 07:17:03 +00:00
|
|
|
for (int i = 0; i < kNumL0Files; ++i) {
|
|
|
|
for (int j = 0; j <= kNumKeysPerFile; ++j) {
|
|
|
|
ASSERT_OK(Put(Key(j), DummyString(kBytesPerKey)));
|
|
|
|
}
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());
|
|
|
|
if (i + 1 < kNumL0Files) {
|
|
|
|
ASSERT_EQ(i + 1, NumTableFilesAtLevel(0));
|
|
|
|
}
|
2017-06-13 21:51:22 +00:00
|
|
|
}
|
2022-02-17 07:17:03 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
|
|
|
ASSERT_EQ(0, NumTableFilesAtLevel(0));
|
|
|
|
|
|
|
|
// The total bytes read should be slightly above 512KB due to non-data
// blocks read. We arbitrarily
|
|
|
|
// chose 1MB as the upper bound on the total bytes read.
|
Set Read rate limiter priority dynamically and pass it to FS (#9996)
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
### Solution
User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile.
**This PR is for the Read path.** The **Read:** dynamic priority for different state are listed as follows:
| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
| Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER |
| Compaction | IO_LOW | IO_USER | IO_USER |
| User | User provided | User provided | User provided |
We will respect the read_options that the user provided and will not set it.
The only sst read for Flush is the verification read in BuildTable(). It claims to be "regard as user read".
**Details**
1. Set read_options.rate_limiter_priority dynamically:
- User: Do not update the read_options. Use the read_options that the user provided.
- Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction().
- Flush: Update read_options in BuildTable().
2. Pass the rate limiter priority to FSRandomAccessFile functions:
- After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions, including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock().
- In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch().
- Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator().
### Test Plans
Add unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996
Reviewed By: anand1976
Differential Revision: D36452483
Pulled By: gitbw95
fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf
2022-05-19 02:41:44 +00:00
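A minimal sketch (not part of this change) of how an application-side read could share the same limiter, assuming `ReadOptions::rate_limiter_priority` is available to users as the summary's "User provided" column implies; the option values, key, and variable names below are illustrative only:

Options limiter_opts = CurrentOptions();
limiter_opts.rate_limiter.reset(NewGenericRateLimiter(
    1 << 20 /* rate_bytes_per_sec */, 100 * 1000 /* refill_period_us */,
    10 /* fairness */, RateLimiter::Mode::kReadsOnly));
Reopen(limiter_opts);

ReadOptions user_read_opts;
// User reads keep whatever priority the caller provides; here the read is
// charged against the limiter at IO_USER.
user_read_opts.rate_limiter_priority = Env::IO_USER;
std::string read_value;
Status user_read_status =
    db_->Get(user_read_opts, "some_key", &read_value);  // may be NotFound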
|
|
|
size_t rate_limited_bytes = static_cast<size_t>(
|
|
|
|
options.rate_limiter->GetTotalBytesThrough(Env::IO_TOTAL));
|
|
|
|
// The charges can exist for `IO_LOW` and `IO_USER` priorities.
|
|
|
|
size_t rate_limited_bytes_by_pri =
|
|
|
|
options.rate_limiter->GetTotalBytesThrough(Env::IO_LOW) +
|
|
|
|
options.rate_limiter->GetTotalBytesThrough(Env::IO_USER);
|
2022-02-17 07:17:03 +00:00
|
|
|
ASSERT_EQ(rate_limited_bytes,
|
2022-05-19 02:41:44 +00:00
|
|
|
static_cast<size_t>(rate_limited_bytes_by_pri));
|
2022-02-17 07:17:03 +00:00
|
|
|
// Include the explicit prefetch of the footer in direct I/O case.
|
|
|
|
size_t direct_io_extra = use_direct_io ? 512 * 1024 : 0;
|
|
|
|
ASSERT_GE(
|
|
|
|
rate_limited_bytes,
|
|
|
|
static_cast<size_t>(kNumKeysPerFile * kBytesPerKey * kNumL0Files));
|
|
|
|
ASSERT_LT(
|
|
|
|
rate_limited_bytes,
|
|
|
|
static_cast<size_t>(2 * kNumKeysPerFile * kBytesPerKey * kNumL0Files +
|
|
|
|
direct_io_extra));
|
|
|
|
|
|
|
|
Iterator* iter = db_->NewIterator(ReadOptions());
|
|
|
|
ASSERT_OK(iter->status());
|
|
|
|
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
|
|
|
|
ASSERT_EQ(iter->value().ToString(), DummyString(kBytesPerKey));
|
2021-11-19 18:08:06 +00:00
|
|
|
}
|
2022-02-17 07:17:03 +00:00
|
|
|
delete iter;
|
|
|
|
// bytes read for user iterator shouldn't count against the rate limit.
|
2022-05-19 02:41:44 +00:00
|
|
|
rate_limited_bytes_by_pri =
|
|
|
|
options.rate_limiter->GetTotalBytesThrough(Env::IO_LOW) +
|
|
|
|
options.rate_limiter->GetTotalBytesThrough(Env::IO_USER);
|
2022-02-17 07:17:03 +00:00
|
|
|
ASSERT_EQ(rate_limited_bytes,
|
2022-05-19 02:41:44 +00:00
|
|
|
static_cast<size_t>(rate_limited_bytes_by_pri));
|
2017-06-13 21:51:22 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2017-08-24 23:05:16 +00:00
|
|
|
|
|
|
|
// Make sure DB can be reopen with reduced number of levels, given no file
|
|
|
|
// is on levels higher than the new num_levels.
|
|
|
|
TEST_F(DBTest2, ReduceLevel) {
|
|
|
|
Options options;
|
Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 17:31:34 +00:00
|
|
|
options.env = env_;
|
2017-08-24 23:05:16 +00:00
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.num_levels = 7;
|
|
|
|
Reopen(options);
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
2017-08-24 23:05:16 +00:00
|
|
|
MoveFilesToLevel(6);
|
|
|
|
ASSERT_EQ("0,0,0,0,0,0,1", FilesPerLevel());
|
|
|
|
CompactRangeOptions compact_options;
|
|
|
|
compact_options.change_level = true;
|
|
|
|
compact_options.target_level = 1;
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->CompactRange(compact_options, nullptr, nullptr));
|
2017-08-24 23:05:16 +00:00
|
|
|
ASSERT_EQ("0,1", FilesPerLevel());
|
|
|
|
options.num_levels = 3;
|
|
|
|
Reopen(options);
|
|
|
|
ASSERT_EQ("0,1", FilesPerLevel());
|
|
|
|
}
|
2017-09-11 15:58:52 +00:00
|
|
|
|
|
|
|
// Test that ReadCallback is actually used in both memtable and sst tables
|
|
|
|
TEST_F(DBTest2, ReadCallbackTest) {
|
|
|
|
Options options;
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
options.num_levels = 7;
|
2020-10-27 17:31:34 +00:00
|
|
|
options.env = env_;
|
2017-09-11 15:58:52 +00:00
|
|
|
Reopen(options);
|
|
|
|
std::vector<const Snapshot*> snapshots;
|
|
|
|
// Try to create a db with multiple layers and a memtable
|
|
|
|
const std::string key = "foo";
|
|
|
|
const std::string value = "bar";
|
|
|
|
// This test assumes that the seq start with 1 and increased by 1 after each
|
|
|
|
// write batch of size 1. If that behavior changes, the test needs to be
|
|
|
|
// updated as well.
|
|
|
|
// TODO(myabandeh): update this test to use the seq number that is returned by
|
|
|
|
// the DB instead of assuming what seq the DB used.
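// (One way to address that TODO: record dbfull()->GetLatestSequenceNumber()
// after each Put() instead of assuming seq advances by exactly one per write.)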
|
|
|
|
int i = 1;
|
|
|
|
for (; i < 10; i++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put(key, value + std::to_string(i)));
|
2017-09-11 15:58:52 +00:00
|
|
|
// Take a snapshot to avoid the value being removed during compaction
|
|
|
|
auto snapshot = dbfull()->GetSnapshot();
|
|
|
|
snapshots.push_back(snapshot);
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2017-09-11 15:58:52 +00:00
|
|
|
for (; i < 20; i++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put(key, value + std::to_string(i)));
|
2017-09-11 15:58:52 +00:00
|
|
|
// Take a snapshot to avoid the value being removed during compaction
|
|
|
|
auto snapshot = dbfull()->GetSnapshot();
|
|
|
|
snapshots.push_back(snapshot);
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2017-09-11 15:58:52 +00:00
|
|
|
MoveFilesToLevel(6);
|
|
|
|
ASSERT_EQ("0,0,0,0,0,0,2", FilesPerLevel());
|
|
|
|
for (; i < 30; i++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put(key, value + std::to_string(i)));
|
2017-09-11 15:58:52 +00:00
|
|
|
auto snapshot = dbfull()->GetSnapshot();
|
|
|
|
snapshots.push_back(snapshot);
|
|
|
|
}
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2017-09-11 15:58:52 +00:00
|
|
|
ASSERT_EQ("1,0,0,0,0,0,2", FilesPerLevel());
|
|
|
|
// And also add some values to the memtable
|
|
|
|
for (; i < 40; i++) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put(key, value + std::to_string(i)));
|
2017-09-11 15:58:52 +00:00
|
|
|
auto snapshot = dbfull()->GetSnapshot();
|
|
|
|
snapshots.push_back(snapshot);
|
|
|
|
}
|
|
|
|
|
|
|
|
class TestReadCallback : public ReadCallback {
|
|
|
|
public:
|
2019-04-02 21:43:03 +00:00
|
|
|
explicit TestReadCallback(SequenceNumber snapshot)
|
|
|
|
: ReadCallback(snapshot), snapshot_(snapshot) {}
|
2019-02-27 00:52:20 +00:00
|
|
|
bool IsVisibleFullCheck(SequenceNumber seq) override {
|
|
|
|
return seq <= snapshot_;
|
|
|
|
}
|
2017-09-11 15:58:52 +00:00
|
|
|
|
|
|
|
private:
|
|
|
|
SequenceNumber snapshot_;
|
|
|
|
};
|
|
|
|
|
|
|
|
for (int seq = 1; seq < i; seq++) {
|
|
|
|
PinnableSlice pinnable_val;
|
|
|
|
ReadOptions roptions;
|
|
|
|
TestReadCallback callback(seq);
|
|
|
|
bool dont_care = true;
|
New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), &merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 21:22:34 +00:00
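A more complete, hedged version of the usage sketched in the summary above; the buffer size and key are illustrative, while the method and field names follow the summary:

const int kExpectedMaxOperands = 100;
std::vector<PinnableSlice> operands(kExpectedMaxOperands);
GetMergeOperandsOptions merge_operands_info;
merge_operands_info.expected_max_number_of_operands = kExpectedMaxOperands;
int number_of_operands = 0;
Status mo_status = db_->GetMergeOperands(
    ReadOptions(), db_->DefaultColumnFamily(), "k1", operands.data(),
    &merge_operands_info, &number_of_operands);
if (mo_status.IsIncomplete()) {
  // More operands exist than the buffer can hold; number_of_operands reports
  // the total found, so the caller can retry with a larger buffer.
} else if (mo_status.ok()) {
  for (int i = 0; i < number_of_operands; ++i) {
    // operands[i] holds one merge operand for "k1", in insertion order.
  }
}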
|
|
|
DBImpl::GetImplOptions get_impl_options;
|
|
|
|
get_impl_options.column_family = dbfull()->DefaultColumnFamily();
|
|
|
|
get_impl_options.value = &pinnable_val;
|
|
|
|
get_impl_options.value_found = &dont_care;
|
|
|
|
get_impl_options.callback = &callback;
|
|
|
|
Status s = dbfull()->GetImpl(roptions, key, get_impl_options);
|
2017-09-11 15:58:52 +00:00
|
|
|
ASSERT_TRUE(s.ok());
|
|
|
|
// Assuming that after each Put the DB increased seq by one, the value and
|
|
|
|
// seq number must be equal since we also increment the value by 1 after
// each Put.
|
|
|
|
ASSERT_EQ(value + std::to_string(seq), pinnable_val.ToString());
|
|
|
|
}
|
|
|
|
|
|
|
|
for (auto snapshot : snapshots) {
|
|
|
|
dbfull()->ReleaseSnapshot(snapshot);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-01-18 01:37:10 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, LiveFilesOmitObsoleteFiles) {
|
|
|
|
// Regression test for race condition where an obsolete file is returned to
|
|
|
|
// user as a "live file" but then deleted, all while file deletions are
|
|
|
|
// disabled.
|
|
|
|
//
|
|
|
|
// It happened like this:
|
|
|
|
//
|
|
|
|
// 1. [flush thread] Log file "x.log" found by FindObsoleteFiles
|
|
|
|
// 2. [user thread] DisableFileDeletions, GetSortedWalFiles are called and the
|
|
|
|
// latter returned "x.log"
|
|
|
|
// 3. [flush thread] PurgeObsoleteFiles deleted "x.log"
|
|
|
|
// 4. [user thread] Reading "x.log" failed
|
|
|
|
//
|
|
|
|
// Unfortunately the only regression test I can come up with involves sleep.
|
|
|
|
// We cannot set SyncPoints to repro since, once the fix is applied, the
|
|
|
|
// SyncPoints would cause a deadlock as the repro's sequence of events is now
|
|
|
|
// prohibited.
|
|
|
|
//
|
|
|
|
// Instead, if we sleep for a second between Find and Purge, and ensure the
|
|
|
|
// read attempt happens after purge, then the sequence of events will almost
|
|
|
|
// certainly happen on the old code.
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency({
|
2018-01-18 01:37:10 +00:00
|
|
|
{"DBImpl::BackgroundCallFlush:FilesFound",
|
|
|
|
"DBTest2::LiveFilesOmitObsoleteFiles:FlushTriggered"},
|
|
|
|
{"DBImpl::PurgeObsoleteFiles:End",
|
|
|
|
"DBTest2::LiveFilesOmitObsoleteFiles:LiveFilesCaptured"},
|
|
|
|
});
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2018-01-18 01:37:10 +00:00
|
|
|
"DBImpl::PurgeObsoleteFiles:Begin",
|
2018-04-13 00:55:14 +00:00
|
|
|
[&](void* /*arg*/) { env_->SleepForMicroseconds(1000000); });
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2018-01-18 01:37:10 +00:00
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("key", "val"));
|
2018-01-18 01:37:10 +00:00
|
|
|
FlushOptions flush_opts;
|
|
|
|
flush_opts.wait = false;
|
2023-08-09 22:46:44 +00:00
|
|
|
ASSERT_OK(db_->Flush(flush_opts));
|
2018-01-18 01:37:10 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::LiveFilesOmitObsoleteFiles:FlushTriggered");
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(db_->DisableFileDeletions());
|
2018-01-18 01:37:10 +00:00
|
|
|
VectorLogPtr log_files;
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(db_->GetSortedWalFiles(log_files));
|
2018-01-18 01:37:10 +00:00
|
|
|
TEST_SYNC_POINT("DBTest2::LiveFilesOmitObsoleteFiles:LiveFilesCaptured");
|
|
|
|
for (const auto& log_file : log_files) {
|
|
|
|
ASSERT_OK(env_->FileExists(LogFileName(dbname_, log_file->LogNumber())));
|
|
|
|
}
|
|
|
|
|
2023-11-10 22:35:54 +00:00
|
|
|
ASSERT_OK(db_->EnableFileDeletions(/*force=*/false));
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2018-01-18 01:37:10 +00:00
|
|
|
}
|
|
|
|
|
2018-11-13 20:47:52 +00:00
|
|
|
TEST_F(DBTest2, TestNumPread) {
|
|
|
|
Options options = CurrentOptions();
|
2021-01-06 18:48:24 +00:00
|
|
|
bool prefetch_supported =
|
|
|
|
test::IsPrefetchSupported(env_->GetFileSystem(), dbname_);
|
2018-11-13 20:47:52 +00:00
|
|
|
// disable block cache
|
|
|
|
BlockBasedTableOptions table_options;
|
|
|
|
table_options.no_block_cache = true;
|
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
|
|
|
Reopen(options);
|
|
|
|
env_->count_random_reads_ = true;
|
|
|
|
env_->random_file_open_counter_.store(0);
|
|
|
|
ASSERT_OK(Put("bar", "foo"));
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
2021-01-06 18:48:24 +00:00
|
|
|
if (prefetch_supported) {
|
|
|
|
// After flush, we'll open the file and read footer, meta block,
|
|
|
|
// property block and index block.
|
|
|
|
ASSERT_EQ(4, env_->random_read_counter_.Read());
|
|
|
|
} else {
|
|
|
|
// With prefetch not supported, we will do a single read into a buffer
|
|
|
|
ASSERT_EQ(1, env_->random_read_counter_.Read());
|
|
|
|
}
|
2018-11-13 20:47:52 +00:00
|
|
|
ASSERT_EQ(1, env_->random_file_open_counter_.load());
|
|
|
|
|
|
|
|
// One pread per normal data block read
|
|
|
|
env_->random_file_open_counter_.store(0);
|
|
|
|
env_->random_read_counter_.Reset();
|
|
|
|
ASSERT_EQ("bar", Get("foo"));
|
|
|
|
ASSERT_EQ(1, env_->random_read_counter_.Read());
|
|
|
|
// All files are already opened.
|
|
|
|
ASSERT_EQ(0, env_->random_file_open_counter_.load());
|
|
|
|
|
|
|
|
env_->random_file_open_counter_.store(0);
|
|
|
|
env_->random_read_counter_.Reset();
|
|
|
|
ASSERT_OK(Put("bar2", "foo2"));
|
|
|
|
ASSERT_OK(Put("foo2", "bar2"));
|
|
|
|
ASSERT_OK(Flush());
|
2021-01-06 18:48:24 +00:00
|
|
|
if (prefetch_supported) {
|
|
|
|
// After flush, we'll open the file and read footer, meta block,
|
|
|
|
// property block and index block.
|
|
|
|
ASSERT_EQ(4, env_->random_read_counter_.Read());
|
|
|
|
} else {
|
|
|
|
// With prefetch not supported, we will do a single read into a buffer
|
|
|
|
ASSERT_EQ(1, env_->random_read_counter_.Read());
|
|
|
|
}
|
2018-11-13 20:47:52 +00:00
|
|
|
ASSERT_EQ(1, env_->random_file_open_counter_.load());
|
|
|
|
|
|
|
|
env_->random_file_open_counter_.store(0);
|
|
|
|
env_->random_read_counter_.Reset();
|
|
|
|
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
|
2021-01-06 18:48:24 +00:00
|
|
|
if (prefetch_supported) {
|
|
|
|
// Compaction needs two input blocks, which requires 2 preads, and
|
|
|
|
// generates a new SST file which needs 4 preads (footer, meta block,
|
|
|
|
// property block and index block). In total 6.
|
|
|
|
ASSERT_EQ(6, env_->random_read_counter_.Read());
|
|
|
|
} else {
|
|
|
|
// With prefetch off, compaction needs two input blocks,
|
|
|
|
// followed by a single buffered read. In total 3.
|
|
|
|
ASSERT_EQ(3, env_->random_read_counter_.Read());
|
|
|
|
}
|
|
|
|
// All compaction input files should have already been opened.
|
2018-11-13 20:47:52 +00:00
|
|
|
ASSERT_EQ(1, env_->random_file_open_counter_.load());
|
|
|
|
|
|
|
|
// One pread per normal data block read
|
|
|
|
env_->random_file_open_counter_.store(0);
|
|
|
|
env_->random_read_counter_.Reset();
|
|
|
|
ASSERT_EQ("foo2", Get("bar2"));
|
|
|
|
ASSERT_EQ(1, env_->random_read_counter_.Read());
|
|
|
|
// SST files are already opened.
|
|
|
|
ASSERT_EQ(0, env_->random_file_open_counter_.load());
|
|
|
|
}
|
|
|
|
|
2021-08-19 00:04:36 +00:00
|
|
|
class TraceExecutionResultHandler : public TraceRecordResult::Handler {
|
|
|
|
public:
|
2024-01-05 19:53:57 +00:00
|
|
|
TraceExecutionResultHandler() = default;
|
|
|
|
~TraceExecutionResultHandler() override = default;
|
2021-08-19 00:04:36 +00:00
|
|
|
|
|
|
|
virtual Status Handle(const StatusOnlyTraceExecutionResult& result) override {
|
|
|
|
if (result.GetStartTimestamp() > result.GetEndTimestamp()) {
|
|
|
|
return Status::InvalidArgument("Invalid timestamps.");
|
|
|
|
}
|
|
|
|
result.GetStatus().PermitUncheckedError();
|
|
|
|
switch (result.GetTraceType()) {
|
|
|
|
case kTraceWrite: {
|
|
|
|
total_latency_ += result.GetLatency();
|
|
|
|
cnt_++;
|
|
|
|
writes_++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
return Status::Corruption("Type mismatch.");
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
virtual Status Handle(
|
|
|
|
const SingleValueTraceExecutionResult& result) override {
|
|
|
|
if (result.GetStartTimestamp() > result.GetEndTimestamp()) {
|
|
|
|
return Status::InvalidArgument("Invalid timestamps.");
|
|
|
|
}
|
|
|
|
result.GetStatus().PermitUncheckedError();
|
|
|
|
switch (result.GetTraceType()) {
|
|
|
|
case kTraceGet: {
|
|
|
|
total_latency_ += result.GetLatency();
|
|
|
|
cnt_++;
|
|
|
|
gets_++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
return Status::Corruption("Type mismatch.");
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
virtual Status Handle(
|
|
|
|
const MultiValuesTraceExecutionResult& result) override {
|
|
|
|
if (result.GetStartTimestamp() > result.GetEndTimestamp()) {
|
|
|
|
return Status::InvalidArgument("Invalid timestamps.");
|
|
|
|
}
|
|
|
|
for (const Status& s : result.GetMultiStatus()) {
|
|
|
|
s.PermitUncheckedError();
|
|
|
|
}
|
|
|
|
switch (result.GetTraceType()) {
|
|
|
|
case kTraceMultiGet: {
|
|
|
|
total_latency_ += result.GetLatency();
|
|
|
|
cnt_++;
|
|
|
|
multigets_++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
return Status::Corruption("Type mismatch.");
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
2021-08-20 22:32:55 +00:00
|
|
|
virtual Status Handle(const IteratorTraceExecutionResult& result) override {
|
|
|
|
if (result.GetStartTimestamp() > result.GetEndTimestamp()) {
|
|
|
|
return Status::InvalidArgument("Invalid timestamps.");
|
|
|
|
}
|
|
|
|
result.GetStatus().PermitUncheckedError();
|
|
|
|
switch (result.GetTraceType()) {
|
|
|
|
case kTraceIteratorSeek:
|
|
|
|
case kTraceIteratorSeekForPrev: {
|
|
|
|
total_latency_ += result.GetLatency();
|
|
|
|
cnt_++;
|
|
|
|
seeks_++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
return Status::Corruption("Type mismatch.");
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
2021-08-19 00:04:36 +00:00
|
|
|
void Reset() {
|
|
|
|
total_latency_ = 0;
|
|
|
|
cnt_ = 0;
|
|
|
|
writes_ = 0;
|
|
|
|
gets_ = 0;
|
|
|
|
seeks_ = 0;
|
|
|
|
multigets_ = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
double GetAvgLatency() const {
|
|
|
|
return cnt_ == 0 ? 0.0 : 1.0 * total_latency_ / cnt_;
|
|
|
|
}
|
|
|
|
|
|
|
|
int GetNumWrites() const { return writes_; }
|
|
|
|
|
|
|
|
int GetNumGets() const { return gets_; }
|
|
|
|
|
|
|
|
int GetNumIterSeeks() const { return seeks_; }
|
|
|
|
|
|
|
|
int GetNumMultiGets() const { return multigets_; }
|
|
|
|
|
|
|
|
private:
|
|
|
|
std::atomic<uint64_t> total_latency_{0};
|
|
|
|
std::atomic<uint32_t> cnt_{0};
|
|
|
|
std::atomic<int> writes_{0};
|
|
|
|
std::atomic<int> gets_{0};
|
|
|
|
std::atomic<int> seeks_{0};
|
|
|
|
std::atomic<int> multigets_{0};
|
|
|
|
};
|
|
|
|
|
2018-08-01 07:14:43 +00:00
|
|
|
TEST_F(DBTest2, TraceAndReplay) {
  Options options = CurrentOptions();
  options.merge_operator = MergeOperators::CreatePutOperator();
  ReadOptions ro;
  WriteOptions wo;
  TraceOptions trace_opts;
  EnvOptions env_opts;
  CreateAndReopenWithCF({"pikachu"}, options);
  Random rnd(301);
  Iterator* single_iter = nullptr;

  ASSERT_TRUE(db_->EndTrace().IsIOError());

  std::string trace_filename = dbname_ + "/rocksdb.trace";
  std::unique_ptr<TraceWriter> trace_writer;
  ASSERT_OK(NewFileTraceWriter(env_, env_opts, trace_filename, &trace_writer));
  ASSERT_OK(db_->StartTrace(trace_opts, std::move(trace_writer)));

  // 5 Writes
  ASSERT_OK(Put(0, "a", "1"));
  ASSERT_OK(Merge(0, "b", "2"));
  ASSERT_OK(Delete(0, "c"));
  ASSERT_OK(SingleDelete(0, "d"));
  ASSERT_OK(db_->DeleteRange(wo, dbfull()->DefaultColumnFamily(), "e", "f"));

  // 6th Write
  WriteBatch batch;
  ASSERT_OK(batch.Put("f", "11"));
  ASSERT_OK(batch.Merge("g", "12"));
  ASSERT_OK(batch.Delete("h"));
  ASSERT_OK(batch.SingleDelete("i"));
  ASSERT_OK(batch.DeleteRange("j", "k"));
  ASSERT_OK(db_->Write(wo, &batch));

  // 2 Seek(ForPrev)s
  single_iter = db_->NewIterator(ro);
  single_iter->Seek("f");  // Seek 1
  single_iter->SeekForPrev("g");
  ASSERT_OK(single_iter->status());
  delete single_iter;

  // 2 Gets
  ASSERT_EQ("1", Get(0, "a"));
  ASSERT_EQ("12", Get(0, "g"));

  // 7th and 8th Write, 3rd Get
  ASSERT_OK(Put(1, "foo", "bar"));
  ASSERT_OK(Put(1, "rocksdb", "rocks"));
  ASSERT_EQ("NOT_FOUND", Get(1, "leveldb"));

  // Total Write x 8, Get x 3, Seek x 2.
  ASSERT_OK(db_->EndTrace());
  // These should not get into the trace file as it is after EndTrace.
  ASSERT_OK(Put("hello", "world"));
  ASSERT_OK(Merge("foo", "bar"));

  // Open another db, replay, and verify the data
  std::string value;
  std::string dbname2 = test::PerThreadDBPath(env_, "/db_replay");
  ASSERT_OK(DestroyDB(dbname2, options));

  // Using a different name than db2, to pacify infer's use-after-lifetime
  // warnings (http://fbinfer.com).
  DB* db2_init = nullptr;
  options.create_if_missing = true;
  ASSERT_OK(DB::Open(options, dbname2, &db2_init));
  ColumnFamilyHandle* cf;
  ASSERT_OK(
      db2_init->CreateColumnFamily(ColumnFamilyOptions(), "pikachu", &cf));
  delete cf;
  delete db2_init;

  DB* db2 = nullptr;
  std::vector<ColumnFamilyDescriptor> column_families;
  ColumnFamilyOptions cf_options;
  cf_options.merge_operator = MergeOperators::CreatePutOperator();
  column_families.emplace_back("default", cf_options);
  column_families.emplace_back("pikachu", ColumnFamilyOptions());
  std::vector<ColumnFamilyHandle*> handles;
  DBOptions db_opts;
  db_opts.env = env_;
  ASSERT_OK(DB::Open(db_opts, dbname2, column_families, &handles, &db2));

  env_->SleepForMicroseconds(100);
  // Verify that the keys don't already exist
  ASSERT_TRUE(db2->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "g", &value).IsNotFound());

  std::unique_ptr<TraceReader> trace_reader;
  ASSERT_OK(NewFileTraceReader(env_, env_opts, trace_filename, &trace_reader));
  std::unique_ptr<Replayer> replayer;
  ASSERT_OK(
      db2->NewDefaultReplayer(handles, std::move(trace_reader), &replayer));

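  // The replay callback funnels every TraceRecordResult into res_handler so
  // per-operation counts can be checked after each Replay() pass.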
  TraceExecutionResultHandler res_handler;
  std::function<void(Status, std::unique_ptr<TraceRecordResult>&&)> res_cb =
      [&res_handler](Status exec_s, std::unique_ptr<TraceRecordResult>&& res) {
        ASSERT_TRUE(exec_s.ok() || exec_s.IsNotSupported());
        if (res != nullptr) {
          ASSERT_OK(res->Accept(&res_handler));
          res.reset();
        }
      };

  // Unprepared replay should fail with Status::Incomplete()
  ASSERT_TRUE(replayer->Replay(ReplayOptions(), nullptr).IsIncomplete());
  ASSERT_OK(replayer->Prepare());
  // Ok to repeatedly Prepare().
  ASSERT_OK(replayer->Prepare());
  // Replay using 1 thread, 1x speed.
  ASSERT_OK(replayer->Replay(ReplayOptions(1, 1.0), res_cb));
  ASSERT_GE(res_handler.GetAvgLatency(), 0.0);
  ASSERT_EQ(res_handler.GetNumWrites(), 8);
  ASSERT_EQ(res_handler.GetNumGets(), 3);
  ASSERT_EQ(res_handler.GetNumIterSeeks(), 2);
  ASSERT_EQ(res_handler.GetNumMultiGets(), 0);
  res_handler.Reset();

  ASSERT_OK(db2->Get(ro, handles[0], "a", &value));
  ASSERT_EQ("1", value);
  ASSERT_OK(db2->Get(ro, handles[0], "g", &value));
  ASSERT_EQ("12", value);
  ASSERT_TRUE(db2->Get(ro, handles[0], "hello", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "world", &value).IsNotFound());

  ASSERT_OK(db2->Get(ro, handles[1], "foo", &value));
  ASSERT_EQ("bar", value);
  ASSERT_OK(db2->Get(ro, handles[1], "rocksdb", &value));
  ASSERT_EQ("rocks", value);

  // Re-replay should fail with Status::Incomplete() if Prepare() was not
  // called. Currently we don't distinguish between unprepared and trace end.
  ASSERT_TRUE(replayer->Replay(ReplayOptions(), nullptr).IsIncomplete());

  // Re-replay using 2 threads, 2x speed.
  ASSERT_OK(replayer->Prepare());
  ASSERT_OK(replayer->Replay(ReplayOptions(2, 2.0), res_cb));
  ASSERT_GE(res_handler.GetAvgLatency(), 0.0);
  ASSERT_EQ(res_handler.GetNumWrites(), 8);
  ASSERT_EQ(res_handler.GetNumGets(), 3);
  ASSERT_EQ(res_handler.GetNumIterSeeks(), 2);
  ASSERT_EQ(res_handler.GetNumMultiGets(), 0);
  res_handler.Reset();

  // Re-replay using 2 threads, 1/2 speed.
  ASSERT_OK(replayer->Prepare());
  ASSERT_OK(replayer->Replay(ReplayOptions(2, 0.5), res_cb));
  ASSERT_GE(res_handler.GetAvgLatency(), 0.0);
  ASSERT_EQ(res_handler.GetNumWrites(), 8);
  ASSERT_EQ(res_handler.GetNumGets(), 3);
  ASSERT_EQ(res_handler.GetNumIterSeeks(), 2);
  ASSERT_EQ(res_handler.GetNumMultiGets(), 0);
  res_handler.Reset();

  replayer.reset();

  for (auto handle : handles) {
    delete handle;
  }
  delete db2;
  ASSERT_OK(DestroyDB(dbname2, options));
}

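// Same tracing setup as TraceAndReplay, but the trace is consumed manually via
// Replayer::Next()/Execute() instead of Replay(), and additional artificial
// TraceRecords are executed directly to cover valid and invalid inputs.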
TEST_F(DBTest2, TraceAndManualReplay) {
  Options options = CurrentOptions();
  options.merge_operator = MergeOperators::CreatePutOperator();
  ReadOptions ro;
  WriteOptions wo;
  TraceOptions trace_opts;
  EnvOptions env_opts;
  CreateAndReopenWithCF({"pikachu"}, options);
  Random rnd(301);
  Iterator* single_iter = nullptr;

  ASSERT_TRUE(db_->EndTrace().IsIOError());

  std::string trace_filename = dbname_ + "/rocksdb.trace";
  std::unique_ptr<TraceWriter> trace_writer;
  ASSERT_OK(NewFileTraceWriter(env_, env_opts, trace_filename, &trace_writer));
  ASSERT_OK(db_->StartTrace(trace_opts, std::move(trace_writer)));

  ASSERT_OK(Put(0, "a", "1"));
  ASSERT_OK(Merge(0, "b", "2"));
  ASSERT_OK(Delete(0, "c"));
  ASSERT_OK(SingleDelete(0, "d"));
  ASSERT_OK(db_->DeleteRange(wo, dbfull()->DefaultColumnFamily(), "e", "f"));

  WriteBatch batch;
  ASSERT_OK(batch.Put("f", "11"));
  ASSERT_OK(batch.Merge("g", "12"));
  ASSERT_OK(batch.Delete("h"));
  ASSERT_OK(batch.SingleDelete("i"));
  ASSERT_OK(batch.DeleteRange("j", "k"));
  ASSERT_OK(db_->Write(wo, &batch));

  single_iter = db_->NewIterator(ro);
  single_iter->Seek("f");
  single_iter->SeekForPrev("g");
  ASSERT_OK(single_iter->status());
  delete single_iter;

  // Write some sequenced keys for testing lower/upper bounds of iterator.
  batch.Clear();
  ASSERT_OK(batch.Put("iter-0", "iter-0"));
  ASSERT_OK(batch.Put("iter-1", "iter-1"));
  ASSERT_OK(batch.Put("iter-2", "iter-2"));
  ASSERT_OK(batch.Put("iter-3", "iter-3"));
  ASSERT_OK(batch.Put("iter-4", "iter-4"));
  ASSERT_OK(db_->Write(wo, &batch));

  ReadOptions bounded_ro = ro;
  Slice lower_bound("iter-1");
  Slice upper_bound("iter-3");
  bounded_ro.iterate_lower_bound = &lower_bound;
  bounded_ro.iterate_upper_bound = &upper_bound;
  single_iter = db_->NewIterator(bounded_ro);
  single_iter->Seek("iter-0");
  ASSERT_EQ(single_iter->key().ToString(), "iter-1");
  single_iter->Seek("iter-2");
  ASSERT_EQ(single_iter->key().ToString(), "iter-2");
  single_iter->Seek("iter-4");
  ASSERT_FALSE(single_iter->Valid());
  single_iter->SeekForPrev("iter-0");
  ASSERT_FALSE(single_iter->Valid());
  single_iter->SeekForPrev("iter-2");
  ASSERT_EQ(single_iter->key().ToString(), "iter-2");
  single_iter->SeekForPrev("iter-4");
  ASSERT_EQ(single_iter->key().ToString(), "iter-2");
  ASSERT_OK(single_iter->status());
  delete single_iter;

  ASSERT_EQ("1", Get(0, "a"));
  ASSERT_EQ("12", Get(0, "g"));

  ASSERT_OK(Put(1, "foo", "bar"));
  ASSERT_OK(Put(1, "rocksdb", "rocks"));
  ASSERT_EQ("NOT_FOUND", Get(1, "leveldb"));

  // Same as TraceAndReplay, Write x 8, Get x 3, Seek x 2.
  // Plus 1 WriteBatch for iterator with lower/upper bounds, and 6
  // Seek(ForPrev)s.
  // Total Write x 9, Get x 3, Seek x 8
  ASSERT_OK(db_->EndTrace());
  // These should not get into the trace file as it is after EndTrace.
  ASSERT_OK(Put("hello", "world"));
  ASSERT_OK(Merge("foo", "bar"));

  // Open another db, replay, and verify the data
  std::string value;
  std::string dbname2 = test::PerThreadDBPath(env_, "/db_replay");
  ASSERT_OK(DestroyDB(dbname2, options));

  // Using a different name than db2, to pacify infer's use-after-lifetime
  // warnings (http://fbinfer.com).
  DB* db2_init = nullptr;
  options.create_if_missing = true;
  ASSERT_OK(DB::Open(options, dbname2, &db2_init));
  ColumnFamilyHandle* cf;
  ASSERT_OK(
      db2_init->CreateColumnFamily(ColumnFamilyOptions(), "pikachu", &cf));
  delete cf;
  delete db2_init;

  DB* db2 = nullptr;
  std::vector<ColumnFamilyDescriptor> column_families;
  ColumnFamilyOptions cf_options;
  cf_options.merge_operator = MergeOperators::CreatePutOperator();
  column_families.emplace_back("default", cf_options);
  column_families.emplace_back("pikachu", ColumnFamilyOptions());
  std::vector<ColumnFamilyHandle*> handles;
  DBOptions db_opts;
  db_opts.env = env_;
  ASSERT_OK(DB::Open(db_opts, dbname2, column_families, &handles, &db2));

  env_->SleepForMicroseconds(100);
  // Verify that the keys don't already exist
  ASSERT_TRUE(db2->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "g", &value).IsNotFound());

  std::unique_ptr<TraceReader> trace_reader;
  ASSERT_OK(NewFileTraceReader(env_, env_opts, trace_filename, &trace_reader));
  std::unique_ptr<Replayer> replayer;
  ASSERT_OK(
      db2->NewDefaultReplayer(handles, std::move(trace_reader), &replayer));

  TraceExecutionResultHandler res_handler;

  // Manual replay, done twice. The 2nd pass checks that the replay can
  // restart after Prepare().
  std::unique_ptr<TraceRecord> record;
  std::unique_ptr<TraceRecordResult> result;
  for (int i = 0; i < 2; i++) {
    // Next should fail if unprepared.
    ASSERT_TRUE(replayer->Next(nullptr).IsIncomplete());
    ASSERT_OK(replayer->Prepare());
    Status s = Status::OK();
    // Looping until trace end.
    while (s.ok()) {
      s = replayer->Next(&record);
      // Skip unsupported operations.
      if (s.IsNotSupported()) {
        continue;
      }
      if (s.ok()) {
        ASSERT_OK(replayer->Execute(record, &result));
        if (result != nullptr) {
          ASSERT_OK(result->Accept(&res_handler));
          if (record->GetTraceType() == kTraceIteratorSeek ||
              record->GetTraceType() == kTraceIteratorSeekForPrev) {
            IteratorSeekQueryTraceRecord* iter_rec =
                dynamic_cast<IteratorSeekQueryTraceRecord*>(record.get());
            IteratorTraceExecutionResult* iter_res =
                dynamic_cast<IteratorTraceExecutionResult*>(result.get());
            // Check if lower/upper bounds are correctly saved and decoded.
            std::string lower_str = iter_rec->GetLowerBound().ToString();
            std::string upper_str = iter_rec->GetUpperBound().ToString();
            std::string iter_key = iter_res->GetKey().ToString();
            std::string iter_value = iter_res->GetValue().ToString();
            if (!lower_str.empty() && !upper_str.empty()) {
              ASSERT_EQ(lower_str, "iter-1");
              ASSERT_EQ(upper_str, "iter-3");
              if (iter_res->GetValid()) {
                // If iterator is valid, then lower_bound <= key < upper_bound.
                ASSERT_GE(iter_key, lower_str);
                ASSERT_LT(iter_key, upper_str);
              } else {
                // If iterator is invalid, then
                // key < lower_bound or key >= upper_bound.
                ASSERT_TRUE(iter_key < lower_str || iter_key >= upper_str);
              }
            }
            // If iterator is invalid, the key and value should be empty.
            if (!iter_res->GetValid()) {
              ASSERT_TRUE(iter_key.empty());
              ASSERT_TRUE(iter_value.empty());
            }
          }
          result.reset();
        }
      }
    }
    // Status::Incomplete() will be returned when manually reading the trace
    // end, or Prepare() was not called.
    ASSERT_TRUE(s.IsIncomplete());
    ASSERT_TRUE(replayer->Next(nullptr).IsIncomplete());
    ASSERT_GE(res_handler.GetAvgLatency(), 0.0);
    ASSERT_EQ(res_handler.GetNumWrites(), 9);
    ASSERT_EQ(res_handler.GetNumGets(), 3);
    ASSERT_EQ(res_handler.GetNumIterSeeks(), 8);
    ASSERT_EQ(res_handler.GetNumMultiGets(), 0);
    res_handler.Reset();
  }

  ASSERT_OK(db2->Get(ro, handles[0], "a", &value));
  ASSERT_EQ("1", value);
  ASSERT_OK(db2->Get(ro, handles[0], "g", &value));
  ASSERT_EQ("12", value);
  ASSERT_TRUE(db2->Get(ro, handles[0], "hello", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "world", &value).IsNotFound());

  ASSERT_OK(db2->Get(ro, handles[1], "foo", &value));
  ASSERT_EQ("bar", value);
  ASSERT_OK(db2->Get(ro, handles[1], "rocksdb", &value));
  ASSERT_EQ("rocks", value);

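  // The artificial records below are constructed directly rather than read
  // from the trace file; fake_ts supplies a monotonically increasing
  // timestamp for each of them.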
  // Test execution of artificially created TraceRecords.
  uint64_t fake_ts = 1U;
  // Write
  batch.Clear();
  ASSERT_OK(batch.Put("trace-record-write1", "write1"));
  ASSERT_OK(batch.Put("trace-record-write2", "write2"));
  record.reset(new WriteQueryTraceRecord(batch.Data(), fake_ts++));
  ASSERT_OK(replayer->Execute(record, &result));
  ASSERT_TRUE(result != nullptr);
  ASSERT_OK(result->Accept(&res_handler));  // Write x 1
  ASSERT_OK(db2->Get(ro, handles[0], "trace-record-write1", &value));
  ASSERT_EQ("write1", value);
  ASSERT_OK(db2->Get(ro, handles[0], "trace-record-write2", &value));
  ASSERT_EQ("write2", value);
  ASSERT_GE(res_handler.GetAvgLatency(), 0.0);
  ASSERT_EQ(res_handler.GetNumWrites(), 1);
  ASSERT_EQ(res_handler.GetNumGets(), 0);
  ASSERT_EQ(res_handler.GetNumIterSeeks(), 0);
  ASSERT_EQ(res_handler.GetNumMultiGets(), 0);
  res_handler.Reset();

  // Get related
  // Get an existing key.
  record.reset(new GetQueryTraceRecord(handles[0]->GetID(),
                                       "trace-record-write1", fake_ts++));
  ASSERT_OK(replayer->Execute(record, &result));
  ASSERT_TRUE(result != nullptr);
  ASSERT_OK(result->Accept(&res_handler));  // Get x 1
  // Get a non-existing key, should still return Status::OK().
  record.reset(new GetQueryTraceRecord(handles[0]->GetID(), "trace-record-get",
                                       fake_ts++));
  ASSERT_OK(replayer->Execute(record, &result));
  ASSERT_TRUE(result != nullptr);
  ASSERT_OK(result->Accept(&res_handler));  // Get x 2
  // Get from an invalid (non-existing) cf_id.
  uint32_t invalid_cf_id = handles[1]->GetID() + 1;
  record.reset(new GetQueryTraceRecord(invalid_cf_id, "whatever", fake_ts++));
  ASSERT_TRUE(replayer->Execute(record, &result).IsCorruption());
  ASSERT_TRUE(result == nullptr);
  ASSERT_GE(res_handler.GetAvgLatency(), 0.0);
  ASSERT_EQ(res_handler.GetNumWrites(), 0);
  ASSERT_EQ(res_handler.GetNumGets(), 2);
  ASSERT_EQ(res_handler.GetNumIterSeeks(), 0);
  ASSERT_EQ(res_handler.GetNumMultiGets(), 0);
  res_handler.Reset();

  // Iteration related
  for (IteratorSeekQueryTraceRecord::SeekType seekType :
       {IteratorSeekQueryTraceRecord::kSeek,
        IteratorSeekQueryTraceRecord::kSeekForPrev}) {
    // Seek to an existing key.
    record.reset(new IteratorSeekQueryTraceRecord(
        seekType, handles[0]->GetID(), "trace-record-write1", fake_ts++));
    ASSERT_OK(replayer->Execute(record, &result));
    ASSERT_TRUE(result != nullptr);
    ASSERT_OK(result->Accept(&res_handler));  // Seek x 1 in one iteration
    // Seek to a non-existing key, should still return Status::OK().
    record.reset(new IteratorSeekQueryTraceRecord(
        seekType, handles[0]->GetID(), "trace-record-get", fake_ts++));
    ASSERT_OK(replayer->Execute(record, &result));
    ASSERT_TRUE(result != nullptr);
    ASSERT_OK(result->Accept(&res_handler));  // Seek x 2 in one iteration
    // Seek from an invalid cf_id.
    record.reset(new IteratorSeekQueryTraceRecord(seekType, invalid_cf_id,
                                                  "whatever", fake_ts++));
    ASSERT_TRUE(replayer->Execute(record, &result).IsCorruption());
    ASSERT_TRUE(result == nullptr);
  }
  ASSERT_GE(res_handler.GetAvgLatency(), 0.0);
  ASSERT_EQ(res_handler.GetNumWrites(), 0);
  ASSERT_EQ(res_handler.GetNumGets(), 0);
  ASSERT_EQ(res_handler.GetNumIterSeeks(), 4);  // Seek x 2 in two iterations
  ASSERT_EQ(res_handler.GetNumMultiGets(), 0);
  res_handler.Reset();

  // MultiGet related
  // Get existing keys.
  record.reset(new MultiGetQueryTraceRecord(
      std::vector<uint32_t>({handles[0]->GetID(), handles[1]->GetID()}),
      std::vector<std::string>({"a", "foo"}), fake_ts++));
  ASSERT_OK(replayer->Execute(record, &result));
  ASSERT_TRUE(result != nullptr);
  ASSERT_OK(result->Accept(&res_handler));  // MultiGet x 1
  // Get all non-existing keys, should still return Status::OK().
  record.reset(new MultiGetQueryTraceRecord(
      std::vector<uint32_t>({handles[0]->GetID(), handles[1]->GetID()}),
      std::vector<std::string>({"no1", "no2"}), fake_ts++));
  ASSERT_OK(replayer->Execute(record, &result));
  ASSERT_TRUE(result != nullptr);
  ASSERT_OK(result->Accept(&res_handler));  // MultiGet x 2
  // Get a mix of existing and non-existing keys, should still return
  // Status::OK().
  record.reset(new MultiGetQueryTraceRecord(
      std::vector<uint32_t>({handles[0]->GetID(), handles[1]->GetID()}),
      std::vector<std::string>({"a", "no2"}), fake_ts++));
  ASSERT_OK(replayer->Execute(record, &result));
  ASSERT_TRUE(result != nullptr);
  MultiValuesTraceExecutionResult* mvr =
      dynamic_cast<MultiValuesTraceExecutionResult*>(result.get());
  ASSERT_TRUE(mvr != nullptr);
  ASSERT_OK(mvr->GetMultiStatus()[0]);
  ASSERT_TRUE(mvr->GetMultiStatus()[1].IsNotFound());
  ASSERT_EQ(mvr->GetValues()[0], "1");
  ASSERT_EQ(mvr->GetValues()[1], "");
  ASSERT_OK(result->Accept(&res_handler));  // MultiGet x 3
  // Get from an invalid (non-existing) cf_id.
  record.reset(new MultiGetQueryTraceRecord(
      std::vector<uint32_t>(
          {handles[0]->GetID(), handles[1]->GetID(), invalid_cf_id}),
      std::vector<std::string>({"a", "foo", "whatever"}), fake_ts++));
  ASSERT_TRUE(replayer->Execute(record, &result).IsCorruption());
  ASSERT_TRUE(result == nullptr);
  // Empty MultiGet
  record.reset(new MultiGetQueryTraceRecord(
      std::vector<uint32_t>(), std::vector<std::string>(), fake_ts++));
  ASSERT_TRUE(replayer->Execute(record, &result).IsInvalidArgument());
  ASSERT_TRUE(result == nullptr);
  // MultiGet size mismatch
  record.reset(new MultiGetQueryTraceRecord(
      std::vector<uint32_t>({handles[0]->GetID(), handles[1]->GetID()}),
      std::vector<std::string>({"a"}), fake_ts++));
  ASSERT_TRUE(replayer->Execute(record, &result).IsInvalidArgument());
  ASSERT_TRUE(result == nullptr);
  ASSERT_GE(res_handler.GetAvgLatency(), 0.0);
  ASSERT_EQ(res_handler.GetNumWrites(), 0);
  ASSERT_EQ(res_handler.GetNumGets(), 0);
  ASSERT_EQ(res_handler.GetNumIterSeeks(), 0);
  ASSERT_EQ(res_handler.GetNumMultiGets(), 3);
  res_handler.Reset();

  replayer.reset();

  for (auto handle : handles) {
    delete handle;
  }
  delete db2;
  ASSERT_OK(DestroyDB(dbname2, options));
}

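// Verifies that trace_opts.max_trace_file_size caps the trace output: with a
// tiny limit the writes are not recorded, so replaying restores nothing.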
TEST_F(DBTest2, TraceWithLimit) {
  Options options = CurrentOptions();
  options.merge_operator = MergeOperators::CreatePutOperator();
  ReadOptions ro;
  WriteOptions wo;
  TraceOptions trace_opts;
  EnvOptions env_opts;
  CreateAndReopenWithCF({"pikachu"}, options);
  Random rnd(301);

  // Test the max trace file size option.
  trace_opts.max_trace_file_size = 5;
  std::string trace_filename = dbname_ + "/rocksdb.trace1";
  std::unique_ptr<TraceWriter> trace_writer;
  ASSERT_OK(NewFileTraceWriter(env_, env_opts, trace_filename, &trace_writer));
  ASSERT_OK(db_->StartTrace(trace_opts, std::move(trace_writer)));
  ASSERT_OK(Put(0, "a", "1"));
  ASSERT_OK(Put(0, "b", "1"));
  ASSERT_OK(Put(0, "c", "1"));
  ASSERT_OK(db_->EndTrace());

  std::string dbname2 = test::PerThreadDBPath(env_, "/db_replay2");
  std::string value;
  ASSERT_OK(DestroyDB(dbname2, options));

  // Using a different name than db2, to pacify infer's use-after-lifetime
  // warnings (http://fbinfer.com).
  DB* db2_init = nullptr;
  options.create_if_missing = true;
  ASSERT_OK(DB::Open(options, dbname2, &db2_init));
  ColumnFamilyHandle* cf;
  ASSERT_OK(
      db2_init->CreateColumnFamily(ColumnFamilyOptions(), "pikachu", &cf));
  delete cf;
  delete db2_init;

  DB* db2 = nullptr;
  std::vector<ColumnFamilyDescriptor> column_families;
  ColumnFamilyOptions cf_options;
  cf_options.merge_operator = MergeOperators::CreatePutOperator();
  column_families.emplace_back("default", cf_options);
  column_families.emplace_back("pikachu", ColumnFamilyOptions());
  std::vector<ColumnFamilyHandle*> handles;
  DBOptions db_opts;
  db_opts.env = env_;
  ASSERT_OK(DB::Open(db_opts, dbname2, column_families, &handles, &db2));

  env_->SleepForMicroseconds(100);
  // Verify that the keys don't already exist
  ASSERT_TRUE(db2->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "b", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "c", &value).IsNotFound());

  std::unique_ptr<TraceReader> trace_reader;
  ASSERT_OK(NewFileTraceReader(env_, env_opts, trace_filename, &trace_reader));
  std::unique_ptr<Replayer> replayer;
  ASSERT_OK(
      db2->NewDefaultReplayer(handles, std::move(trace_reader), &replayer));
  ASSERT_OK(replayer->Prepare());
  ASSERT_OK(replayer->Replay(ReplayOptions(), nullptr));
  replayer.reset();

  ASSERT_TRUE(db2->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "b", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "c", &value).IsNotFound());

  for (auto handle : handles) {
    delete handle;
  }
  delete db2;
  ASSERT_OK(DestroyDB(dbname2, options));
}

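// With sampling_frequency = 2 only every other operation is recorded, so after
// replay exactly half of the traced keys ("b" and "d") should be present.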
TEST_F(DBTest2, TraceWithSampling) {
  Options options = CurrentOptions();
  ReadOptions ro;
  WriteOptions wo;
  TraceOptions trace_opts;
  EnvOptions env_opts;
  CreateAndReopenWithCF({"pikachu"}, options);
  Random rnd(301);

  // Test the trace file sampling option.
  trace_opts.sampling_frequency = 2;
  std::string trace_filename = dbname_ + "/rocksdb.trace_sampling";
  std::unique_ptr<TraceWriter> trace_writer;
  ASSERT_OK(NewFileTraceWriter(env_, env_opts, trace_filename, &trace_writer));
  ASSERT_OK(db_->StartTrace(trace_opts, std::move(trace_writer)));
  ASSERT_OK(Put(0, "a", "1"));
  ASSERT_OK(Put(0, "b", "2"));
  ASSERT_OK(Put(0, "c", "3"));
  ASSERT_OK(Put(0, "d", "4"));
  ASSERT_OK(Put(0, "e", "5"));
  ASSERT_OK(db_->EndTrace());

  std::string dbname2 = test::PerThreadDBPath(env_, "/db_replay_sampling");
  std::string value;
  ASSERT_OK(DestroyDB(dbname2, options));

  // Using a different name than db2, to pacify infer's use-after-lifetime
  // warnings (http://fbinfer.com).
  DB* db2_init = nullptr;
  options.create_if_missing = true;
  ASSERT_OK(DB::Open(options, dbname2, &db2_init));
  ColumnFamilyHandle* cf;
  ASSERT_OK(
      db2_init->CreateColumnFamily(ColumnFamilyOptions(), "pikachu", &cf));
  delete cf;
  delete db2_init;

  DB* db2 = nullptr;
  std::vector<ColumnFamilyDescriptor> column_families;
  ColumnFamilyOptions cf_options;
  column_families.emplace_back("default", cf_options);
  column_families.emplace_back("pikachu", ColumnFamilyOptions());
  std::vector<ColumnFamilyHandle*> handles;
  DBOptions db_opts;
  db_opts.env = env_;
  ASSERT_OK(DB::Open(db_opts, dbname2, column_families, &handles, &db2));

  env_->SleepForMicroseconds(100);
  ASSERT_TRUE(db2->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "b", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "c", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "d", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "e", &value).IsNotFound());

  std::unique_ptr<TraceReader> trace_reader;
  ASSERT_OK(NewFileTraceReader(env_, env_opts, trace_filename, &trace_reader));
  std::unique_ptr<Replayer> replayer;
  ASSERT_OK(
      db2->NewDefaultReplayer(handles, std::move(trace_reader), &replayer));
  ASSERT_OK(replayer->Prepare());
  ASSERT_OK(replayer->Replay(ReplayOptions(), nullptr));
  replayer.reset();

  ASSERT_TRUE(db2->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_FALSE(db2->Get(ro, handles[0], "b", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "c", &value).IsNotFound());
  ASSERT_FALSE(db2->Get(ro, handles[0], "d", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "e", &value).IsNotFound());

  for (auto handle : handles) {
    delete handle;
  }
  delete db2;
  ASSERT_OK(DestroyDB(dbname2, options));
}

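// Exercises TraceOptions::filter: first kTraceFilterWrite drops all write ops
// from the trace, then kTraceFilterGet drops the reads while keeping writes.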
TEST_F(DBTest2, TraceWithFilter) {
  Options options = CurrentOptions();
  options.merge_operator = MergeOperators::CreatePutOperator();
  ReadOptions ro;
  WriteOptions wo;
  TraceOptions trace_opts;
  EnvOptions env_opts;
  CreateAndReopenWithCF({"pikachu"}, options);
  Random rnd(301);
  Iterator* single_iter = nullptr;

  trace_opts.filter = TraceFilterType::kTraceFilterWrite;

  std::string trace_filename = dbname_ + "/rocksdb.trace";
  std::unique_ptr<TraceWriter> trace_writer;
  ASSERT_OK(NewFileTraceWriter(env_, env_opts, trace_filename, &trace_writer));
  ASSERT_OK(db_->StartTrace(trace_opts, std::move(trace_writer)));

  ASSERT_OK(Put(0, "a", "1"));
  ASSERT_OK(Merge(0, "b", "2"));
  ASSERT_OK(Delete(0, "c"));
  ASSERT_OK(SingleDelete(0, "d"));
  ASSERT_OK(db_->DeleteRange(wo, dbfull()->DefaultColumnFamily(), "e", "f"));

  WriteBatch batch;
  ASSERT_OK(batch.Put("f", "11"));
  ASSERT_OK(batch.Merge("g", "12"));
  ASSERT_OK(batch.Delete("h"));
  ASSERT_OK(batch.SingleDelete("i"));
  ASSERT_OK(batch.DeleteRange("j", "k"));
  ASSERT_OK(db_->Write(wo, &batch));

  single_iter = db_->NewIterator(ro);
  single_iter->Seek("f");
  single_iter->SeekForPrev("g");
  delete single_iter;

  ASSERT_EQ("1", Get(0, "a"));
  ASSERT_EQ("12", Get(0, "g"));

  ASSERT_OK(Put(1, "foo", "bar"));
  ASSERT_OK(Put(1, "rocksdb", "rocks"));
  ASSERT_EQ("NOT_FOUND", Get(1, "leveldb"));

  ASSERT_OK(db_->EndTrace());
  // These should not get into the trace file as it is after EndTrace.
  ASSERT_OK(Put("hello", "world"));
  ASSERT_OK(Merge("foo", "bar"));

  // Open another db, replay, and verify the data
  std::string value;
  std::string dbname2 = test::PerThreadDBPath(env_, "db_replay");
  ASSERT_OK(DestroyDB(dbname2, options));

  // Using a different name than db2, to pacify infer's use-after-lifetime
  // warnings (http://fbinfer.com).
  DB* db2_init = nullptr;
  options.create_if_missing = true;
  ASSERT_OK(DB::Open(options, dbname2, &db2_init));
  ColumnFamilyHandle* cf;
  ASSERT_OK(
      db2_init->CreateColumnFamily(ColumnFamilyOptions(), "pikachu", &cf));
  delete cf;
  delete db2_init;

  DB* db2 = nullptr;
  std::vector<ColumnFamilyDescriptor> column_families;
  ColumnFamilyOptions cf_options;
  cf_options.merge_operator = MergeOperators::CreatePutOperator();
  column_families.emplace_back("default", cf_options);
  column_families.emplace_back("pikachu", ColumnFamilyOptions());
  std::vector<ColumnFamilyHandle*> handles;
  DBOptions db_opts;
  db_opts.env = env_;
  ASSERT_OK(DB::Open(db_opts, dbname2, column_families, &handles, &db2));

  env_->SleepForMicroseconds(100);
  // Verify that the keys don't already exist
  ASSERT_TRUE(db2->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "g", &value).IsNotFound());

  std::unique_ptr<TraceReader> trace_reader;
  ASSERT_OK(NewFileTraceReader(env_, env_opts, trace_filename, &trace_reader));
  std::unique_ptr<Replayer> replayer;
  ASSERT_OK(
      db2->NewDefaultReplayer(handles, std::move(trace_reader), &replayer));
  ASSERT_OK(replayer->Prepare());
  ASSERT_OK(replayer->Replay(ReplayOptions(), nullptr));
  replayer.reset();

  // None of the key-values should be present since we filter out the WRITE
  // ops.
  ASSERT_TRUE(db2->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "g", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "hello", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "world", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "foo", &value).IsNotFound());
  ASSERT_TRUE(db2->Get(ro, handles[0], "rocksdb", &value).IsNotFound());

  for (auto handle : handles) {
    delete handle;
  }
  delete db2;
  ASSERT_OK(DestroyDB(dbname2, options));

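  // Second phase: trace with kTraceFilterGet so reads are dropped instead,
  // then count the records left in the trace file.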
  // Set up a new db.
  std::string dbname3 = test::PerThreadDBPath(env_, "db_not_trace_read");
  ASSERT_OK(DestroyDB(dbname3, options));

  DB* db3_init = nullptr;
  options.create_if_missing = true;
  ColumnFamilyHandle* cf3;
  ASSERT_OK(DB::Open(options, dbname3, &db3_init));
  ASSERT_OK(
      db3_init->CreateColumnFamily(ColumnFamilyOptions(), "pikachu", &cf3));
  delete cf3;
  delete db3_init;

  column_families.clear();
  column_families.emplace_back("default", cf_options);
  column_families.emplace_back("pikachu", ColumnFamilyOptions());
  handles.clear();

  DB* db3 = nullptr;
  ASSERT_OK(DB::Open(db_opts, dbname3, column_families, &handles, &db3));

  env_->SleepForMicroseconds(100);
  // Verify that the keys don't already exist
  ASSERT_TRUE(db3->Get(ro, handles[0], "a", &value).IsNotFound());
  ASSERT_TRUE(db3->Get(ro, handles[0], "g", &value).IsNotFound());

  // The tracer will not record the READ ops.
  trace_opts.filter = TraceFilterType::kTraceFilterGet;
  std::string trace_filename3 = dbname_ + "/rocksdb.trace_3";
  std::unique_ptr<TraceWriter> trace_writer3;
  ASSERT_OK(
      NewFileTraceWriter(env_, env_opts, trace_filename3, &trace_writer3));
  ASSERT_OK(db3->StartTrace(trace_opts, std::move(trace_writer3)));

  ASSERT_OK(db3->Put(wo, handles[0], "a", "1"));
  ASSERT_OK(db3->Merge(wo, handles[0], "b", "2"));
  ASSERT_OK(db3->Delete(wo, handles[0], "c"));
  ASSERT_OK(db3->SingleDelete(wo, handles[0], "d"));

  ASSERT_OK(db3->Get(ro, handles[0], "a", &value));
  ASSERT_EQ(value, "1");
  ASSERT_TRUE(db3->Get(ro, handles[0], "c", &value).IsNotFound());

  ASSERT_OK(db3->EndTrace());

  for (auto handle : handles) {
    delete handle;
  }
  delete db3;
  ASSERT_OK(DestroyDB(dbname3, options));

  std::unique_ptr<TraceReader> trace_reader3;
  ASSERT_OK(
      NewFileTraceReader(env_, env_opts, trace_filename3, &trace_reader3));

  // Count the number of records in the trace file.
  int count = 0;
  std::string data;
  Status s;
  while (true) {
    s = trace_reader3->Read(&data);
    if (!s.ok()) {
      break;
    }
    count += 1;
  }
  // We also need to count the header and footer:
  // 4 WRITE + HEADER + FOOTER = 6
  ASSERT_EQ(count, 6);
}

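// With mmap reads the block memory is owned by the mapped file rather than the
// block cache, so a PinnableSlice returned by Get() must be copied instead of
// pinned; otherwise the mapping could vanish when compaction deletes the file.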
TEST_F(DBTest2, PinnableSliceAndMmapReads) {
|
|
|
|
Options options = CurrentOptions();
|
Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 17:31:34 +00:00
|
|
|
options.env = env_;
|
2020-11-03 03:47:00 +00:00
|
|
|
if (!IsMemoryMappedAccessSupported()) {
|
Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 17:31:34 +00:00
|
|
|
ROCKSDB_GTEST_SKIP("Test requires default environment");
|
|
|
|
return;
|
|
|
|
}
|
Copy Get() result when file reads use mmap
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881
Differential Revision: D8076288
Pulled By: ajkr
fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
2018-06-01 23:46:32 +00:00
|
|
|
options.allow_mmap_reads = true;
|
2018-06-28 00:09:29 +00:00
|
|
|
options.max_open_files = 100;
|
|
|
|
options.compression = kNoCompression;
|
2018-06-01 23:46:32 +00:00
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
PinnableSlice pinned_value;
|
|
|
|
ASSERT_EQ(Get("foo", &pinned_value), Status::OK());
|
2018-06-28 00:09:29 +00:00
|
|
|
// It is not safe to pin mmap files as they might be deleted by compaction
|
|
|
|
ASSERT_FALSE(pinned_value.IsPinned());
|
2018-06-01 23:46:32 +00:00
|
|
|
ASSERT_EQ(pinned_value.ToString(), "bar");
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_CompactRange(
|
|
|
|
0 /* level */, nullptr /* begin */, nullptr /* end */,
|
|
|
|
nullptr /* column_family */, true /* disallow_trivial_move */));
|
2018-06-01 23:46:32 +00:00
|
|
|
|
|
|
|
// Ensure pinned_value doesn't rely on memory munmap'd by the above
|
2018-06-28 00:09:29 +00:00
|
|
|
// compaction. It crashes if it does.
|
|
|
|
ASSERT_EQ(pinned_value.ToString(), "bar");
|
|
|
|
|
|
|
|
pinned_value.Reset();
|
|
|
|
// Unsafe to pin mmap files when they could be kicked out of table cache
|
|
|
|
Close();
|
2018-06-28 15:31:19 +00:00
|
|
|
ASSERT_OK(ReadOnlyReopen(options));
|
2018-06-28 00:09:29 +00:00
|
|
|
ASSERT_EQ(Get("foo", &pinned_value), Status::OK());
|
|
|
|
ASSERT_FALSE(pinned_value.IsPinned());
|
|
|
|
ASSERT_EQ(pinned_value.ToString(), "bar");
|
|
|
|
|
|
|
|
pinned_value.Reset();
|
|
|
|
// In read-only mode with infinite table cache capacity, it should pin the
|
|
|
|
// value and avoid the memcpy
|
|
|
|
Close();
|
|
|
|
options.max_open_files = -1;
|
2018-06-28 15:31:19 +00:00
|
|
|
ASSERT_OK(ReadOnlyReopen(options));
|
2018-06-28 00:09:29 +00:00
|
|
|
ASSERT_EQ(Get("foo", &pinned_value), Status::OK());
|
|
|
|
ASSERT_TRUE(pinned_value.IsPinned());
|
2018-06-01 23:46:32 +00:00
|
|
|
ASSERT_EQ(pinned_value.ToString(), "bar");
|
|
|
|
}
|
|
|
|
|
2018-09-15 07:05:08 +00:00
|
|
|
TEST_F(DBTest2, DISABLED_IteratorPinnedMemory) {
|
2018-08-14 00:31:58 +00:00
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.create_if_missing = true;
|
2020-02-20 20:07:53 +00:00
|
|
|
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
|
2018-08-14 00:31:58 +00:00
|
|
|
BlockBasedTableOptions bbto;
|
|
|
|
bbto.no_block_cache = false;
|
|
|
|
bbto.cache_index_and_filter_blocks = false;
|
|
|
|
bbto.block_cache = NewLRUCache(100000);
|
|
|
|
bbto.block_size = 400; // small block size
|
Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
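For illustration, the kind of environment substitution these runs exercise can be sketched with the public NewMemEnv() helper (the function and variable names here are illustrative, not part of the test suite):
```
#include <memory>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/env.h"

// Open a DB on an in-memory Env instead of Env::Default(), which is what
// tests must do consistently for MEM_ENV runs to work.
ROCKSDB_NAMESPACE::Status OpenOnMemEnv(
    const std::string& name,
    std::unique_ptr<ROCKSDB_NAMESPACE::Env>* mem_env,
    ROCKSDB_NAMESPACE::DB** db) {
  mem_env->reset(
      ROCKSDB_NAMESPACE::NewMemEnv(ROCKSDB_NAMESPACE::Env::Default()));
  ROCKSDB_NAMESPACE::Options options;
  options.create_if_missing = true;
  options.env = mem_env->get();  // use this Env everywhere, never Env::Default()
  return ROCKSDB_NAMESPACE::DB::Open(options, name, db);
}
```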
2020-10-27 17:31:34 +00:00
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
|
2018-08-14 00:31:58 +00:00
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
Random rnd(301);
|
2020-07-09 21:33:42 +00:00
|
|
|
std::string v = rnd.RandomString(400);
|
2018-08-14 00:31:58 +00:00
|
|
|
|
|
|
|
// Since v is the size of a block, each key should take a block
|
|
|
|
// of 400+ bytes.
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("1", v));
|
|
|
|
ASSERT_OK(Put("3", v));
|
|
|
|
ASSERT_OK(Put("5", v));
|
|
|
|
ASSERT_OK(Put("7", v));
|
2018-08-14 00:31:58 +00:00
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_EQ(0, bbto.block_cache->GetPinnedUsage());
|
|
|
|
|
|
|
|
// Verify that iterators don't pin more than one data block in block cache
|
|
|
|
// at a time.
|
|
|
|
{
|
2018-11-09 19:17:34 +00:00
|
|
|
std::unique_ptr<Iterator> iter(db_->NewIterator(ReadOptions()));
|
2018-08-14 00:31:58 +00:00
|
|
|
iter->SeekToFirst();
|
|
|
|
|
|
|
|
for (int i = 0; i < 4; i++) {
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
// Block cache should contain exactly one block.
|
|
|
|
ASSERT_GT(bbto.block_cache->GetPinnedUsage(), 0);
|
|
|
|
ASSERT_LT(bbto.block_cache->GetPinnedUsage(), 800);
|
|
|
|
iter->Next();
|
|
|
|
}
|
|
|
|
ASSERT_FALSE(iter->Valid());
|
|
|
|
|
|
|
|
iter->Seek("4");
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
|
|
|
|
ASSERT_GT(bbto.block_cache->GetPinnedUsage(), 0);
|
|
|
|
ASSERT_LT(bbto.block_cache->GetPinnedUsage(), 800);
|
|
|
|
|
|
|
|
iter->Seek("3");
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iter->status());
|
|
|
|
|
2018-08-14 00:31:58 +00:00
|
|
|
ASSERT_GT(bbto.block_cache->GetPinnedUsage(), 0);
|
|
|
|
ASSERT_LT(bbto.block_cache->GetPinnedUsage(), 800);
|
|
|
|
}
|
|
|
|
ASSERT_EQ(0, bbto.block_cache->GetPinnedUsage());
|
|
|
|
|
|
|
|
// Test compaction case
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("2", v));
|
|
|
|
ASSERT_OK(Put("5", v));
|
|
|
|
ASSERT_OK(Put("6", v));
|
|
|
|
ASSERT_OK(Put("8", v));
|
2018-08-14 00:31:58 +00:00
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
// Clear existing data in block cache
|
|
|
|
bbto.block_cache->SetCapacity(0);
|
|
|
|
bbto.block_cache->SetCapacity(100000);
|
|
|
|
|
|
|
|
// Verify compaction input iterators don't hold more than one data block at
|
|
|
|
// one time.
|
|
|
|
std::atomic<bool> finished(false);
|
|
|
|
std::atomic<int> block_newed(0);
|
|
|
|
std::atomic<int> block_destroyed(0);
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2018-08-14 00:31:58 +00:00
|
|
|
"Block::Block:0", [&](void* /*arg*/) {
|
|
|
|
if (finished) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
// Two iterators. At most 2 outstanding blocks.
|
|
|
|
EXPECT_GE(block_newed.load(), block_destroyed.load());
|
|
|
|
EXPECT_LE(block_newed.load(), block_destroyed.load() + 1);
|
|
|
|
block_newed.fetch_add(1);
|
|
|
|
});
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2018-08-14 00:31:58 +00:00
|
|
|
"Block::~Block", [&](void* /*arg*/) {
|
|
|
|
if (finished) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
// Two iterators. At most 2 outstanding blocks.
|
|
|
|
EXPECT_GE(block_newed.load(), block_destroyed.load() + 1);
|
|
|
|
EXPECT_LE(block_newed.load(), block_destroyed.load() + 2);
|
|
|
|
block_destroyed.fetch_add(1);
|
|
|
|
});
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2018-08-14 00:31:58 +00:00
|
|
|
"CompactionJob::Run:BeforeVerify",
|
|
|
|
[&](void* /*arg*/) { finished = true; });
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2018-08-14 00:31:58 +00:00
|
|
|
|
|
|
|
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
|
|
|
|
|
|
|
|
// Two input files. Each of them has 4 data blocks.
|
|
|
|
ASSERT_EQ(8, block_newed.load());
|
|
|
|
ASSERT_EQ(8, block_destroyed.load());
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2018-08-14 00:31:58 +00:00
|
|
|
}
|
|
|
|
|
2018-10-08 21:22:06 +00:00
|
|
|
TEST_F(DBTest2, TestGetColumnFamilyHandleUnlocked) {
|
|
|
|
// Setup sync point dependency to reproduce the race condition of
|
|
|
|
// DBImpl::GetColumnFamilyHandleUnlocked
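// SyncPoint::LoadDependency() takes {predecessor, successor} pairs: each
// TEST_SYNC_POINT(successor) call blocks until the corresponding
// TEST_SYNC_POINT(predecessor) has executed, which forces a deterministic
// interleaving of the two threads below.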
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency({
|
|
|
|
{"TestGetColumnFamilyHandleUnlocked::GetColumnFamilyHandleUnlocked1",
|
|
|
|
"TestGetColumnFamilyHandleUnlocked::PreGetColumnFamilyHandleUnlocked2"},
|
|
|
|
{"TestGetColumnFamilyHandleUnlocked::GetColumnFamilyHandleUnlocked2",
|
|
|
|
"TestGetColumnFamilyHandleUnlocked::ReadColumnFamilyHandle1"},
|
|
|
|
});
|
2018-10-08 21:22:06 +00:00
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
CreateColumnFamilies({"test1", "test2"}, Options());
|
|
|
|
ASSERT_EQ(handles_.size(), 2);
|
|
|
|
|
2020-07-03 02:24:25 +00:00
|
|
|
DBImpl* dbi = static_cast_with_check<DBImpl>(db_);
|
2018-10-08 21:22:06 +00:00
|
|
|
port::Thread user_thread1([&]() {
|
|
|
|
auto cfh = dbi->GetColumnFamilyHandleUnlocked(handles_[0]->GetID());
|
|
|
|
ASSERT_EQ(cfh->GetID(), handles_[0]->GetID());
|
2022-11-02 21:34:24 +00:00
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"TestGetColumnFamilyHandleUnlocked::GetColumnFamilyHandleUnlocked1");
|
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"TestGetColumnFamilyHandleUnlocked::ReadColumnFamilyHandle1");
|
2018-10-08 21:22:06 +00:00
|
|
|
ASSERT_EQ(cfh->GetID(), handles_[0]->GetID());
|
|
|
|
});
|
|
|
|
|
|
|
|
port::Thread user_thread2([&]() {
|
2022-11-02 21:34:24 +00:00
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"TestGetColumnFamilyHandleUnlocked::PreGetColumnFamilyHandleUnlocked2");
|
2018-10-08 21:22:06 +00:00
|
|
|
auto cfh = dbi->GetColumnFamilyHandleUnlocked(handles_[1]->GetID());
|
|
|
|
ASSERT_EQ(cfh->GetID(), handles_[1]->GetID());
|
2022-11-02 21:34:24 +00:00
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"TestGetColumnFamilyHandleUnlocked::GetColumnFamilyHandleUnlocked2");
|
2018-10-08 21:22:06 +00:00
|
|
|
ASSERT_EQ(cfh->GetID(), handles_[1]->GetID());
|
|
|
|
});
|
|
|
|
|
|
|
|
user_thread1.join();
|
|
|
|
user_thread2.join();
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
|
2018-10-08 21:22:06 +00:00
|
|
|
}
|
|
|
|
|
2018-11-12 22:30:21 +00:00
|
|
|
TEST_F(DBTest2, TestCompactFiles) {
|
|
|
|
// Setup sync point dependency to reproduce the race between CompactFiles()
|
|
|
|
// and a concurrent IngestExternalFile()
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency({
|
2018-11-12 22:30:21 +00:00
|
|
|
{"TestCompactFiles::IngestExternalFile1",
|
|
|
|
"TestCompactFiles::IngestExternalFile2"},
|
|
|
|
});
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
Options options;
|
2020-10-27 17:31:34 +00:00
|
|
|
options.env = env_;
|
2018-11-12 22:30:21 +00:00
|
|
|
options.num_levels = 2;
|
|
|
|
options.disable_auto_compactions = true;
|
|
|
|
Reopen(options);
|
|
|
|
auto* handle = db_->DefaultColumnFamily();
|
|
|
|
ASSERT_EQ(db_->NumberLevels(handle), 2);
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SstFileWriter sst_file_writer{
|
|
|
|
ROCKSDB_NAMESPACE::EnvOptions(), options};
|
2018-11-12 22:30:21 +00:00
|
|
|
std::string external_file1 = dbname_ + "/test_compact_files1.sst_t";
|
|
|
|
std::string external_file2 = dbname_ + "/test_compact_files2.sst_t";
|
|
|
|
std::string external_file3 = dbname_ + "/test_compact_files3.sst_t";
|
|
|
|
|
|
|
|
ASSERT_OK(sst_file_writer.Open(external_file1));
|
|
|
|
ASSERT_OK(sst_file_writer.Put("1", "1"));
|
|
|
|
ASSERT_OK(sst_file_writer.Put("2", "2"));
|
|
|
|
ASSERT_OK(sst_file_writer.Finish());
|
|
|
|
|
|
|
|
ASSERT_OK(sst_file_writer.Open(external_file2));
|
|
|
|
ASSERT_OK(sst_file_writer.Put("3", "3"));
|
|
|
|
ASSERT_OK(sst_file_writer.Put("4", "4"));
|
|
|
|
ASSERT_OK(sst_file_writer.Finish());
|
|
|
|
|
|
|
|
ASSERT_OK(sst_file_writer.Open(external_file3));
|
|
|
|
ASSERT_OK(sst_file_writer.Put("5", "5"));
|
|
|
|
ASSERT_OK(sst_file_writer.Put("6", "6"));
|
|
|
|
ASSERT_OK(sst_file_writer.Finish());
|
|
|
|
|
|
|
|
ASSERT_OK(db_->IngestExternalFile(handle, {external_file1, external_file3},
|
|
|
|
IngestExternalFileOptions()));
|
|
|
|
ASSERT_EQ(NumTableFilesAtLevel(1, 0), 2);
|
|
|
|
std::vector<std::string> files;
|
|
|
|
GetSstFiles(env_, dbname_, &files);
|
|
|
|
ASSERT_EQ(files.size(), 2);
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
Status user_thread1_status;
|
|
|
|
port::Thread user_thread1([&]() {
|
|
|
|
user_thread1_status =
|
|
|
|
db_->CompactFiles(CompactionOptions(), handle, files, 1);
|
|
|
|
});
|
2018-11-12 22:30:21 +00:00
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
Status user_thread2_status;
|
2018-11-12 22:30:21 +00:00
|
|
|
port::Thread user_thread2([&]() {
|
2021-08-16 15:09:46 +00:00
|
|
|
user_thread2_status = db_->IngestExternalFile(handle, {external_file2},
|
|
|
|
IngestExternalFileOptions());
|
2018-11-12 22:30:21 +00:00
|
|
|
TEST_SYNC_POINT("TestCompactFiles::IngestExternalFile1");
|
|
|
|
});
|
|
|
|
|
|
|
|
user_thread1.join();
|
|
|
|
user_thread2.join();
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(user_thread1_status);
|
|
|
|
ASSERT_OK(user_thread2_status);
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
|
2018-11-12 22:30:21 +00:00
|
|
|
}
|
|
|
|
|
2019-01-19 03:10:17 +00:00
|
|
|
TEST_F(DBTest2, MultiDBParallelOpenTest) {
|
|
|
|
const int kNumDbs = 2;
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
std::vector<std::string> dbnames;
|
|
|
|
for (int i = 0; i < kNumDbs; ++i) {
|
2022-05-06 20:03:58 +00:00
|
|
|
dbnames.emplace_back(test::PerThreadDBPath(env_, "db" + std::to_string(i)));
|
2019-01-19 03:10:17 +00:00
|
|
|
ASSERT_OK(DestroyDB(dbnames.back(), options));
|
|
|
|
}
|
|
|
|
|
|
|
|
// Verify empty DBs can be created in parallel
|
|
|
|
std::vector<std::thread> open_threads;
|
|
|
|
std::vector<DB*> dbs{static_cast<unsigned int>(kNumDbs), nullptr};
|
|
|
|
options.create_if_missing = true;
|
|
|
|
for (int i = 0; i < kNumDbs; ++i) {
|
|
|
|
open_threads.emplace_back(
|
|
|
|
[&](int dbnum) {
|
|
|
|
ASSERT_OK(DB::Open(options, dbnames[dbnum], &dbs[dbnum]));
|
|
|
|
},
|
|
|
|
i);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Now add some data and close, so next we can verify non-empty DBs can be
|
|
|
|
// recovered in parallel
|
|
|
|
for (int i = 0; i < kNumDbs; ++i) {
|
|
|
|
open_threads[i].join();
|
|
|
|
ASSERT_OK(dbs[i]->Put(WriteOptions(), "xi", "gua"));
|
|
|
|
delete dbs[i];
|
|
|
|
}
|
|
|
|
|
|
|
|
// Verify non-empty DBs can be recovered in parallel
|
|
|
|
open_threads.clear();
|
|
|
|
for (int i = 0; i < kNumDbs; ++i) {
|
|
|
|
open_threads.emplace_back(
|
|
|
|
[&](int dbnum) {
|
|
|
|
ASSERT_OK(DB::Open(options, dbnames[dbnum], &dbs[dbnum]));
|
|
|
|
},
|
|
|
|
i);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Wait and cleanup
|
|
|
|
for (int i = 0; i < kNumDbs; ++i) {
|
|
|
|
open_threads[i].join();
|
|
|
|
delete dbs[i];
|
|
|
|
ASSERT_OK(DestroyDB(dbnames[i], options));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-04-12 17:55:14 +00:00
|
|
|
namespace {
|
|
|
|
class DummyOldStats : public Statistics {
|
|
|
|
public:
|
2021-09-10 16:46:47 +00:00
|
|
|
const char* Name() const override { return "DummyOldStats"; }
|
2019-04-12 17:55:14 +00:00
|
|
|
uint64_t getTickerCount(uint32_t /*ticker_type*/) const override { return 0; }
|
|
|
|
void recordTick(uint32_t /* ticker_type */, uint64_t /* count */) override {
|
|
|
|
num_rt++;
|
|
|
|
}
|
|
|
|
void setTickerCount(uint32_t /*ticker_type*/, uint64_t /*count*/) override {}
|
|
|
|
uint64_t getAndResetTickerCount(uint32_t /*ticker_type*/) override {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
void measureTime(uint32_t /*histogram_type*/, uint64_t /*count*/) override {
|
|
|
|
num_mt++;
|
|
|
|
}
|
2020-02-20 20:07:53 +00:00
|
|
|
void histogramData(
|
|
|
|
uint32_t /*histogram_type*/,
|
|
|
|
ROCKSDB_NAMESPACE::HistogramData* const /*data*/) const override {}
|
2019-04-12 17:55:14 +00:00
|
|
|
std::string getHistogramString(uint32_t /*type*/) const override {
|
|
|
|
return "";
|
|
|
|
}
|
|
|
|
bool HistEnabledForType(uint32_t /*type*/) const override { return false; }
|
|
|
|
std::string ToString() const override { return ""; }
|
2020-10-12 18:20:45 +00:00
|
|
|
std::atomic<int> num_rt{0};
|
|
|
|
std::atomic<int> num_mt{0};
|
2019-04-12 17:55:14 +00:00
|
|
|
};
|
2022-11-02 21:34:24 +00:00
|
|
|
} // anonymous namespace
|
2019-04-12 17:55:14 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, OldStatsInterface) {
|
|
|
|
DummyOldStats* dos = new DummyOldStats();
|
|
|
|
std::shared_ptr<Statistics> stats(dos);
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.create_if_missing = true;
|
|
|
|
options.statistics = stats;
|
|
|
|
Reopen(options);
|
|
|
|
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
2019-04-12 17:55:14 +00:00
|
|
|
ASSERT_EQ("bar", Get("foo"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_EQ("bar", Get("foo"));
|
|
|
|
|
|
|
|
ASSERT_GT(dos->num_rt, 0);
|
|
|
|
ASSERT_GT(dos->num_mt, 0);
|
|
|
|
}
|
2019-05-01 17:13:33 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, CloseWithUnreleasedSnapshot) {
|
|
|
|
const Snapshot* ss = db_->GetSnapshot();
|
|
|
|
|
|
|
|
for (auto h : handles_) {
|
|
|
|
db_->DestroyColumnFamilyHandle(h);
|
|
|
|
}
|
|
|
|
handles_.clear();
|
|
|
|
|
|
|
|
ASSERT_NOK(db_->Close());
|
|
|
|
db_->ReleaseSnapshot(ss);
|
|
|
|
ASSERT_OK(db_->Close());
|
|
|
|
delete db_;
|
|
|
|
db_ = nullptr;
|
|
|
|
}
|
2019-07-23 01:53:03 +00:00
|
|
|
|
2019-09-18 00:08:57 +00:00
|
|
|
TEST_F(DBTest2, PrefixBloomReseek) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.create_if_missing = true;
|
|
|
|
options.prefix_extractor.reset(NewCappedPrefixTransform(3));
|
|
|
|
BlockBasedTableOptions bbto;
|
|
|
|
bbto.filter_policy.reset(NewBloomFilterPolicy(10, false));
|
|
|
|
bbto.whole_key_filtering = false;
|
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
// Construct two L1 files with keys:
|
|
|
|
// f1:[aaa1 ccc1] f2:[ddd0]
|
|
|
|
ASSERT_OK(Put("aaa1", ""));
|
|
|
|
ASSERT_OK(Put("ccc1", ""));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(Put("ddd0", ""));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
CompactRangeOptions cro;
|
|
|
|
cro.bottommost_level_compaction = BottommostLevelCompaction::kSkip;
|
|
|
|
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
|
|
|
|
|
|
|
|
ASSERT_OK(Put("bbb1", ""));
|
|
|
|
|
|
|
|
Iterator* iter = db_->NewIterator(ReadOptions());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iter->status());
|
2019-09-18 00:08:57 +00:00
|
|
|
|
|
|
|
// Seeking into f1, the iterator will check the bloom filter, which causes the
|
|
|
|
// file iterator to be invalidated, and the cursor will be put into f2, with
|
|
|
|
// the next key to be "ddd0".
|
|
|
|
iter->Seek("bbb1");
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("bbb1", iter->key().ToString());
|
|
|
|
|
|
|
|
// Reseek ccc1, the L1 iterator needs to go back to f1 and reseek.
|
|
|
|
iter->Seek("ccc1");
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("ccc1", iter->key().ToString());
|
|
|
|
|
|
|
|
delete iter;
|
|
|
|
}
|
|
|
|
|
2019-10-21 18:39:28 +00:00
|
|
|
TEST_F(DBTest2, PrefixBloomFilteredOut) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.create_if_missing = true;
|
|
|
|
options.prefix_extractor.reset(NewCappedPrefixTransform(3));
|
|
|
|
BlockBasedTableOptions bbto;
|
|
|
|
bbto.filter_policy.reset(NewBloomFilterPolicy(10, false));
|
|
|
|
bbto.whole_key_filtering = false;
|
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
// Construct two L1 files with keys:
|
|
|
|
// f1:[aaa1 ccc1] f2:[ddd0]
|
|
|
|
ASSERT_OK(Put("aaa1", ""));
|
|
|
|
ASSERT_OK(Put("ccc1", ""));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(Put("ddd0", ""));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
CompactRangeOptions cro;
|
|
|
|
cro.bottommost_level_compaction = BottommostLevelCompaction::kSkip;
|
|
|
|
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
|
|
|
|
|
|
|
|
Iterator* iter = db_->NewIterator(ReadOptions());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iter->status());
|
2019-10-21 18:39:28 +00:00
|
|
|
|
|
|
|
// The seek key is filtered out by f1's bloom filter.
|
|
|
|
// This is just one of several valid positions following the contract.
|
|
|
|
// Positioning to ccc1 or ddd0 is also valid. This is just to validate
|
|
|
|
// the behavior of the current implementation. If underlying implementation
|
|
|
|
// changes, the test might fail here.
|
|
|
|
iter->Seek("bbb1");
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iter->status());
|
2019-10-21 18:39:28 +00:00
|
|
|
ASSERT_FALSE(iter->Valid());
|
|
|
|
|
|
|
|
delete iter;
|
|
|
|
}
|
|
|
|
|
2019-07-23 01:53:03 +00:00
|
|
|
TEST_F(DBTest2, RowCacheSnapshot) {
|
|
|
|
Options options = CurrentOptions();
|
2020-02-20 20:07:53 +00:00
|
|
|
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
|
2019-09-16 22:14:51 +00:00
|
|
|
options.row_cache = NewLRUCache(8 * 8192);
|
2019-07-23 01:53:03 +00:00
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put("foo", "bar1"));
|
|
|
|
|
|
|
|
const Snapshot* s1 = db_->GetSnapshot();
|
|
|
|
|
|
|
|
ASSERT_OK(Put("foo", "bar2"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_OK(Put("foo2", "bar"));
|
|
|
|
const Snapshot* s2 = db_->GetSnapshot();
|
|
|
|
ASSERT_OK(Put("foo3", "bar"));
|
|
|
|
const Snapshot* s3 = db_->GetSnapshot();
|
|
|
|
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 0);
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 0);
|
|
|
|
ASSERT_EQ(Get("foo"), "bar2");
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 0);
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 1);
|
|
|
|
ASSERT_EQ(Get("foo"), "bar2");
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 1);
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 1);
|
|
|
|
ASSERT_EQ(Get("foo", s1), "bar1");
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 1);
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 2);
|
|
|
|
ASSERT_EQ(Get("foo", s2), "bar2");
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 2);
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 2);
|
|
|
|
ASSERT_EQ(Get("foo", s1), "bar1");
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 3);
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 2);
|
|
|
|
ASSERT_EQ(Get("foo", s3), "bar2");
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_HIT), 4);
|
|
|
|
ASSERT_EQ(TestGetTickerCount(options, ROW_CACHE_MISS), 2);
|
|
|
|
|
|
|
|
db_->ReleaseSnapshot(s1);
|
|
|
|
db_->ReleaseSnapshot(s2);
|
|
|
|
db_->ReleaseSnapshot(s3);
|
|
|
|
}
|
2019-09-26 23:16:28 +00:00
|
|
|
|
|
|
|
// When DB is reopened with multiple column families, the manifest file
|
|
|
|
// is written after the first CF is flushed, and it is written again
|
|
|
|
// after each flush. If DB crashes between the flushes, the flushed CF
|
|
|
|
// will have advanced past the latest log file, and we now require that file not
|
|
|
|
// to be corrupted, otherwise a corruption report is triggered.
|
|
|
|
// We need to fix the bug and enable the test.
|
2019-10-25 01:28:03 +00:00
|
|
|
TEST_F(DBTest2, CrashInRecoveryMultipleCF) {
|
|
|
|
const std::vector<std::string> sync_points = {
|
|
|
|
"DBImpl::RecoverLogFiles:BeforeFlushFinalMemtable",
|
|
|
|
"VersionSet::ProcessManifestWrites:BeforeWriteLastVersionEdit:0"};
|
|
|
|
for (const auto& test_sync_point : sync_points) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
// First destroy original db to ensure a clean start.
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
options.create_if_missing = true;
|
|
|
|
options.wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
|
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(Put(1, "foo", "bar"));
|
|
|
|
ASSERT_OK(Flush(1));
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put(1, "foo", "bar"));
|
|
|
|
// The value is large enough to be divided into two blocks.
|
|
|
|
std::string large_value(400, ' ');
|
|
|
|
ASSERT_OK(Put("foo1", large_value));
|
|
|
|
ASSERT_OK(Put("foo2", large_value));
|
|
|
|
Close();
|
2019-09-26 23:16:28 +00:00
|
|
|
|
2019-10-25 01:28:03 +00:00
|
|
|
// Corrupt the log file in the middle, so that it is not corrupted
|
|
|
|
// in the tail.
|
|
|
|
std::vector<std::string> filenames;
|
|
|
|
ASSERT_OK(env_->GetChildren(dbname_, &filenames));
|
|
|
|
for (const auto& f : filenames) {
|
|
|
|
uint64_t number;
|
|
|
|
FileType type;
|
2020-10-23 00:04:39 +00:00
|
|
|
if (ParseFileName(f, &number, &type) && type == FileType::kWalFile) {
|
2019-10-25 01:28:03 +00:00
|
|
|
std::string fname = dbname_ + "/" + f;
|
|
|
|
std::string file_content;
|
|
|
|
ASSERT_OK(ReadFileToString(env_, fname, &file_content));
|
|
|
|
file_content[400] = 'h';
|
|
|
|
file_content[401] = 'a';
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactoring to make the implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOptions so BuildTable() does not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOptions::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db open
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
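A hedged usage sketch of reading the new SST write histogram off a Statistics object after a workload (the histogram enum name follows the `rocksdb.sst.write.micros` stat named above; the helper itself is illustrative):
```
#include <iostream>
#include <memory>
#include "rocksdb/statistics.h"

// Dump a few percentiles from the SST write latency histogram.
void DumpSstWriteMicros(
    const std::shared_ptr<ROCKSDB_NAMESPACE::Statistics>& stats) {
  ROCKSDB_NAMESPACE::HistogramData data;
  stats->histogramData(ROCKSDB_NAMESPACE::SST_WRITE_MICROS, &data);
  std::cout << "sst.write.micros p50=" << data.median
            << " p99=" << data.percentile99
            << " count=" << data.count << std::endl;
}
```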
2023-12-29 23:29:23 +00:00
|
|
|
ASSERT_OK(WriteStringToFile(env_, file_content, fname, false));
|
2019-10-25 01:28:03 +00:00
|
|
|
break;
|
|
|
|
}
|
2019-09-26 23:16:28 +00:00
|
|
|
}
|
|
|
|
|
2019-10-25 01:28:03 +00:00
|
|
|
// Reopen and freeze the file system after the first manifest write.
|
|
|
|
FaultInjectionTestEnv fit_env(options.env);
|
|
|
|
options.env = &fit_env;
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
|
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
|
2019-10-25 01:28:03 +00:00
|
|
|
test_sync_point,
|
|
|
|
[&](void* /*arg*/) { fit_env.SetFilesystemActive(false); });
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
|
2019-10-25 01:28:03 +00:00
|
|
|
ASSERT_NOK(TryReopenWithColumnFamilies(
|
|
|
|
{kDefaultColumnFamilyName, "pikachu"}, options));
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
|
2019-10-25 01:28:03 +00:00
|
|
|
|
|
|
|
fit_env.SetFilesystemActive(true);
|
|
|
|
// If we continue using the fault injection Env, it will complain about something
|
|
|
|
// when renaming the current file, which is not expected. Need to investigate
|
|
|
|
// why.
|
|
|
|
options.env = env_;
|
|
|
|
ASSERT_OK(TryReopenWithColumnFamilies({kDefaultColumnFamilyName, "pikachu"},
|
|
|
|
options));
|
|
|
|
}
|
2019-09-26 23:16:28 +00:00
|
|
|
}
|
2019-11-13 18:10:09 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, SeekFileRangeDeleteTail) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.prefix_extractor.reset(NewCappedPrefixTransform(1));
|
|
|
|
options.num_levels = 3;
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put("a", "a"));
|
|
|
|
const Snapshot* s1 = db_->GetSnapshot();
|
|
|
|
ASSERT_OK(
|
|
|
|
db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "a", "f"));
|
|
|
|
ASSERT_OK(Put("b", "a"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_OK(Put("x", "a"));
|
|
|
|
ASSERT_OK(Put("z", "a"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
CompactRangeOptions cro;
|
|
|
|
cro.change_level = true;
|
|
|
|
cro.target_level = 2;
|
|
|
|
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
|
|
|
|
|
|
|
|
{
|
|
|
|
ReadOptions ro;
|
|
|
|
ro.total_order_seek = true;
|
|
|
|
std::unique_ptr<Iterator> iter(db_->NewIterator(ro));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iter->status());
|
2019-11-13 18:10:09 +00:00
|
|
|
iter->Seek("e");
|
|
|
|
ASSERT_TRUE(iter->Valid());
|
|
|
|
ASSERT_EQ("x", iter->key().ToString());
|
|
|
|
}
|
|
|
|
db_->ReleaseSnapshot(s1);
|
|
|
|
}
|
2019-12-17 21:20:42 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, BackgroundPurgeTest) {
|
|
|
|
Options options = CurrentOptions();
|
2020-02-20 20:07:53 +00:00
|
|
|
options.write_buffer_manager =
|
|
|
|
std::make_shared<ROCKSDB_NAMESPACE::WriteBufferManager>(1 << 20);
|
2019-12-17 21:20:42 +00:00
|
|
|
options.avoid_unnecessary_blocking_io = true;
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
size_t base_value = options.write_buffer_manager->memory_usage();
|
|
|
|
|
|
|
|
ASSERT_OK(Put("a", "a"));
|
|
|
|
Iterator* iter = db_->NewIterator(ReadOptions());
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iter->status());
|
2019-12-17 21:20:42 +00:00
|
|
|
ASSERT_OK(Flush());
|
|
|
|
size_t value = options.write_buffer_manager->memory_usage();
|
|
|
|
ASSERT_GT(value, base_value);
|
|
|
|
|
|
|
|
db_->GetEnv()->SetBackgroundThreads(1, Env::Priority::HIGH);
|
|
|
|
test::SleepingBackgroundTask sleeping_task_after;
|
|
|
|
db_->GetEnv()->Schedule(&test::SleepingBackgroundTask::DoSleepTask,
|
|
|
|
&sleeping_task_after, Env::Priority::HIGH);
|
|
|
|
delete iter;
|
|
|
|
|
|
|
|
Env::Default()->SleepForMicroseconds(100000);
|
|
|
|
value = options.write_buffer_manager->memory_usage();
|
|
|
|
ASSERT_GT(value, base_value);
|
|
|
|
|
|
|
|
sleeping_task_after.WakeUp();
|
|
|
|
sleeping_task_after.WaitUntilDone();
|
|
|
|
|
|
|
|
test::SleepingBackgroundTask sleeping_task_after2;
|
|
|
|
db_->GetEnv()->Schedule(&test::SleepingBackgroundTask::DoSleepTask,
|
|
|
|
&sleeping_task_after2, Env::Priority::HIGH);
|
|
|
|
sleeping_task_after2.WakeUp();
|
|
|
|
sleeping_task_after2.WaitUntilDone();
|
|
|
|
|
|
|
|
value = options.write_buffer_manager->memory_usage();
|
|
|
|
ASSERT_EQ(base_value, value);
|
|
|
|
}
|
2020-01-07 04:08:24 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, SwitchMemtableRaceWithNewManifest) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
options.max_manifest_file_size = 10;
|
|
|
|
options.create_if_missing = true;
|
|
|
|
CreateAndReopenWithCF({"pikachu"}, options);
|
2020-01-07 21:45:21 +00:00
|
|
|
ASSERT_EQ(2, handles_.size());
|
|
|
|
|
2020-01-07 04:08:24 +00:00
|
|
|
ASSERT_OK(Put("foo", "value"));
|
2020-01-07 21:45:21 +00:00
|
|
|
const int kL0Files = options.level0_file_num_compaction_trigger;
|
|
|
|
for (int i = 0; i < kL0Files; ++i) {
|
|
|
|
ASSERT_OK(Put(/*cf=*/1, "a", std::to_string(i)));
|
|
|
|
ASSERT_OK(Flush(/*cf=*/1));
|
|
|
|
}
|
|
|
|
|
2020-01-07 04:08:24 +00:00
|
|
|
port::Thread thread([&]() { ASSERT_OK(Flush()); });
|
2020-01-07 21:45:21 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2020-01-07 04:08:24 +00:00
|
|
|
thread.join();
|
|
|
|
}
|
2020-01-14 00:25:28 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, SameSmallestInSameLevel) {
|
|
|
|
// This test validates fractional cascading logic when several files at
|
|
|
|
// one level contain only the same user key.
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.merge_operator = MergeOperators::CreateStringAppendOperator();
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put("key", "1"));
|
|
|
|
ASSERT_OK(Put("key", "2"));
|
|
|
|
ASSERT_OK(db_->Merge(WriteOptions(), "key", "3"));
|
|
|
|
ASSERT_OK(db_->Merge(WriteOptions(), "key", "4"));
|
2021-05-20 23:06:12 +00:00
|
|
|
ASSERT_OK(Flush());
|
2020-01-14 00:25:28 +00:00
|
|
|
CompactRangeOptions cro;
|
|
|
|
cro.change_level = true;
|
|
|
|
cro.target_level = 2;
|
|
|
|
ASSERT_OK(dbfull()->CompactRange(cro, db_->DefaultColumnFamily(), nullptr,
|
|
|
|
nullptr));
|
|
|
|
|
|
|
|
ASSERT_OK(db_->Merge(WriteOptions(), "key", "5"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2020-01-14 00:25:28 +00:00
|
|
|
ASSERT_OK(db_->Merge(WriteOptions(), "key", "6"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2020-01-14 00:25:28 +00:00
|
|
|
ASSERT_OK(db_->Merge(WriteOptions(), "key", "7"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2020-01-14 00:25:28 +00:00
|
|
|
ASSERT_OK(db_->Merge(WriteOptions(), "key", "8"));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:
In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
Test Plan:
Existing tests that involve a shutdown in progress:
- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle
Reviewed By: pdillinger
Differential Revision: D45923426
Pulled By: jaykorean
fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
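A hedged usage sketch of the public API this introduces (the DB method is spelled `WaitForCompact` in the header; the option field name follows the summary and should be treated as an assumption):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Wait until the DB has worked off its compaction debt, bailing out with
// Status::Aborted() if background work happens to be paused.
ROCKSDB_NAMESPACE::Status WaitForCompactionDebt(ROCKSDB_NAMESPACE::DB* db) {
  ROCKSDB_NAMESPACE::WaitForCompactOptions wait_opts;
  wait_opts.abort_on_pause = true;
  return db->WaitForCompact(wait_opts);
}
```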
2023-05-18 01:13:50 +00:00
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2020-01-14 00:25:28 +00:00
|
|
|
ASSERT_EQ("0,4,1", FilesPerLevel());
|
|
|
|
|
|
|
|
ASSERT_EQ("2,3,4,5,6,7,8", Get("key"));
|
|
|
|
}
|
2020-01-15 22:03:18 +00:00
|
|
|
|
2020-05-04 21:15:55 +00:00
|
|
|
TEST_F(DBTest2, FileConsistencyCheckInOpen) {
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
2020-05-04 21:15:55 +00:00
|
|
|
|
|
|
|
SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"VersionBuilder::CheckConsistencyBeforeReturn", [&](void* arg) {
|
|
|
|
Status* ret_s = static_cast<Status*>(arg);
|
|
|
|
*ret_s = Status::Corruption("fcc");
|
|
|
|
});
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.force_consistency_checks = true;
|
|
|
|
ASSERT_NOK(TryReopen(options));
|
|
|
|
|
|
|
|
SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
}
|
|
|
|
|
2020-01-15 22:03:18 +00:00
|
|
|
TEST_F(DBTest2, BlockBasedTablePrefixIndexSeekForPrev) {
|
|
|
|
// create a DB with block prefix index
|
|
|
|
BlockBasedTableOptions table_options;
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
table_options.block_size = 300;
|
|
|
|
table_options.index_type = BlockBasedTableOptions::kHashSearch;
|
|
|
|
table_options.index_shortening =
|
|
|
|
BlockBasedTableOptions::IndexShorteningMode::kNoShortening;
|
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
|
|
|
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
|
|
|
|
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
Random rnd(301);
|
2020-07-09 21:33:42 +00:00
|
|
|
std::string large_value = rnd.RandomString(500);
|
2020-01-15 22:03:18 +00:00
|
|
|
|
|
|
|
ASSERT_OK(Put("a1", large_value));
|
|
|
|
ASSERT_OK(Put("x1", large_value));
|
|
|
|
ASSERT_OK(Put("y1", large_value));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(Flush());
|
2020-01-15 22:03:18 +00:00
|
|
|
|
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ReadOptions()));
|
2021-08-16 15:09:46 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
2020-01-15 22:03:18 +00:00
|
|
|
iterator->SeekForPrev("x3");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("x1", iterator->key().ToString());
|
|
|
|
|
|
|
|
iterator->SeekForPrev("a3");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("a1", iterator->key().ToString());
|
|
|
|
|
|
|
|
iterator->SeekForPrev("y3");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("y1", iterator->key().ToString());
|
|
|
|
|
|
|
|
// Query more than one non-existing prefix to cover both the empty hash
|
|
|
|
// bucket case and the hash bucket conflict case.
|
|
|
|
iterator->SeekForPrev("b1");
|
|
|
|
// Result should be not valid or "a1".
|
|
|
|
if (iterator->Valid()) {
|
|
|
|
ASSERT_EQ("a1", iterator->key().ToString());
|
|
|
|
}
|
|
|
|
|
|
|
|
iterator->SeekForPrev("c1");
|
|
|
|
// Result should be not valid or "a1".
|
|
|
|
if (iterator->Valid()) {
|
|
|
|
ASSERT_EQ("a1", iterator->key().ToString());
|
|
|
|
}
|
|
|
|
|
|
|
|
iterator->SeekForPrev("d1");
|
|
|
|
// Result should be not valid or "a1".
|
|
|
|
if (iterator->Valid()) {
|
|
|
|
ASSERT_EQ("a1", iterator->key().ToString());
|
|
|
|
}
|
2020-01-16 18:46:05 +00:00
|
|
|
|
|
|
|
iterator->SeekForPrev("y3");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("y1", iterator->key().ToString());
|
2020-01-15 22:03:18 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-08-26 01:59:19 +00:00
|
|
|
TEST_F(DBTest2, PartitionedIndexPrefetchFailure) {
|
|
|
|
Options options = last_options_;
|
2020-10-27 17:31:34 +00:00
|
|
|
options.env = env_;
|
2020-08-26 01:59:19 +00:00
|
|
|
options.max_open_files = 20;
|
|
|
|
BlockBasedTableOptions bbto;
|
|
|
|
bbto.index_type = BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
|
|
|
|
bbto.metadata_block_size = 128;
|
|
|
|
bbto.block_size = 128;
|
|
|
|
bbto.block_cache = NewLRUCache(16777216);
|
|
|
|
bbto.cache_index_and_filter_blocks = true;
|
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
// Force no table cache so every read will preload the SST file.
|
|
|
|
dbfull()->TEST_table_cache()->SetCapacity(0);
|
|
|
|
bbto.block_cache->SetCapacity(0);
|
|
|
|
|
|
|
|
Random rnd(301);
|
|
|
|
for (int i = 0; i < 4096; i++) {
|
|
|
|
ASSERT_OK(Put(Key(i), rnd.RandomString(32)));
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
// Try different random failures in table open for 300 times.
|
|
|
|
for (int i = 0; i < 300; i++) {
|
|
|
|
env_->num_reads_fails_ = 0;
|
|
|
|
env_->rand_reads_fail_odd_ = 8;
|
|
|
|
|
|
|
|
std::string value;
|
|
|
|
Status s = dbfull()->Get(ReadOptions(), Key(1), &value);
|
|
|
|
if (env_->num_reads_fails_ > 0) {
|
|
|
|
ASSERT_NOK(s);
|
|
|
|
} else {
|
|
|
|
ASSERT_OK(s);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
env_->rand_reads_fail_odd_ = 0;
|
|
|
|
}
|
|
|
|
|
2020-01-31 19:00:24 +00:00
|
|
|
TEST_F(DBTest2, ChangePrefixExtractor) {
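  // Verifies iterator correctness after the prefix extractor is switched
  // from a 2-byte to a 1-byte fixed prefix across Reopen(): the filter built
  // with the old prefix may only be consulted when that cannot change the
  // result for the given seek key and upper bound. Filter-match tickers are
  // asserted only in the non-partitioned-filter case (expect_filter_check).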
  for (bool use_partitioned_filter : {true, false}) {
    // create a DB with block prefix index
    BlockBasedTableOptions table_options;
    Options options = CurrentOptions();

    // Sometimes the filter is checked based on the upper bound. Assert the
    // counters for that case; otherwise, only check data correctness.
    bool expect_filter_check = !use_partitioned_filter;
    table_options.partition_filters = use_partitioned_filter;
    if (use_partitioned_filter) {
      table_options.index_type =
          BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
    }
    table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));

    options.table_factory.reset(NewBlockBasedTableFactory(table_options));
    options.statistics = CreateDBStatistics();

    options.prefix_extractor.reset(NewFixedPrefixTransform(2));
    DestroyAndReopen(options);

    Random rnd(301);

    ASSERT_OK(Put("aa", ""));
    ASSERT_OK(Put("xb", ""));
    ASSERT_OK(Put("xx1", ""));
    ASSERT_OK(Put("xz1", ""));
    ASSERT_OK(Put("zz", ""));
    ASSERT_OK(Flush());

    // After reopening the DB with the prefix size changed from 2 to 1, the
    // prefix extractor won't take effect unless doing so cannot change the
    // result for the given upper bound and seek key.
    options.prefix_extractor.reset(NewFixedPrefixTransform(1));
    Reopen(options);

    {
      std::unique_ptr<Iterator> iterator(db_->NewIterator(ReadOptions()));
      ASSERT_OK(iterator->status());
      iterator->Seek("xa");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xb", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(0, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }

      iterator->Seek("xz");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xz1", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(0, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }
    }

    std::string ub_str = "xg9";
    Slice ub(ub_str);
    ReadOptions ro;
    ro.iterate_upper_bound = &ub;

    {
      std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
      ASSERT_OK(iterator->status());

      // SeekForPrev() never uses the prefix bloom filter if the prefix
      // extractor has changed.
      iterator->SeekForPrev("xg0");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xb", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(0, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }
    }

    ub_str = "xx9";
    ub = Slice(ub_str);
    {
      std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
      ASSERT_OK(iterator->status());

      iterator->Seek("x");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xb", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(0, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }

      iterator->Seek("xx0");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xx1", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(1, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }
    }

    CompactRangeOptions compact_range_opts;
    compact_range_opts.bottommost_level_compaction =
        BottommostLevelCompaction::kForce;
    ASSERT_OK(db_->CompactRange(compact_range_opts, nullptr, nullptr));
    ASSERT_OK(db_->CompactRange(compact_range_opts, nullptr, nullptr));

    // Re-execute similar queries after a full compaction
    {
      std::unique_ptr<Iterator> iterator(db_->NewIterator(ReadOptions()));

      iterator->Seek("x");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xb", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(1, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }

      iterator->Seek("xg");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xx1", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(1, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }

      iterator->Seek("xz");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xz1", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(1, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }

      ASSERT_OK(iterator->status());
    }
    {
      std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));

      iterator->SeekForPrev("xx0");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xb", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(1, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }

      iterator->Seek("xx0");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xx1", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(1, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }

      ASSERT_OK(iterator->status());
    }

    ub_str = "xg9";
    ub = Slice(ub_str);
    {
      std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
      iterator->SeekForPrev("xg0");
      ASSERT_TRUE(iterator->Valid());
      ASSERT_EQ("xb", iterator->key().ToString());
      if (expect_filter_check) {
        EXPECT_EQ(1, PopTicker(options, NON_LAST_LEVEL_SEEK_FILTER_MATCH));
      }
      ASSERT_OK(iterator->status());
    }
  }
}

TEST_F(DBTest2, BlockBasedTablePrefixGetIndexNotFound) {
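  // With a kHashSearch index and a 1-byte prefix extractor, some SST files
  // may have no hash bucket for prefix "b" at all. Get("b1") must still
  // return the value from the file that does contain the key; an empty hash
  // bucket must not be treated as key-not-found.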
  // create a DB with block prefix index
  BlockBasedTableOptions table_options;
  Options options = CurrentOptions();
  table_options.block_size = 300;
  table_options.index_type = BlockBasedTableOptions::kHashSearch;
  table_options.index_shortening =
      BlockBasedTableOptions::IndexShorteningMode::kNoShortening;
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));
  options.prefix_extractor.reset(NewFixedPrefixTransform(1));
  options.level0_file_num_compaction_trigger = 8;

  Reopen(options);

  ASSERT_OK(Put("b1", "ok"));
  ASSERT_OK(Flush());

  // Flush several files so that the chance that the hash bucket for "b" is
  // empty in at least one of the files is high.
  ASSERT_OK(Put("a1", ""));
  ASSERT_OK(Put("c1", ""));
  ASSERT_OK(Flush());

  ASSERT_OK(Put("a2", ""));
  ASSERT_OK(Put("c2", ""));
  ASSERT_OK(Flush());

  ASSERT_OK(Put("a3", ""));
  ASSERT_OK(Put("c3", ""));
  ASSERT_OK(Flush());

  ASSERT_OK(Put("a4", ""));
  ASSERT_OK(Put("c4", ""));
  ASSERT_OK(Flush());

  ASSERT_OK(Put("a5", ""));
  ASSERT_OK(Put("c5", ""));
  ASSERT_OK(Flush());

  ASSERT_EQ("ok", Get("b1"));
}

TEST_F(DBTest2, AutoPrefixMode1) {
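  // Exercises auto_prefix_mode: with total_order_seek disabled and a 1-byte
  // prefix extractor plus prefix bloom filter, a seek may consult the prefix
  // filter in the "safe" cases where the iterate upper bound is the
  // same-length immediate successor of the seek prefix. The hit/miss tickers
  // checked below are chosen between the LAST_LEVEL_* and NON_LAST_LEVEL_*
  // variants based on options.num_levels.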
|
2022-05-19 20:09:03 +00:00
|
|
|
do {
|
|
|
|
// create a DB with block prefix index
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
BlockBasedTableOptions table_options =
|
|
|
|
*options.table_factory->GetOptions<BlockBasedTableOptions>();
|
|
|
|
table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));
|
|
|
|
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
|
|
|
|
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
|
|
|
|
options.statistics = CreateDBStatistics();
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-05-19 20:09:03 +00:00
|
|
|
Reopen(options);
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-05-19 20:09:03 +00:00
|
|
|
Random rnd(301);
|
|
|
|
std::string large_value = rnd.RandomString(500);
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-05-19 20:09:03 +00:00
|
|
|
ASSERT_OK(Put("a1", large_value));
|
|
|
|
ASSERT_OK(Put("x1", large_value));
|
|
|
|
ASSERT_OK(Put("y1", large_value));
|
|
|
|
ASSERT_OK(Flush());
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-05-19 20:09:03 +00:00
|
|
|
ReadOptions ro;
|
|
|
|
ro.total_order_seek = false;
|
|
|
|
ro.auto_prefix_mode = true;
|
Document design/specification bugs with auto_prefix_mode (#10144)
Summary:
auto_prefix_mode is designed to use prefix filtering in a
particular "safe" set of cases where the upper bound and the seek key
have different prefixes: where the upper bound is the "same length
immediate successor". These conditions are not sufficient to guarantee
the same iteration results as total_order_seek if the DB contains
"short" keys, less than the "full" (maximum) prefix length.
We are not simply disabling the optimization in these successor cases
because it is likely that users are essentially getting what they want
out of existing usage. Especially if users are constructing successor
bounds with the intention of doing a prefix-bounded seek, the existing
behavior is more expected than the total_order_seek behavior.
Consequently, for now we reconcile the bad specification of behavior by
documenting the existing mismatch with total_order_seek.
A closely related issue affects hypothetical comparators like
ReverseBytewiseComparator: if they "correctly" implement
IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more
entries (other than "short" keys noted above). Luckily, the built-in
ReverseBytewiseComparator has an "incorrect" implementation of
IsSameLengthImmediateSuccessor that effectively prevents prefix
optimization and, thus, the bug. This is now documented as a new
constraint on IsSameLengthImmediateSuccessor, and the implementation
tweaked to be simply "safe" rather than "incorrect".
This change also includes unit test updates to demonstrate the above
issues. (Test was cleaned up for readability and simplicity.)
Intended follow-up:
* Tweak documented axioms for prefix_extractor (more details then)
* Consider some sort of fix for this case. I don't know what that would
look like without breaking the performance of existing code. Perhaps
if all keys in an SST file have prefixes that are "full length," we can track
that fact and use it to allow optimization with the "same length
immediate successor", but that would only apply to new files.
* Consider a better system of specifying prefix bounds
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144
Test Plan: test updates included
Reviewed By: siying
Differential Revision: D37052710
Pulled By: pdillinger
fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db
2022-06-13 18:08:50 +00:00
|
|
|
|
Much better stats for seeks and prefix filtering (#11460)
Summary:
We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
This change does several things:
* Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
* Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
* We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
* For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
* The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
* The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
* Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
Test Plan:
unit tests updated, including updating many to pop the stat value since last read to improve test
readability and maintainability.
Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
```
And run simultaneous before&after with
```
TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
```
Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec; 18.4 (± 0.0) MB/sec
After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec; 19.1 (± 0.0) MB/sec
Reviewed By: ajkr
Differential Revision: D46029177
Pulled By: pdillinger
fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
2023-05-19 22:25:49 +00:00
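For a rough sense of how the new tickers can be consumed, here is a small sketch (assuming a Statistics object was installed on the DB's Options; the ratio is illustrative arithmetic, not an API introduced by this change) that reports what fraction of last-level seeks the prefix filter excluded outright:
```cpp
#include <cstdint>
#include <memory>

#include "rocksdb/statistics.h"

// Hedged sketch: read two of the new *LEVEL_SEEK* tickers used by the test
// below and compute the fraction of last-level seeks that were filtered out
// (the prefix filter said "definitely not in this file") versus those where
// the filter matched and the seek proceeded into the file.
double LastLevelSeekFilteredFraction(
    const std::shared_ptr<ROCKSDB_NAMESPACE::Statistics>& stats) {
  const uint64_t filtered =
      stats->getTickerCount(ROCKSDB_NAMESPACE::LAST_LEVEL_SEEK_FILTERED);
  const uint64_t matched =
      stats->getTickerCount(ROCKSDB_NAMESPACE::LAST_LEVEL_SEEK_FILTER_MATCH);
  const uint64_t total = filtered + matched;
  return total == 0 ? 0.0 : static_cast<double>(filtered) / total;
}
```
The unit test below instead reads the tickers through the TestGetAndResetTickerCount() helper, which "pops" the value accumulated since the last read, matching the Test Plan note above.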
|
|
|
const auto hit_stat = options.num_levels == 1
|
|
|
|
? LAST_LEVEL_SEEK_FILTER_MATCH
|
|
|
|
: NON_LAST_LEVEL_SEEK_FILTER_MATCH;
|
|
|
|
const auto miss_stat = options.num_levels == 1
|
|
|
|
? LAST_LEVEL_SEEK_FILTERED
|
|
|
|
: NON_LAST_LEVEL_SEEK_FILTERED;
|
2022-05-19 20:09:03 +00:00
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
|
|
|
|
iterator->Seek("b1");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("x1", iterator->key().ToString());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-05-19 20:09:03 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
|
|
|
}
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
Slice ub;
|
2020-01-28 22:42:21 +00:00
|
|
|
ro.iterate_upper_bound = &ub;
|
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
ub = "b9";
|
2022-05-19 20:09:03 +00:00
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
|
|
|
|
iterator->Seek("b1");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(1, TestGetAndResetTickerCount(options, miss_stat));
|
2022-05-19 20:09:03 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
|
|
|
}
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
ub = "z";
|
2022-05-19 20:09:03 +00:00
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
|
|
|
|
iterator->Seek("b1");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("x1", iterator->key().ToString());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-05-19 20:09:03 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
|
|
|
}
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
ub = "c";
|
2022-05-19 20:09:03 +00:00
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
|
|
|
|
iterator->Seek("b1");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(1, TestGetAndResetTickerCount(options, miss_stat));
|
2022-05-19 20:09:03 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
|
|
|
}
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
ub = "c1";
|
2022-05-19 20:09:03 +00:00
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
|
|
|
|
iterator->Seek("b1");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-05-19 20:09:03 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
2022-06-13 18:08:50 +00:00
|
|
|
}
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
// The same queries without recreating the iterator
|
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
|
|
|
|
|
|
|
|
ub = "b9";
|
|
|
|
iterator->Seek("b1");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(1, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
2020-01-28 22:42:21 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
ub = "z";
|
2022-05-19 20:09:03 +00:00
|
|
|
iterator->Seek("b1");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("x1", iterator->key().ToString());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-05-19 20:09:03 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
ub = "c";
|
2022-05-19 20:09:03 +00:00
|
|
|
iterator->Seek("b1");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
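For reference, here is a minimal sketch (not from this test file) of how application code could inspect the seek-level statistics described above. It assumes a RocksDB release containing this change; the NON_LAST_LEVEL_SEEK_* ticker names follow the PR description and should be treated as assumptions against older releases, and the DB path and keys are illustrative.
```
#include <iostream>
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"
#include "rocksdb/statistics.h"

int main() {
  using namespace ROCKSDB_NAMESPACE;
  Options options;
  options.create_if_missing = true;
  options.prefix_extractor.reset(NewFixedPrefixTransform(8));
  options.statistics = CreateDBStatistics();  // enable ticker collection

  DB* db = nullptr;
  // Illustrative path, not from this test.
  if (!DB::Open(options, "/tmp/seek_stats_demo", &db).ok()) {
    return 1;
  }
  db->Put(WriteOptions(), "prefix01-key", "v");
  db->Flush(FlushOptions());

  std::unique_ptr<Iterator> it(db->NewIterator(ReadOptions()));
  it->Seek("prefix01");  // tickers accumulate per SST-file seek

  // Ticker names per this change (assumed): seeks filtered out entirely by a
  // prefix filter vs. seeks that went on to access at least one data block.
  auto& stats = *options.statistics;
  std::cout << "non-last-level seeks filtered:     "
            << stats.getTickerCount(NON_LAST_LEVEL_SEEK_FILTERED) << "\n"
            << "non-last-level seeks reading data: "
            << stats.getTickerCount(NON_LAST_LEVEL_SEEK_DATA) << "\n";
  delete db;
  return 0;
}
```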
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(1, TestGetAndResetTickerCount(options, miss_stat));
|
2022-05-19 20:09:03 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
ub = "b9";
|
2022-05-19 20:09:03 +00:00
|
|
|
iterator->SeekForPrev("b1");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("a1", iterator->key().ToString());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-05-19 20:09:03 +00:00
|
|
|
|
2022-06-13 18:08:50 +00:00
|
|
|
ub = "zz";
|
2022-05-19 20:09:03 +00:00
|
|
|
iterator->SeekToLast();
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("y1", iterator->key().ToString());
|
|
|
|
|
|
|
|
iterator->SeekToFirst();
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("a1", iterator->key().ToString());
|
|
|
|
}
|
2022-06-13 18:08:50 +00:00
|
|
|
|
|
|
|
// Similar, now with reverse comparator
|
|
|
|
// Technically, we are violating axiom 2 of prefix_extractors, but
|
|
|
|
// it should be revised because of major use-cases using
|
|
|
|
// ReverseBytewiseComparator with capped/fixed prefix Seek. (FIXME)
|
|
|
|
options.comparator = ReverseBytewiseComparator();
|
|
|
|
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
ASSERT_OK(Put("a1", large_value));
|
|
|
|
ASSERT_OK(Put("x1", large_value));
|
|
|
|
ASSERT_OK(Put("y1", large_value));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
{
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
|
|
|
|
|
|
|
|
ub = "b1";
|
|
|
|
iterator->Seek("b9");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(1, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
|
|
|
|
|
|
|
ub = "b1";
|
|
|
|
iterator->Seek("z");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("y1", iterator->key().ToString());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
|
|
|
|
ub = "b1";
|
|
|
|
iterator->Seek("c");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
|
|
|
|
ub = "b";
|
|
|
|
iterator->Seek("c9");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
|
|
|
// Fails if ReverseBytewiseComparator::IsSameLengthImmediateSuccessor
|
|
|
|
// is "correctly" implemented.
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
|
|
|
|
ub = "a";
|
|
|
|
iterator->Seek("b9");
|
|
|
|
// Fails if ReverseBytewiseComparator::IsSameLengthImmediateSuccessor
|
|
|
|
// is "correctly" implemented.
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("a1", iterator->key().ToString());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
|
|
|
|
ub = "b";
|
|
|
|
iterator->Seek("a");
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
|
|
|
// Fails if ReverseBytewiseComparator::IsSameLengthImmediateSuccessor
|
|
|
|
// matches BytewiseComparator::IsSameLengthImmediateSuccessor. Upper
|
|
|
|
// comparing before seek key prevents a real bug from surfacing.
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
|
|
|
|
ub = "b1";
|
|
|
|
iterator->SeekForPrev("b9");
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
// Fails if ReverseBytewiseComparator::IsSameLengthImmediateSuccessor
|
|
|
|
// is "correctly" implemented.
|
|
|
|
ASSERT_EQ("x1", iterator->key().ToString());
|
2023-05-19 22:25:49 +00:00
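To make the new counters above more concrete, here is a minimal C++ sketch of how the \*LEVEL_SEEK\* tickers could be read back from a Statistics object to estimate the seek "false positive" rate described in this summary (data block accessed but no value() read). The specific ticker names LAST_LEVEL_SEEK_DATA, LAST_LEVEL_SEEK_DATA_USEFUL_NO_FILTER, and LAST_LEVEL_SEEK_DATA_USEFUL_FILTER_MATCH are assumptions inferred from the description above, not quoted from the released header.
```
// Hedged sketch: estimate the last-level seek "false positive" rate from the
// new *LEVEL_SEEK* tickers. Ticker names below are assumed from the summary.
#include <cstdint>

#include "rocksdb/options.h"
#include "rocksdb/statistics.h"

using ROCKSDB_NAMESPACE::Options;
using ROCKSDB_NAMESPACE::Statistics;

double EstimateLastLevelSeekFalsePositiveRate(const Options& options) {
  Statistics* stats = options.statistics.get();
  if (stats == nullptr) {
    return 0.0;
  }
  // Seeks that accessed at least one data block in a last-level file.
  uint64_t data_seeks =
      stats->getTickerCount(ROCKSDB_NAMESPACE::LAST_LEVEL_SEEK_DATA);
  // Seeks where at least one value() was subsequently read ("useful" seeks).
  uint64_t useful_seeks =
      stats->getTickerCount(
          ROCKSDB_NAMESPACE::LAST_LEVEL_SEEK_DATA_USEFUL_NO_FILTER) +
      stats->getTickerCount(
          ROCKSDB_NAMESPACE::LAST_LEVEL_SEEK_DATA_USEFUL_FILTER_MATCH);
  if (data_seeks == 0) {
    return 0.0;
  }
  // Data block accessed but no value() read: per the reasoning above, this is
  // usually a filter false positive (or an uninteresting key).
  return 1.0 - static_cast<double>(useful_seeks) /
                   static_cast<double>(data_seeks);
}
```
The surrounding test code exercises the related counters indirectly through TestGetAndResetTickerCount(options, hit_stat/miss_stat).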
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
Document design/specification bugs with auto_prefix_mode (#10144)
Summary:
auto_prefix_mode is designed to use prefix filtering in a
particular "safe" set of cases where the upper bound and the seek key
have different prefixes: where the upper bound is the "same length
immediate successor". These conditions are not sufficient to guarantee
the same iteration results as total_order_seek if the DB contains
"short" keys, less than the "full" (maximum) prefix length.
We are not simply disabling the optimization in these successor cases
because it is likely that users are essentially getting what they want
out of existing usage. Especially if users are constructing successor
bounds with the intention of doing a prefix-bounded seek, the existing
behavior is more expected than the total_order_seek behavior.
Consequently, for now we reconcile the bad specification of behavior by
documenting the existing mismatch with total_order_seek.
A closely related issue affects hypothetical comparators like
ReverseBytewiseComparator: if they "correctly" implement
IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more
entries (other than "short" keys noted above). Luckily, the built-in
ReverseBytewiseComparator has an "incorrect" implementation of
IsSameLengthImmediateSuccessor that effectively prevents prefix
optimization and, thus, the bug. This is now documented as a new
constraint on IsSameLengthImmediateSuccessor, and the implementation
tweaked to be simply "safe" rather than "incorrect".
This change also includes unit test updates to demonstrate the above
issues. (Test was cleaned up for readability and simplicity.)
Intended follow-up:
* Tweak documented axioms for prefix_extractor (more details then)
* Consider some sort of fix for this case. I don't know what that would
look like without breaking the performance of existing code. Perhaps
if all keys in an SST file have prefixes that are "full length," we can track
that fact and use it to allow optimization with the "same length
immediate successor", but that would only apply to new files.
* Consider a better system of specifying prefix bounds
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144
Test Plan: test updates included
Reviewed By: siying
Differential Revision: D37052710
Pulled By: pdillinger
fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db
2022-06-13 18:08:50 +00:00
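For reference, this is a minimal sketch of the "safe" usage pattern the summary describes for auto_prefix_mode: the upper bound is the same-length immediate successor of the seek key's prefix. It assumes the DB was opened with a prefix_extractor configured (for example a fixed-length one); otherwise only ReadOptions fields already used in the tests below appear here.
```
// Minimal sketch of the intended "safe" auto_prefix_mode usage. Assumes the
// DB was opened with a prefix_extractor (e.g. fixed-length) configured.
#include <cassert>
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/iterator.h"
#include "rocksdb/options.h"
#include "rocksdb/slice.h"

using namespace ROCKSDB_NAMESPACE;

void AutoPrefixSeekSketch(DB* db) {
  ReadOptions ro;
  ro.auto_prefix_mode = true;
  // "b" is the same-length immediate successor of the prefix "a", the case
  // auto_prefix_mode is designed to optimize. Per the caveat above, keys
  // shorter than the full prefix length may be skipped vs. total_order_seek.
  Slice ub("b");
  ro.iterate_upper_bound = &ub;
  std::unique_ptr<Iterator> it(db->NewIterator(ro));
  for (it->Seek("a"); it->Valid(); it->Next()) {
    // process it->key() / it->value() within the "a"-prefixed range
  }
  assert(it->status().ok());
}
```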
|
|
|
|
|
|
|
ub = "a";
|
|
|
|
iterator->SeekToLast();
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("a1", iterator->key().ToString());
|
|
|
|
|
|
|
|
iterator->SeekToFirst();
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("y1", iterator->key().ToString());
|
|
|
|
}
|
|
|
|
|
|
|
|
// Now something a bit different, related to "short" keys that
|
|
|
|
// auto_prefix_mode can omit. See "BUG" section of auto_prefix_mode.
|
|
|
|
options.comparator = BytewiseComparator();
|
|
|
|
for (const auto config : {"fixed:2", "capped:2"}) {
|
|
|
|
ASSERT_OK(SliceTransform::CreateFromString(ConfigOptions(), config,
|
|
|
|
&options.prefix_extractor));
|
|
|
|
|
|
|
|
// FIXME: kHashSearch, etc. requires all keys to be InDomain
|
|
|
|
if (StartsWith(config, "fixed") &&
|
|
|
|
(table_options.index_type == BlockBasedTableOptions::kHashSearch ||
|
|
|
|
StartsWith(options.memtable_factory->Name(), "Hash"))) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
const char* a_end_stuff = "a\xffXYZ";
|
|
|
|
const char* b_begin_stuff = "b\x00XYZ";
|
|
|
|
ASSERT_OK(Put("a", large_value));
|
|
|
|
ASSERT_OK(Put("b", large_value));
|
|
|
|
ASSERT_OK(Put(Slice(b_begin_stuff, 3), large_value));
|
|
|
|
ASSERT_OK(Put("c", large_value));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
// control showing valid optimization with auto_prefix_mode
|
|
|
|
ub = Slice(a_end_stuff, 4);
|
|
|
|
ro.iterate_upper_bound = &ub;
|
|
|
|
|
|
|
|
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
|
|
|
|
iterator->Seek(Slice(a_end_stuff, 2));
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(1, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
|
|
|
|
|
|
|
// test: cannot be validly optimized with auto_prefix_mode
|
|
|
|
ub = Slice(b_begin_stuff, 2);
|
|
|
|
ro.iterate_upper_bound = &ub;
|
|
|
|
|
|
|
|
iterator->Seek(Slice(a_end_stuff, 2));
|
|
|
|
// !!! BUG !!! See "BUG" section of auto_prefix_mode.
|
|
|
|
ASSERT_FALSE(iterator->Valid());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(1, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
|
|
|
|
|
|
|
// To prove that this is the wrong result, now use total order seek
|
|
|
|
ReadOptions tos_ro = ro;
|
|
|
|
tos_ro.total_order_seek = true;
|
|
|
|
tos_ro.auto_prefix_mode = false;
|
|
|
|
iterator.reset(db_->NewIterator(tos_ro));
|
|
|
|
iterator->Seek(Slice(a_end_stuff, 2));
|
|
|
|
ASSERT_TRUE(iterator->Valid());
|
|
|
|
ASSERT_EQ("b", iterator->key().ToString());
|
2023-05-19 22:25:49 +00:00
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, hit_stat));
|
|
|
|
EXPECT_EQ(0, TestGetAndResetTickerCount(options, miss_stat));
|
2022-06-13 18:08:50 +00:00
|
|
|
ASSERT_OK(iterator->status());
|
|
|
|
}
|
2022-05-19 20:09:03 +00:00
|
|
|
} while (ChangeOptions(kSkipPlainTable));
|
2020-01-28 22:42:21 +00:00
|
|
|
}
|
Handle rename() failure in non-local FS (#8192)
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client may still return a non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
This is problematic in a distributed environment. If the server side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
POSIX-compliant FS, it should still point to the old MANIFEST, but it does not matter if we keep the
new MANIFEST.) Therefore, we keep the new MANIFEST.
- Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
- If the process reopens the db immediately after the failure, then the CURRENT file can point
to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
succeed and ignore the other.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D27804648
Pulled By: riversand963
fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
2021-04-20 01:10:23 +00:00
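The cleanup rule in the summary can be condensed into a small conceptual sketch; the struct and function names here are purely illustrative and this is not the actual ProcessManifestWrites() code.
```
// Conceptual sketch of the MANIFEST cleanup decision described above.
// Names are illustrative; this is not the actual RocksDB implementation.
#include <string>

struct ManifestSwitchState {
  bool new_manifest_io_ok = true;  // tracked across write/sync/rename calls
  std::string new_manifest_path;
};

// Called on the error path after attempting to switch to a new MANIFEST.
// Returns true if it is safe to delete the new MANIFEST file.
bool SafeToDeleteNewManifest(const ManifestSwitchState& state) {
  // If any IO on the new MANIFEST failed, CURRENT must still reference the
  // old MANIFEST, so deleting the new one cannot lose the CURRENT target.
  // If all IO succeeded, CURRENT may already point at the new MANIFEST (a
  // rename that succeeded remotely but reported an error), so keep it and let
  // a later LogAndApply() or recovery resolve which file to use.
  return !state.new_manifest_io_ok;
}
```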
|
|
|
|
|
|
|
class RenameCurrentTest : public DBTestBase,
|
|
|
|
public testing::WithParamInterface<std::string> {
|
|
|
|
public:
|
|
|
|
RenameCurrentTest()
|
|
|
|
: DBTestBase("rename_current_test", /*env_do_fsync=*/true),
|
|
|
|
sync_point_(GetParam()) {}
|
|
|
|
|
2024-01-05 19:53:57 +00:00
|
|
|
~RenameCurrentTest() override = default;
|
2021-04-20 01:10:23 +00:00
|
|
|
|
|
|
|
void SetUp() override {
|
|
|
|
env_->no_file_overwrite_.store(true, std::memory_order_release);
|
|
|
|
}
|
|
|
|
|
|
|
|
void TearDown() override {
|
|
|
|
env_->no_file_overwrite_.store(false, std::memory_order_release);
|
|
|
|
}
|
|
|
|
|
|
|
|
void SetupSyncPoints() {
|
|
|
|
SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
SyncPoint::GetInstance()->SetCallBack(sync_point_, [&](void* arg) {
|
|
|
|
Status* s = reinterpret_cast<Status*>(arg);
|
|
|
|
assert(s);
|
|
|
|
*s = Status::IOError("Injected IO error.");
|
|
|
|
});
|
|
|
|
}
|
|
|
|
|
|
|
|
const std::string sync_point_;
|
|
|
|
};
|
|
|
|
|
|
|
|
INSTANTIATE_TEST_CASE_P(DistributedFS, RenameCurrentTest,
|
|
|
|
::testing::Values("SetCurrentFile:BeforeRename",
|
|
|
|
"SetCurrentFile:AfterRename"));
|
|
|
|
|
|
|
|
TEST_P(RenameCurrentTest, Open) {
|
|
|
|
Destroy(last_options_);
|
|
|
|
Options options = GetDefaultOptions();
|
|
|
|
options.create_if_missing = true;
|
|
|
|
SetupSyncPoints();
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
Status s = TryReopen(options);
|
|
|
|
ASSERT_NOK(s);
|
|
|
|
|
|
|
|
SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
Reopen(options);
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_P(RenameCurrentTest, Flush) {
|
|
|
|
Destroy(last_options_);
|
|
|
|
Options options = GetDefaultOptions();
|
|
|
|
options.max_manifest_file_size = 1;
|
|
|
|
options.create_if_missing = true;
|
|
|
|
Reopen(options);
|
|
|
|
ASSERT_OK(Put("key", "value"));
|
|
|
|
SetupSyncPoints();
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
ASSERT_NOK(Flush());
|
|
|
|
|
|
|
|
ASSERT_NOK(Put("foo", "value"));
|
|
|
|
|
|
|
|
SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
Reopen(options);
|
|
|
|
ASSERT_EQ("value", Get("key"));
|
|
|
|
ASSERT_EQ("NOT_FOUND", Get("foo"));
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_P(RenameCurrentTest, Compaction) {
|
|
|
|
Destroy(last_options_);
|
|
|
|
Options options = GetDefaultOptions();
|
|
|
|
options.max_manifest_file_size = 1;
|
|
|
|
options.create_if_missing = true;
|
|
|
|
Reopen(options);
|
|
|
|
ASSERT_OK(Put("a", "a_value"));
|
|
|
|
ASSERT_OK(Put("c", "c_value"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
ASSERT_OK(Put("b", "b_value"));
|
|
|
|
ASSERT_OK(Put("d", "d_value"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
SetupSyncPoints();
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
ASSERT_NOK(db_->CompactRange(CompactRangeOptions(), /*begin=*/nullptr,
|
|
|
|
/*end=*/nullptr));
|
|
|
|
|
|
|
|
ASSERT_NOK(Put("foo", "value"));
|
|
|
|
|
|
|
|
SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
Reopen(options);
|
|
|
|
ASSERT_EQ("NOT_FOUND", Get("foo"));
|
|
|
|
ASSERT_EQ("d_value", Get("d"));
|
|
|
|
}
|
2021-05-17 22:14:34 +00:00
|
|
|
|
2022-08-08 21:36:34 +00:00
|
|
|
TEST_F(DBTest2, LastLevelTemperature) {
|
2022-02-18 18:26:45 +00:00
|
|
|
class TestListener : public EventListener {
|
|
|
|
public:
|
|
|
|
void OnFileReadFinish(const FileOperationInfo& info) override {
|
|
|
|
UpdateFileTemperature(info);
|
|
|
|
}
|
|
|
|
|
|
|
|
void OnFileWriteFinish(const FileOperationInfo& info) override {
|
|
|
|
UpdateFileTemperature(info);
|
|
|
|
}
|
|
|
|
|
|
|
|
void OnFileFlushFinish(const FileOperationInfo& info) override {
|
|
|
|
UpdateFileTemperature(info);
|
|
|
|
}
|
|
|
|
|
|
|
|
void OnFileSyncFinish(const FileOperationInfo& info) override {
|
|
|
|
UpdateFileTemperature(info);
|
|
|
|
}
|
|
|
|
|
|
|
|
void OnFileCloseFinish(const FileOperationInfo& info) override {
|
|
|
|
UpdateFileTemperature(info);
|
|
|
|
}
|
|
|
|
|
|
|
|
bool ShouldBeNotifiedOnFileIO() override { return true; }
|
|
|
|
|
|
|
|
std::unordered_map<uint64_t, Temperature> file_temperatures;
|
|
|
|
|
|
|
|
private:
|
|
|
|
void UpdateFileTemperature(const FileOperationInfo& info) {
|
|
|
|
auto filename = GetFileName(info.path);
|
|
|
|
uint64_t number;
|
|
|
|
FileType type;
|
|
|
|
ASSERT_TRUE(ParseFileName(filename, &number, &type));
|
|
|
|
if (type == kTableFile) {
|
|
|
|
MutexLock l(&mutex_);
|
|
|
|
auto ret = file_temperatures.insert({number, info.temperature});
|
|
|
|
if (!ret.second) {
|
|
|
|
// the temperature of a given file should be the same for all events
|
|
|
|
ASSERT_TRUE(ret.first->second == info.temperature);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
std::string GetFileName(const std::string& fname) {
|
|
|
|
auto filename = fname.substr(fname.find_last_of(kFilePathSeparator) + 1);
|
|
|
|
// Workaround for Windows only: the file path could contain both the
|
|
|
|
// Windows FilePathSeparator and '/'
|
|
|
|
filename = filename.substr(filename.find_last_of('/') + 1);
|
|
|
|
return filename;
|
|
|
|
}
|
|
|
|
|
|
|
|
port::Mutex mutex_;
|
|
|
|
};
|
|
|
|
|
2022-08-08 21:36:34 +00:00
|
|
|
const int kNumLevels = 7;
|
|
|
|
const int kLastLevel = kNumLevels - 1;
|
|
|
|
|
2022-02-18 18:26:45 +00:00
|
|
|
auto* listener = new TestListener();
|
|
|
|
|
2021-05-17 22:14:34 +00:00
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.bottommost_temperature = Temperature::kWarm;
|
|
|
|
options.level0_file_num_compaction_trigger = 2;
|
2022-08-08 21:36:34 +00:00
|
|
|
options.level_compaction_dynamic_level_bytes = true;
|
|
|
|
options.num_levels = kNumLevels;
|
2021-11-16 23:15:48 +00:00
|
|
|
options.statistics = CreateDBStatistics();
|
2022-02-18 18:26:45 +00:00
|
|
|
options.listeners.emplace_back(listener);
|
2021-05-17 22:14:34 +00:00
|
|
|
Reopen(options);
|
|
|
|
|
2021-08-15 21:16:43 +00:00
|
|
|
auto size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kHot);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
|
2021-05-17 22:14:34 +00:00
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
|
|
|
|
2021-10-07 21:57:02 +00:00
|
|
|
get_iostats_context()->Reset();
|
|
|
|
IOStatsContext* iostats = get_iostats_context();
|
|
|
|
|
2021-08-09 20:43:18 +00:00
|
|
|
ColumnFamilyMetaData metadata;
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(1, metadata.file_count);
|
2022-08-08 21:36:34 +00:00
|
|
|
SstFileMetaData meta = metadata.levels[kLastLevel].files[0];
|
2022-02-18 18:26:45 +00:00
|
|
|
ASSERT_EQ(Temperature::kWarm, meta.temperature);
|
|
|
|
uint64_t number;
|
|
|
|
FileType type;
|
|
|
|
ASSERT_TRUE(ParseFileName(meta.name, &number, &type));
|
|
|
|
ASSERT_EQ(listener->file_temperatures.at(number), meta.temperature);
|
|
|
|
|
2021-08-15 21:16:43 +00:00
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_GT(size, 0);
|
2021-10-07 21:57:02 +00:00
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.hot_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.warm_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.cold_file_read_count, 0);
|
2021-11-16 23:15:48 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(WARM_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_BYTES), 0);
|
2021-10-07 21:57:02 +00:00
|
|
|
|
|
|
|
ASSERT_EQ("bar", Get("foo"));
|
|
|
|
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.hot_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.warm_file_read_count, 1);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.cold_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.hot_file_bytes_read, 0);
|
|
|
|
ASSERT_GT(iostats->file_io_stats_by_temperature.warm_file_bytes_read, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.cold_file_bytes_read, 0);
|
2021-11-16 23:15:48 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(WARM_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(WARM_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_COUNT), 0);
|
2021-08-09 20:43:18 +00:00
|
|
|
|
|
|
|
// non-bottommost file still has unknown temperature
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
2021-10-07 21:57:02 +00:00
|
|
|
ASSERT_EQ("bar", Get("bar"));
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.hot_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.warm_file_read_count, 1);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.cold_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.hot_file_bytes_read, 0);
|
|
|
|
ASSERT_GT(iostats->file_io_stats_by_temperature.warm_file_bytes_read, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.cold_file_bytes_read, 0);
|
2021-11-16 23:15:48 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(WARM_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(WARM_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_COUNT), 0);
|
2021-10-07 21:57:02 +00:00
|
|
|
|
2021-08-09 20:43:18 +00:00
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(2, metadata.file_count);
|
2022-02-18 18:26:45 +00:00
|
|
|
meta = metadata.levels[0].files[0];
|
|
|
|
ASSERT_EQ(Temperature::kUnknown, meta.temperature);
|
|
|
|
ASSERT_TRUE(ParseFileName(meta.name, &number, &type));
|
|
|
|
ASSERT_EQ(listener->file_temperatures.at(number), meta.temperature);
|
|
|
|
|
2022-08-08 21:36:34 +00:00
|
|
|
meta = metadata.levels[kLastLevel].files[0];
|
2022-02-18 18:26:45 +00:00
|
|
|
ASSERT_EQ(Temperature::kWarm, meta.temperature);
|
|
|
|
ASSERT_TRUE(ParseFileName(meta.name, &number, &type));
|
|
|
|
ASSERT_EQ(listener->file_temperatures.at(number), meta.temperature);
|
|
|
|
|
2021-08-15 21:16:43 +00:00
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_GT(size, 0);
|
2021-08-09 20:43:18 +00:00
|
|
|
|
|
|
|
// reopen and check the information is persisted
|
2021-05-17 22:14:34 +00:00
|
|
|
Reopen(options);
|
2021-08-09 20:43:18 +00:00
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(2, metadata.file_count);
|
2022-02-18 18:26:45 +00:00
|
|
|
meta = metadata.levels[0].files[0];
|
|
|
|
ASSERT_EQ(Temperature::kUnknown, meta.temperature);
|
|
|
|
ASSERT_TRUE(ParseFileName(meta.name, &number, &type));
|
|
|
|
ASSERT_EQ(listener->file_temperatures.at(number), meta.temperature);
|
|
|
|
|
2022-08-08 21:36:34 +00:00
|
|
|
meta = metadata.levels[kLastLevel].files[0];
|
2022-02-18 18:26:45 +00:00
|
|
|
ASSERT_EQ(Temperature::kWarm, meta.temperature);
|
|
|
|
ASSERT_TRUE(ParseFileName(meta.name, &number, &type));
|
|
|
|
ASSERT_EQ(listener->file_temperatures.at(number), meta.temperature);
|
2021-08-15 21:16:43 +00:00
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
|
|
|
|
// check other non-existent temperatures
|
|
|
|
size = GetSstSizeHelper(Temperature::kHot);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kCold);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
std::string prop;
|
|
|
|
ASSERT_TRUE(dbfull()->GetProperty(
|
|
|
|
DB::Properties::kLiveSstFilesSizeAtTemperature + std::to_string(22),
|
|
|
|
&prop));
|
|
|
|
ASSERT_EQ(std::atoi(prop.c_str()), 0);
|
2021-12-03 22:42:05 +00:00
|
|
|
|
|
|
|
Reopen(options);
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(2, metadata.file_count);
|
2022-02-18 18:26:45 +00:00
|
|
|
meta = metadata.levels[0].files[0];
|
|
|
|
ASSERT_EQ(Temperature::kUnknown, meta.temperature);
|
|
|
|
ASSERT_TRUE(ParseFileName(meta.name, &number, &type));
|
|
|
|
ASSERT_EQ(listener->file_temperatures.at(number), meta.temperature);
|
|
|
|
|
2022-08-08 21:36:34 +00:00
|
|
|
meta = metadata.levels[kLastLevel].files[0];
|
2022-02-18 18:26:45 +00:00
|
|
|
ASSERT_EQ(Temperature::kWarm, meta.temperature);
|
|
|
|
ASSERT_TRUE(ParseFileName(meta.name, &number, &type));
|
|
|
|
ASSERT_EQ(listener->file_temperatures.at(number), meta.temperature);
|
2021-08-09 20:43:18 +00:00
|
|
|
}
|
|
|
|
|
2022-08-08 21:36:34 +00:00
|
|
|
TEST_F(DBTest2, LastLevelTemperatureUniversal) {
|
2021-08-09 20:43:18 +00:00
|
|
|
const int kTriggerNum = 3;
|
|
|
|
const int kNumLevels = 5;
|
|
|
|
const int kBottommostLevel = kNumLevels - 1;
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.compaction_style = kCompactionStyleUniversal;
|
|
|
|
options.level0_file_num_compaction_trigger = kTriggerNum;
|
|
|
|
options.num_levels = kNumLevels;
|
2021-11-16 23:15:48 +00:00
|
|
|
options.statistics = CreateDBStatistics();
|
2021-08-09 20:43:18 +00:00
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
2021-08-15 21:16:43 +00:00
|
|
|
auto size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kHot);
|
|
|
|
ASSERT_EQ(size, 0);
|
2021-10-07 21:57:02 +00:00
|
|
|
get_iostats_context()->Reset();
|
|
|
|
IOStatsContext* iostats = get_iostats_context();
|
2021-08-15 21:16:43 +00:00
|
|
|
|
2021-08-09 20:43:18 +00:00
|
|
|
for (int i = 0; i < kTriggerNum; i++) {
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
2021-05-17 22:14:34 +00:00
|
|
|
|
|
|
|
ColumnFamilyMetaData metadata;
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(1, metadata.file_count);
|
2021-08-09 20:43:18 +00:00
|
|
|
ASSERT_EQ(Temperature::kUnknown,
|
|
|
|
metadata.levels[kBottommostLevel].files[0].temperature);
|
2021-08-15 21:16:43 +00:00
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_EQ(size, 0);
|
2021-10-07 21:57:02 +00:00
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.hot_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.warm_file_read_count, 0);
|
2022-10-22 15:57:38 +00:00
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.cold_file_read_count, 0);
|
2021-11-16 23:15:48 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(WARM_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(WARM_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_COUNT), 0);
|
2021-10-07 21:57:02 +00:00
|
|
|
ASSERT_EQ("bar", Get("foo"));
|
|
|
|
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.hot_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.warm_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.cold_file_read_count, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.hot_file_bytes_read, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.warm_file_bytes_read, 0);
|
|
|
|
ASSERT_EQ(iostats->file_io_stats_by_temperature.cold_file_bytes_read, 0);
|
2021-11-16 23:15:48 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(WARM_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(WARM_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_COUNT), 0);
|
2021-08-09 20:43:18 +00:00
|
|
|
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(2, metadata.file_count);
|
|
|
|
ASSERT_EQ(Temperature::kUnknown, metadata.levels[0].files[0].temperature);
|
2021-08-15 21:16:43 +00:00
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_EQ(size, 0);
|
2021-08-09 20:43:18 +00:00
|
|
|
|
|
|
|
// Update bottommost temperature
|
|
|
|
options.bottommost_temperature = Temperature::kWarm;
|
|
|
|
Reopen(options);
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
// Should not impact existing ones
|
|
|
|
ASSERT_EQ(Temperature::kUnknown,
|
|
|
|
metadata.levels[kBottommostLevel].files[0].temperature);
|
2021-08-15 21:16:43 +00:00
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_EQ(size, 0);
|
2021-08-09 20:43:18 +00:00
|
|
|
|
|
|
|
// newly generated file should have the new settings
|
|
|
|
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(1, metadata.file_count);
|
|
|
|
ASSERT_EQ(Temperature::kWarm,
|
|
|
|
metadata.levels[kBottommostLevel].files[0].temperature);
|
2021-08-15 21:16:43 +00:00
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_GT(size, 0);
|
2021-11-16 23:15:48 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(WARM_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(HOT_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(WARM_FILE_READ_COUNT), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(COLD_FILE_READ_COUNT), 0);
|
2021-08-09 20:43:18 +00:00
|
|
|
|
|
|
|
// non-bottommost file still has unknown temperature
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(2, metadata.file_count);
|
|
|
|
ASSERT_EQ(Temperature::kUnknown, metadata.levels[0].files[0].temperature);
|
2021-08-15 21:16:43 +00:00
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
|
|
|
|
// check other non-existent temperatures
|
|
|
|
size = GetSstSizeHelper(Temperature::kHot);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kCold);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
std::string prop;
|
|
|
|
ASSERT_TRUE(dbfull()->GetProperty(
|
|
|
|
DB::Properties::kLiveSstFilesSizeAtTemperature + std::to_string(22),
|
|
|
|
&prop));
|
|
|
|
ASSERT_EQ(std::atoi(prop.c_str()), 0);
|
2022-01-25 22:58:48 +00:00
|
|
|
|
|
|
|
// Update bottommost temperature dynamically with SetOptions
|
2022-08-08 21:36:34 +00:00
|
|
|
auto s = db_->SetOptions({{"last_level_temperature", "kCold"}});
|
2022-01-25 22:58:48 +00:00
|
|
|
ASSERT_OK(s);
|
|
|
|
ASSERT_EQ(db_->GetOptions().bottommost_temperature, Temperature::kCold);
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
// Should not impact the existing files
|
|
|
|
ASSERT_EQ(Temperature::kWarm,
|
|
|
|
metadata.levels[kBottommostLevel].files[0].temperature);
|
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kCold);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
|
|
|
|
// newly generated files should have the new settings
|
|
|
|
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
|
|
|
|
db_->GetColumnFamilyMetaData(&metadata);
|
|
|
|
ASSERT_EQ(1, metadata.file_count);
|
|
|
|
ASSERT_EQ(Temperature::kCold,
|
|
|
|
metadata.levels[kBottommostLevel].files[0].temperature);
|
|
|
|
size = GetSstSizeHelper(Temperature::kUnknown);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_EQ(size, 0);
|
|
|
|
size = GetSstSizeHelper(Temperature::kCold);
|
|
|
|
ASSERT_GT(size, 0);
|
2022-06-02 20:10:49 +00:00
|
|
|
|
|
|
|
// kLastTemperature is an invalid temperature
|
|
|
|
options.bottommost_temperature = Temperature::kLastTemperature;
|
|
|
|
s = TryReopen(options);
|
|
|
|
ASSERT_TRUE(s.IsIOError());
|
2021-05-17 22:14:34 +00:00
|
|
|
}
|
2022-02-18 21:35:36 +00:00
|
|
|
|
|
|
|
TEST_F(DBTest2, LastLevelStatistics) {
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.bottommost_temperature = Temperature::kWarm;
|
2023-08-21 19:14:03 +00:00
|
|
|
options.default_temperature = Temperature::kHot;
|
2022-02-18 21:35:36 +00:00
|
|
|
options.level0_file_num_compaction_trigger = 2;
|
|
|
|
options.level_compaction_dynamic_level_bytes = true;
|
|
|
|
options.statistics = CreateDBStatistics();
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
// generate 1 sst on level 0
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_EQ("bar", Get("bar"));
|
|
|
|
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_BYTES), 0);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_COUNT), 0);
|
2023-08-21 19:14:03 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_BYTES),
|
|
|
|
options.statistics->getTickerCount(HOT_FILE_READ_BYTES));
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_COUNT),
|
|
|
|
options.statistics->getTickerCount(HOT_FILE_READ_COUNT));
|
2022-02-18 21:35:36 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(LAST_LEVEL_READ_BYTES), 0);
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(LAST_LEVEL_READ_COUNT), 0);
|
|
|
|
|
|
|
|
// 2nd flush to trigger compaction
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
|
|
|
ASSERT_EQ("bar", Get("bar"));
|
|
|
|
|
2023-08-21 19:14:03 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_BYTES),
|
|
|
|
options.statistics->getTickerCount(HOT_FILE_READ_BYTES));
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_COUNT),
|
|
|
|
options.statistics->getTickerCount(HOT_FILE_READ_COUNT));
|
2022-02-18 21:35:36 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(LAST_LEVEL_READ_BYTES),
|
|
|
|
options.statistics->getTickerCount(WARM_FILE_READ_BYTES));
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(LAST_LEVEL_READ_COUNT),
|
|
|
|
options.statistics->getTickerCount(WARM_FILE_READ_COUNT));
|
|
|
|
|
|
|
|
auto pre_bytes =
|
|
|
|
options.statistics->getTickerCount(NON_LAST_LEVEL_READ_BYTES);
|
|
|
|
auto pre_count =
|
|
|
|
options.statistics->getTickerCount(NON_LAST_LEVEL_READ_COUNT);
|
|
|
|
|
|
|
|
// 3rd flush to generate 1 sst on level 0
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_EQ("bar", Get("bar"));
|
|
|
|
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_BYTES),
|
|
|
|
pre_bytes);
|
|
|
|
ASSERT_GT(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_COUNT),
|
|
|
|
pre_count);
|
2023-08-21 19:14:03 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_BYTES),
|
|
|
|
options.statistics->getTickerCount(HOT_FILE_READ_BYTES));
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_COUNT),
|
|
|
|
options.statistics->getTickerCount(HOT_FILE_READ_COUNT));
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(LAST_LEVEL_READ_BYTES),
|
|
|
|
options.statistics->getTickerCount(WARM_FILE_READ_BYTES));
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(LAST_LEVEL_READ_COUNT),
|
|
|
|
options.statistics->getTickerCount(WARM_FILE_READ_COUNT));
|
|
|
|
|
|
|
|
// Not a realistic setting to make last level kWarm and default temp kCold.
|
|
|
|
// This is just for testing that the default temp can be reset on reopen while
|
|
|
|
// the last level temp stays consistent across DB reopen because those files'
|
|
|
|
// temperatures are persisted in the manifest.
|
|
|
|
options.default_temperature = Temperature::kCold;
|
|
|
|
ASSERT_OK(options.statistics->Reset());
|
|
|
|
Reopen(options);
|
|
|
|
ASSERT_EQ("bar", Get("bar"));
|
|
|
|
|
|
|
|
ASSERT_EQ(0, options.statistics->getTickerCount(HOT_FILE_READ_BYTES));
|
|
|
|
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_BYTES),
|
|
|
|
options.statistics->getTickerCount(COLD_FILE_READ_BYTES));
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(NON_LAST_LEVEL_READ_COUNT),
|
|
|
|
options.statistics->getTickerCount(COLD_FILE_READ_COUNT));
|
2022-02-18 21:35:36 +00:00
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(LAST_LEVEL_READ_BYTES),
|
|
|
|
options.statistics->getTickerCount(WARM_FILE_READ_BYTES));
|
|
|
|
ASSERT_EQ(options.statistics->getTickerCount(LAST_LEVEL_READ_COUNT),
|
|
|
|
options.statistics->getTickerCount(WARM_FILE_READ_COUNT));
|
|
|
|
}
|
2022-02-19 02:18:49 +00:00
|
|
|
|
2022-03-04 20:32:30 +00:00
|
|
|
TEST_F(DBTest2, CheckpointFileTemperature) {
|
|
|
|
class NoLinkTestFS : public FileTemperatureTestFS {
|
|
|
|
using FileTemperatureTestFS::FileTemperatureTestFS;
|
2022-02-22 02:50:50 +00:00
|
|
|
|
2022-03-04 20:32:30 +00:00
|
|
|
IOStatus LinkFile(const std::string&, const std::string&, const IOOptions&,
|
|
|
|
IODebugContext*) override {
|
|
|
|
// return NotSupported to force the checkpoint to copy the file instead of
|
|
|
|
// just linking it
|
|
|
|
return IOStatus::NotSupported();
|
2022-02-19 02:18:49 +00:00
|
|
|
}
|
2022-03-04 20:32:30 +00:00
|
|
|
};
|
|
|
|
auto test_fs = std::make_shared<NoLinkTestFS>(env_->GetFileSystem());
|
2022-02-19 02:18:49 +00:00
|
|
|
std::unique_ptr<Env> env(new CompositeEnvWrapper(env_, test_fs));
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.bottommost_temperature = Temperature::kWarm;
|
2022-08-08 21:36:34 +00:00
|
|
|
// set dynamic_level to true so compaction compacts the data directly to the
|
|
|
|
// last level, which will have the last_level_temperature
|
|
|
|
options.level_compaction_dynamic_level_bytes = true;
|
2022-02-19 02:18:49 +00:00
|
|
|
options.level0_file_num_compaction_trigger = 2;
|
|
|
|
options.env = env.get();
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
// generate a bottommost file and a non-bottommost file
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
|
|
|
ASSERT_OK(Put("foo", "bar"));
|
|
|
|
ASSERT_OK(Put("bar", "bar"));
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
auto size = GetSstSizeHelper(Temperature::kWarm);
|
|
|
|
ASSERT_GT(size, 0);
|
|
|
|
|
|
|
|
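
  // Record the manifest-reported temperature of each live SST so the
  // temperature hints requested during the checkpoint below can be checked
  // against them.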
  std::map<uint64_t, Temperature> temperatures;
  std::vector<LiveFileStorageInfo> infos;
  ASSERT_OK(
      dbfull()->GetLiveFilesStorageInfo(LiveFilesStorageInfoOptions(), &infos));
  for (const auto& info : infos) {
    temperatures.emplace(info.file_number, info.temperature);
  }
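
  // FileTemperatureTestFS (see #9660) records the temperature hint passed
  // with each SST read request. Drain anything recorded so far so that only
  // the reads issued by the checkpoint below are observed.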
  test_fs->PopRequestedSstFileTemperatures();
  Checkpoint* checkpoint;
  ASSERT_OK(Checkpoint::Create(db_, &checkpoint));
  ASSERT_OK(
      checkpoint->CreateCheckpoint(dbname_ + kFilePathSeparator + "tempcp"));

  // Check the src_temperature hints on the source files: 2 SST files, one
  // kWarm and one kUnknown.
  std::vector<std::pair<uint64_t, Temperature>> requested_temps;
  test_fs->PopRequestedSstFileTemperatures(&requested_temps);
  // Two requests
  ASSERT_EQ(requested_temps.size(), 2);
  std::set<uint64_t> distinct_requests;
  for (const auto& requested_temp : requested_temps) {
    // Matching manifest temperatures
    ASSERT_EQ(temperatures.at(requested_temp.first), requested_temp.second);
    distinct_requests.insert(requested_temp.first);
  }
  // Each request to distinct file
  ASSERT_EQ(distinct_requests.size(), requested_temps.size());

  delete checkpoint;
  Close();
}

TEST_F(DBTest2, FileTemperatureManifestFixup) {
  auto test_fs = std::make_shared<FileTemperatureTestFS>(env_->GetFileSystem());
  std::unique_ptr<Env> env(new CompositeEnvWrapper(env_, test_fs));
  Options options = CurrentOptions();
  options.bottommost_temperature = Temperature::kWarm;
  // Set dynamic_level to true so compaction compacts the data directly to the
  // last level, which carries the last_level_temperature.
  options.level_compaction_dynamic_level_bytes = true;
  options.level0_file_num_compaction_trigger = 2;
  options.env = env.get();
  std::vector<std::string> cfs = {/*"default",*/ "test1", "test2"};
  CreateAndReopenWithCF(cfs, options);
  // Needed for later re-opens (weird)
  cfs.insert(cfs.begin(), kDefaultColumnFamilyName);

  // Generate a bottommost file in all CFs
  for (int cf = 0; cf < 3; ++cf) {
    ASSERT_OK(Put(cf, "a", "val"));
    ASSERT_OK(Put(cf, "c", "val"));
    ASSERT_OK(Flush(cf));
    ASSERT_OK(Put(cf, "b", "val"));
    ASSERT_OK(Put(cf, "d", "val"));
    ASSERT_OK(Flush(cf));
  }
  ASSERT_OK(dbfull()->TEST_WaitForCompact());

  // verify
  ASSERT_GT(GetSstSizeHelper(Temperature::kWarm), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kUnknown), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kCold), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kHot), 0);
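  // (All data has been compacted to the last level, so every live SST is
  // tracked as kWarm in the manifest at this point.)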

  // Generate a non-bottommost file in all CFs
  for (int cf = 0; cf < 3; ++cf) {
    ASSERT_OK(Put(cf, "e", "val"));
    ASSERT_OK(Flush(cf));
  }

  // re-verify
  ASSERT_GT(GetSstSizeHelper(Temperature::kWarm), 0);
  // Not supported: ASSERT_GT(GetSstSizeHelper(Temperature::kUnknown), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kCold), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kHot), 0);

  // Now change FS temperature on bottommost file(s) to kCold
  std::map<uint64_t, Temperature> current_temps;
  test_fs->CopyCurrentSstFileTemperatures(&current_temps);
  for (auto e : current_temps) {
    if (e.second == Temperature::kWarm) {
      test_fs->OverrideSstFileTemperature(e.first, Temperature::kCold);
    }
  }
  // Metadata not yet updated
  ASSERT_EQ(Get("a"), "val");
  ASSERT_EQ(GetSstSizeHelper(Temperature::kCold), 0);
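  // The manifest still records kWarm for these files; only the temperature
  // reported by the file system has changed.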

  // Update with Close and UpdateManifestForFilesState, but first save cf
  // descriptors
  std::vector<ColumnFamilyDescriptor> column_families;
  for (size_t i = 0; i < handles_.size(); ++i) {
    ColumnFamilyDescriptor cfdescriptor;
    handles_[i]->GetDescriptor(&cfdescriptor).PermitUncheckedError();
    column_families.push_back(cfdescriptor);
  }
  Close();
  experimental::UpdateManifestForFilesStateOptions update_opts;
  update_opts.update_temperatures = true;
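
  // With update_temperatures set, UpdateManifestForFilesState rewrites the
  // MANIFEST so that each file's recorded temperature matches what the FS
  // currently reports.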
  ASSERT_OK(experimental::UpdateManifestForFilesState(
      options, dbname_, column_families, update_opts));

  // Re-open and re-verify after update
  ReopenWithColumnFamilies(cfs, options);
  ASSERT_GT(GetSstSizeHelper(Temperature::kCold), 0);
  // Not supported: ASSERT_GT(GetSstSizeHelper(Temperature::kUnknown), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kWarm), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kHot), 0);

  // Change kUnknown to kHot
  test_fs->CopyCurrentSstFileTemperatures(&current_temps);
  for (auto e : current_temps) {
    if (e.second == Temperature::kUnknown) {
      test_fs->OverrideSstFileTemperature(e.first, Temperature::kHot);
    }
  }

  // Update with Close and UpdateManifestForFilesState
  Close();
  ASSERT_OK(experimental::UpdateManifestForFilesState(
      options, dbname_, column_families, update_opts));

  // Re-open and re-verify after update
  ReopenWithColumnFamilies(cfs, options);
  ASSERT_GT(GetSstSizeHelper(Temperature::kCold), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kUnknown), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kWarm), 0);
  ASSERT_GT(GetSstSizeHelper(Temperature::kHot), 0);

  Close();
}

// WAL recovery mode is WALRecoveryMode::kPointInTimeRecovery.
TEST_F(DBTest2, PointInTimeRecoveryWithIOErrorWhileReadingWal) {
  Options options = CurrentOptions();
  DestroyAndReopen(options);
  ASSERT_OK(Put("foo", "value0"));
  Close();
  SyncPoint::GetInstance()->DisableProcessing();
  SyncPoint::GetInstance()->ClearAllCallBacks();
  bool should_inject_error = false;
  SyncPoint::GetInstance()->SetCallBack(
      "DBImpl::RecoverLogFiles:BeforeReadWal",
      [&](void* /*arg*/) { should_inject_error = true; });
  SyncPoint::GetInstance()->SetCallBack(
      "LogReader::ReadMore:AfterReadFile", [&](void* arg) {
        if (should_inject_error) {
          ASSERT_NE(nullptr, arg);
          *reinterpret_cast<Status*>(arg) = Status::IOError("Injected IOError");
        }
      });
  SyncPoint::GetInstance()->EnableProcessing();
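  // With kPointInTimeRecovery, a true I/O error (as opposed to a corrupted
  // record) while reading the WAL should fail DB open rather than be treated
  // as the end of the log.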
  options.avoid_flush_during_recovery = true;
  options.wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
  Status s = TryReopen(options);
  ASSERT_TRUE(s.IsIOError());
}

// Regression test for a false-positive "SST file is ahead of WALs" alert
// (#8207): when a new, empty column family is created while a flush is in
// progress and the older WAL is then corrupted, point-in-time recovery
// should not report an inconsistency for CFs that have no SST data.
TEST_F(DBTest2, PointInTimeRecoveryWithSyncFailureInCFCreation) {
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
      {{"DBImpl::BackgroundCallFlush:Start:1",
        "PointInTimeRecoveryWithSyncFailureInCFCreation:1"},
       {"PointInTimeRecoveryWithSyncFailureInCFCreation:2",
        "DBImpl::BackgroundCallFlush:Start:2"}});
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
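
  // The dependencies above make the background flush pause right after it
  // starts and resume only after the second CF has been created and sync
  // corruption has been enabled below.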
  CreateColumnFamilies({"test1"}, Options());
  ASSERT_OK(Put("foo", "bar"));

  // Creating a CF when a flush is going on, log is synced but the
  // closed log file is not synced and corrupted.
  port::Thread flush_thread([&]() { ASSERT_NOK(Flush()); });
  TEST_SYNC_POINT("PointInTimeRecoveryWithSyncFailureInCFCreation:1");
  CreateColumnFamilies({"test2"}, Options());
  env_->corrupt_in_sync_ = true;
  TEST_SYNC_POINT("PointInTimeRecoveryWithSyncFailureInCFCreation:2");
  flush_thread.join();
  env_->corrupt_in_sync_ = false;
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();

  // Reopening the DB should not corrupt anything
  Options options = CurrentOptions();
  options.wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
  ReopenWithColumnFamilies({"default", "test1", "test2"}, options);
}

TEST_F(DBTest2, SortL0FilesByEpochNumber) {
  Options options = CurrentOptions();
  options.num_levels = 1;
  options.compaction_style = kCompactionStyleUniversal;
  DestroyAndReopen(options);

  // Set up L0 files to be sorted by their epoch_number
  ASSERT_OK(Put("key1", "seq1"));

  SstFileWriter sst_file_writer{EnvOptions(), options};
  std::string external_file1 = dbname_ + "/test_files1.sst";
  std::string external_file2 = dbname_ + "/test_files2.sst";
  ASSERT_OK(sst_file_writer.Open(external_file1));
  ASSERT_OK(sst_file_writer.Put("key2", "seq0"));
  ASSERT_OK(sst_file_writer.Finish());
  ASSERT_OK(sst_file_writer.Open(external_file2));
  ASSERT_OK(sst_file_writer.Put("key3", "seq0"));
  ASSERT_OK(sst_file_writer.Finish());

  ASSERT_OK(Put("key4", "seq2"));
  ASSERT_OK(Flush());

  auto* handle = db_->DefaultColumnFamily();
  ASSERT_OK(db_->IngestExternalFile(handle, {external_file1, external_file2},
                                    IngestExternalFileOptions()));
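
  // Expected epochs: the flushed memtable file gets epoch 1, and the two
  // ingested files get epochs 2 and 3 in ingestion order, even though their
  // global seqnos are 0.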
  // To verify L0 files are sorted by epoch_number in descending order
  // instead of largest_seqno
  std::vector<FileMetaData*> level0_files = GetLevelFileMetadatas(0 /* level*/);
  ASSERT_EQ(level0_files.size(), 3);

  EXPECT_EQ(level0_files[0]->epoch_number, 3);
  EXPECT_EQ(level0_files[0]->fd.largest_seqno, 0);
  ASSERT_EQ(level0_files[0]->num_entries, 1);
  ASSERT_TRUE(level0_files[0]->largest.user_key() == Slice("key3"));

  EXPECT_EQ(level0_files[1]->epoch_number, 2);
  EXPECT_EQ(level0_files[1]->fd.largest_seqno, 0);
  ASSERT_EQ(level0_files[1]->num_entries, 1);
  ASSERT_TRUE(level0_files[1]->largest.user_key() == Slice("key2"));

  EXPECT_EQ(level0_files[2]->epoch_number, 1);
  EXPECT_EQ(level0_files[2]->fd.largest_seqno, 2);
  ASSERT_EQ(level0_files[2]->num_entries, 2);
  ASSERT_TRUE(level0_files[2]->largest.user_key() == Slice("key4"));
  ASSERT_TRUE(level0_files[2]->smallest.user_key() == Slice("key1"));

  // To verify compacted file is assigned with the minimum epoch_number
  // among input files'
  ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));

  level0_files = GetLevelFileMetadatas(0 /* level*/);
  ASSERT_EQ(level0_files.size(), 1);
  EXPECT_EQ(level0_files[0]->epoch_number, 1);
  ASSERT_EQ(level0_files[0]->num_entries, 4);
  ASSERT_TRUE(level0_files[0]->largest.user_key() == Slice("key4"));
  ASSERT_TRUE(level0_files[0]->smallest.user_key() == Slice("key1"));
}

TEST_F(DBTest2, SameEpochNumberAfterCompactRangeChangeLevel) {
  Options options = CurrentOptions();
  options.num_levels = 7;
  options.compaction_style = CompactionStyle::kCompactionStyleLevel;
  options.disable_auto_compactions = true;
  DestroyAndReopen(options);

  // Set up the file in L1 to be moved to L0 in later step of CompactRange()
  ASSERT_OK(Put("key1", "seq1"));
  ASSERT_OK(Flush());
  MoveFilesToLevel(1, 0);
  std::vector<FileMetaData*> level0_files = GetLevelFileMetadatas(0 /* level*/);
  ASSERT_EQ(level0_files.size(), 0);
  std::vector<FileMetaData*> level1_files = GetLevelFileMetadatas(1 /* level*/);
  ASSERT_EQ(level1_files.size(), 1);
  std::vector<FileMetaData*> level2_files = GetLevelFileMetadatas(2 /* level*/);
  ASSERT_EQ(level2_files.size(), 0);

  ASSERT_EQ(level1_files[0]->epoch_number, 1);

  // To verify CompactRange() moving file to L0 still keeps the file's
  // epoch_number
  CompactRangeOptions croptions;
  croptions.change_level = true;
  croptions.target_level = 0;
  ASSERT_OK(db_->CompactRange(croptions, nullptr, nullptr));
  level0_files = GetLevelFileMetadatas(0 /* level*/);
  level1_files = GetLevelFileMetadatas(1 /* level*/);
  ASSERT_EQ(level0_files.size(), 1);
  ASSERT_EQ(level1_files.size(), 0);

  EXPECT_EQ(level0_files[0]->epoch_number, 1);

  ASSERT_EQ(level0_files[0]->num_entries, 1);
  ASSERT_TRUE(level0_files[0]->largest.user_key() == Slice("key1"));
}

TEST_F(DBTest2, RecoverEpochNumber) {
  for (bool allow_ingest_behind : {true, false}) {
    Options options = CurrentOptions();
    options.allow_ingest_behind = allow_ingest_behind;
    options.num_levels = 7;
    options.compaction_style = kCompactionStyleLevel;
    options.disable_auto_compactions = true;
    DestroyAndReopen(options);
    CreateAndReopenWithCF({"cf1"}, options);
    VersionSet* versions = dbfull()->GetVersionSet();
    assert(versions);
    const ColumnFamilyData* default_cf =
        versions->GetColumnFamilySet()->GetDefault();
    const ColumnFamilyData* cf1 =
        versions->GetColumnFamilySet()->GetColumnFamily("cf1");

    // Set up files in default CF to recover in later step
    ASSERT_OK(Put("key1", "epoch1"));
    ASSERT_OK(Flush());
    MoveFilesToLevel(1 /* level*/, 0 /* cf*/);
    ASSERT_OK(Put("key2", "epoch2"));
    ASSERT_OK(Flush());
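
    // When allow_ingest_behind is set, regular epoch numbers are offset by
    // kReservedEpochNumberForFileIngestedBehind, which is reserved for files
    // ingested behind; otherwise they start from 1.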
    std::vector<FileMetaData*> level0_files =
        GetLevelFileMetadatas(0 /* level*/);
    ASSERT_EQ(level0_files.size(), 1);
    ASSERT_EQ(level0_files[0]->epoch_number,
              allow_ingest_behind
                  ? 2 + kReservedEpochNumberForFileIngestedBehind
                  : 2);
    ASSERT_EQ(level0_files[0]->num_entries, 1);
    ASSERT_TRUE(level0_files[0]->largest.user_key() == Slice("key2"));

    std::vector<FileMetaData*> level1_files =
        GetLevelFileMetadatas(1 /* level*/);
    ASSERT_EQ(level1_files.size(), 1);
    ASSERT_EQ(level1_files[0]->epoch_number,
              allow_ingest_behind
                  ? 1 + kReservedEpochNumberForFileIngestedBehind
                  : 1);
    ASSERT_EQ(level1_files[0]->num_entries, 1);
    ASSERT_TRUE(level1_files[0]->largest.user_key() == Slice("key1"));

    // Set up files in cf1 to recover in later step
    ASSERT_OK(Put(1 /* cf */, "cf1_key1", "epoch1"));
    ASSERT_OK(Flush(1 /* cf */));

    std::vector<FileMetaData*> level0_files_cf1 =
        GetLevelFileMetadatas(0 /* level*/, 1 /* cf*/);
    ASSERT_EQ(level0_files_cf1.size(), 1);
    ASSERT_EQ(level0_files_cf1[0]->epoch_number,
              allow_ingest_behind
                  ? 1 + kReservedEpochNumberForFileIngestedBehind
                  : 1);
    ASSERT_EQ(level0_files_cf1[0]->num_entries, 1);
    ASSERT_TRUE(level0_files_cf1[0]->largest.user_key() == Slice("cf1_key1"));

    ASSERT_EQ(default_cf->GetNextEpochNumber(),
              allow_ingest_behind
                  ? 3 + kReservedEpochNumberForFileIngestedBehind
                  : 3);
    ASSERT_EQ(cf1->GetNextEpochNumber(),
              allow_ingest_behind
                  ? 2 + kReservedEpochNumberForFileIngestedBehind
                  : 2);

    // To verify epoch_number of files of different levels/CFs are
    // persisted and recovered correctly
    ReopenWithColumnFamilies({"default", "cf1"}, options);
    versions = dbfull()->GetVersionSet();
    assert(versions);
    default_cf = versions->GetColumnFamilySet()->GetDefault();
    cf1 = versions->GetColumnFamilySet()->GetColumnFamily("cf1");

    level0_files = GetLevelFileMetadatas(0 /* level*/);
    ASSERT_EQ(level0_files.size(), 1);
    EXPECT_EQ(level0_files[0]->epoch_number,
              allow_ingest_behind
                  ? 2 + kReservedEpochNumberForFileIngestedBehind
                  : 2);
    ASSERT_EQ(level0_files[0]->num_entries, 1);
    ASSERT_TRUE(level0_files[0]->largest.user_key() == Slice("key2"));

    level1_files = GetLevelFileMetadatas(1 /* level*/);
    ASSERT_EQ(level1_files.size(), 1);
    EXPECT_EQ(level1_files[0]->epoch_number,
              allow_ingest_behind
                  ? 1 + kReservedEpochNumberForFileIngestedBehind
                  : 1);
    ASSERT_EQ(level1_files[0]->num_entries, 1);
    ASSERT_TRUE(level1_files[0]->largest.user_key() == Slice("key1"));

    level0_files_cf1 = GetLevelFileMetadatas(0 /* level*/, 1 /* cf*/);
    ASSERT_EQ(level0_files_cf1.size(), 1);
    EXPECT_EQ(level0_files_cf1[0]->epoch_number,
              allow_ingest_behind
                  ? 1 + kReservedEpochNumberForFileIngestedBehind
                  : 1);
    ASSERT_EQ(level0_files_cf1[0]->num_entries, 1);
    ASSERT_TRUE(level0_files_cf1[0]->largest.user_key() == Slice("cf1_key1"));

    // To verify next epoch number is recovered correctly
    EXPECT_EQ(default_cf->GetNextEpochNumber(),
              allow_ingest_behind
                  ? 3 + kReservedEpochNumberForFileIngestedBehind
                  : 3);
    EXPECT_EQ(cf1->GetNextEpochNumber(),
              allow_ingest_behind
                  ? 2 + kReservedEpochNumberForFileIngestedBehind
                  : 2);
  }
}

TEST_F(DBTest2, RenameDirectory) {
  Options options = CurrentOptions();
  DestroyAndReopen(options);
  ASSERT_OK(Put("foo", "value0"));
  Close();
  auto old_dbname = dbname_;
  auto new_dbname = dbname_ + "_2";
  EXPECT_OK(env_->RenameFile(dbname_, new_dbname));
  options.create_if_missing = false;
  dbname_ = new_dbname;
  ASSERT_OK(TryReopen(options));
  ASSERT_EQ("value0", Get("foo"));
  Destroy(options);
  dbname_ = old_dbname;
}

TEST_F(DBTest2, SstUniqueIdVerifyBackwardCompatible) {
  const int kNumSst = 3;
  const int kLevel0Trigger = 4;
  auto options = CurrentOptions();
  options.level0_file_num_compaction_trigger = kLevel0Trigger;
  options.statistics = CreateDBStatistics();
  // verify_sst_unique_id_in_manifest (on by default since #10532) checks each
  // SST file's unique id against the manifest whenever the file is opened
  // through the table cache. Start with verification disabled.
  options.verify_sst_unique_id_in_manifest = false;
  Reopen(options);

  std::atomic_int skipped = 0;
  std::atomic_int passed = 0;
  SyncPoint::GetInstance()->SetCallBack(
      "BlockBasedTable::Open::SkippedVerifyUniqueId",
      [&](void* /*arg*/) { skipped++; });
  SyncPoint::GetInstance()->SetCallBack(
      "BlockBasedTable::Open::PassedVerifyUniqueId",
      [&](void* /*arg*/) { passed++; });
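  // These counters track how many table opens skip unique-id verification
  // versus how many actually verify it; they are checked below.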
  SyncPoint::GetInstance()->EnableProcessing();

  // generate a few SSTs
  for (int i = 0; i < kNumSst; i++) {
    for (int j = 0; j < 100; j++) {
      ASSERT_OK(Put(Key(i * 10 + j), "value"));
    }
    ASSERT_OK(Flush());
  }

  // Verification has been skipped on files so far
  EXPECT_EQ(skipped, kNumSst);
  EXPECT_EQ(passed, 0);

  // Reopen with verification
  options.verify_sst_unique_id_in_manifest = true;
  skipped = 0;
  passed = 0;
  Reopen(options);
  EXPECT_EQ(skipped, 0);
  EXPECT_EQ(passed, kNumSst);

  // Now simulate no unique id in manifest for next file
  // NOTE: this only works for loading manifest from disk,
  // not in-memory manifest, so we need to re-open below.
  SyncPoint::GetInstance()->SetCallBack(
      "VersionEdit::EncodeTo:UniqueId", [&](void* arg) {
        auto unique_id = static_cast<UniqueId64x2*>(arg);
        // remove id before writing it to manifest
        (*unique_id)[0] = 0;
        (*unique_id)[1] = 0;
      });

  // test compaction generated Sst
  for (int i = kNumSst; i < kLevel0Trigger; i++) {
    for (int j = 0; j < 100; j++) {
      ASSERT_OK(Put(Key(i * 10 + j), "value"));
    }
    ASSERT_OK(Flush());
  }
  ASSERT_OK(dbfull()->TEST_WaitForCompact());

  ASSERT_EQ("0,1", FilesPerLevel(0));
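
  // The L0 files were compacted into a single file at L1, and the callback
  // above stripped (zeroed) the unique ids written to the manifest for files
  // added after it was installed.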

  // Reopen (with verification)
  ASSERT_TRUE(options.verify_sst_unique_id_in_manifest);
  skipped = 0;
|
|
|
passed = 0;
|
2022-05-19 18:04:21 +00:00
|
|
|
Reopen(options);
|
2022-09-08 05:52:42 +00:00
|
|
|
EXPECT_EQ(skipped, 1);
|
|
|
|
EXPECT_EQ(passed, 0);
|
2022-05-19 18:04:21 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
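// Tamper with the db_session_id recorded in table properties at build time so
// that each new SST's computed unique ID no longer matches what the manifest
// records, then check that reopening with verification reports Corruption
// while reopening without it succeeds.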
TEST_F(DBTest2, SstUniqueIdVerify) {
|
|
|
|
const int kNumSst = 3;
|
|
|
|
const int kLevel0Trigger = 4;
|
|
|
|
auto options = CurrentOptions();
|
|
|
|
options.level0_file_num_compaction_trigger = kLevel0Trigger;
|
2022-09-08 05:52:42 +00:00
|
|
|
// Allow mismatch for now
|
|
|
|
options.verify_sst_unique_id_in_manifest = false;
|
|
|
|
Reopen(options);
|
2022-05-19 18:04:21 +00:00
|
|
|
|
|
|
|
SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"PropertyBlockBuilder::AddTableProperty:Start", [&](void* props_vs) {
|
|
|
|
auto props = static_cast<TableProperties*>(props_vs);
|
2022-09-08 05:52:42 +00:00
|
|
|
// update table property session_id to a different one, which
|
|
|
|
// changes unique ID
|
2022-05-19 18:04:21 +00:00
|
|
|
props->db_session_id = DBImpl::GenerateDbSessionId(nullptr);
|
|
|
|
});
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
|
|
|
|
// generate a few SSTs
|
|
|
|
for (int i = 0; i < kNumSst; i++) {
|
|
|
|
for (int j = 0; j < 100; j++) {
|
|
|
|
ASSERT_OK(Put(Key(i * 10 + j), "value"));
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
|
|
|
|
|
|
|
// Reopen with verification should report corruption
|
|
|
|
options.verify_sst_unique_id_in_manifest = true;
|
|
|
|
auto s = TryReopen(options);
|
|
|
|
ASSERT_TRUE(s.IsCorruption());
|
|
|
|
|
|
|
|
// Reopen without verification should be fine
|
|
|
|
options.verify_sst_unique_id_in_manifest = false;
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
// Test compaction-generated SSTs
|
|
|
|
for (int i = kNumSst; i < kLevel0Trigger; i++) {
|
|
|
|
for (int j = 0; j < 100; j++) {
|
|
|
|
ASSERT_OK(Put(Key(i * 10 + j), "value"));
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
|
|
|
ASSERT_OK(dbfull()->TEST_WaitForCompact());
|
|
|
|
|
|
|
|
ASSERT_EQ("0,1", FilesPerLevel(0));
|
|
|
|
|
|
|
|
// Reopen with verification should fail
|
|
|
|
options.verify_sst_unique_id_in_manifest = true;
|
|
|
|
s = TryReopen(options);
|
|
|
|
ASSERT_TRUE(s.IsCorruption());
|
2022-07-15 18:50:30 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
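// Same scenario as SstUniqueIdVerify, but the SSTs with tampered unique IDs
// are confined to one column family ("one"); reopening with verification
// should still report Corruption for the DB as a whole.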
TEST_F(DBTest2, SstUniqueIdVerifyMultiCFs) {
|
|
|
|
const int kNumSst = 3;
|
|
|
|
const int kLevel0Trigger = 4;
|
|
|
|
auto options = CurrentOptions();
|
|
|
|
options.level0_file_num_compaction_trigger = kLevel0Trigger;
|
2022-09-08 05:52:42 +00:00
|
|
|
// Allow mismatch for now
|
|
|
|
options.verify_sst_unique_id_in_manifest = false;
|
2022-07-15 18:50:30 +00:00
|
|
|
|
|
|
|
CreateAndReopenWithCF({"one", "two"}, options);
|
|
|
|
|
|
|
|
// generate good SSTs
|
|
|
|
for (int cf_num : {0, 2}) {
|
|
|
|
for (int i = 0; i < kNumSst; i++) {
|
|
|
|
for (int j = 0; j < 100; j++) {
|
|
|
|
ASSERT_OK(Put(cf_num, Key(i * 10 + j), "value"));
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush(cf_num));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// generate SSTs with bad unique id
|
|
|
|
SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"PropertyBlockBuilder::AddTableProperty:Start", [&](void* props_vs) {
|
|
|
|
auto props = static_cast<TableProperties*>(props_vs);
|
|
|
|
// update table property session_id to a different one
|
|
|
|
props->db_session_id = DBImpl::GenerateDbSessionId(nullptr);
|
|
|
|
});
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
for (int i = 0; i < kNumSst; i++) {
|
|
|
|
for (int j = 0; j < 100; j++) {
|
|
|
|
ASSERT_OK(Put(1, Key(i * 10 + j), "value"));
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush(1));
|
|
|
|
}
|
|
|
|
|
|
|
|
// Reopen with verification should report corruption
|
|
|
|
options.verify_sst_unique_id_in_manifest = true;
|
|
|
|
auto s = TryReopenWithColumnFamilies({"default", "one", "two"}, options);
|
|
|
|
ASSERT_TRUE(s.IsCorruption());
|
2022-05-19 18:04:21 +00:00
|
|
|
}
|
|
|
|
|
2022-11-23 06:53:31 +00:00
|
|
|
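// For each k, write num_l0 L0 files and tamper with the unique ID of the k-th
// one. A normal reopen with verification must fail with Corruption, while
// best-efforts recovery should open the DB with only the files written before
// the tampered one, and the DB must remain usable after a regular reopen.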
TEST_F(DBTest2, BestEffortsRecoveryWithSstUniqueIdVerification) {
|
|
|
|
const auto tamper_with_uniq_id = [&](void* arg) {
|
|
|
|
auto props = static_cast<TableProperties*>(arg);
|
|
|
|
assert(props);
|
|
|
|
// update table property session_id to a different one
|
|
|
|
props->db_session_id = DBImpl::GenerateDbSessionId(nullptr);
|
|
|
|
};
|
|
|
|
|
|
|
|
const auto assert_db = [&](size_t expected_count,
|
|
|
|
const std::string& expected_v) {
|
|
|
|
std::unique_ptr<Iterator> it(db_->NewIterator(ReadOptions()));
|
|
|
|
size_t cnt = 0;
|
|
|
|
for (it->SeekToFirst(); it->Valid(); it->Next(), ++cnt) {
|
|
|
|
ASSERT_EQ(std::to_string(cnt), it->key());
|
|
|
|
ASSERT_EQ(expected_v, it->value());
|
|
|
|
}
|
2023-10-18 16:38:38 +00:00
|
|
|
EXPECT_OK(it->status());
|
2022-11-23 06:53:31 +00:00
|
|
|
ASSERT_EQ(expected_count, cnt);
|
|
|
|
};
|
|
|
|
|
|
|
|
const int num_l0_compaction_trigger = 8;
|
|
|
|
const int num_l0 = num_l0_compaction_trigger - 1;
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.level0_file_num_compaction_trigger = num_l0_compaction_trigger;
|
|
|
|
|
|
|
|
for (int k = 0; k < num_l0; ++k) {
|
|
|
|
// Allow mismatch for now
|
|
|
|
options.verify_sst_unique_id_in_manifest = false;
|
|
|
|
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
|
|
|
|
constexpr size_t num_keys_per_file = 10;
|
|
|
|
for (int i = 0; i < num_l0; ++i) {
|
|
|
|
for (size_t j = 0; j < num_keys_per_file; ++j) {
|
|
|
|
ASSERT_OK(Put(std::to_string(j), "v" + std::to_string(i)));
|
|
|
|
}
|
|
|
|
if (i == k) {
|
|
|
|
SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"PropertyBlockBuilder::AddTableProperty:Start",
|
|
|
|
tamper_with_uniq_id);
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
}
|
|
|
|
|
|
|
|
options.verify_sst_unique_id_in_manifest = true;
|
|
|
|
Status s = TryReopen(options);
|
|
|
|
ASSERT_TRUE(s.IsCorruption());
|
|
|
|
|
|
|
|
options.best_efforts_recovery = true;
|
|
|
|
Reopen(options);
|
|
|
|
assert_db(k == 0 ? 0 : num_keys_per_file, "v" + std::to_string(k - 1));
|
|
|
|
|
|
|
|
// Reopen with regular recovery
|
|
|
|
options.best_efforts_recovery = false;
|
|
|
|
Reopen(options);
|
|
|
|
assert_db(k == 0 ? 0 : num_keys_per_file, "v" + std::to_string(k - 1));
|
|
|
|
|
|
|
|
SyncPoint::GetInstance()->DisableProcessing();
|
|
|
|
SyncPoint::GetInstance()->ClearAllCallBacks();
|
|
|
|
|
|
|
|
for (size_t i = 0; i < num_keys_per_file; ++i) {
|
|
|
|
ASSERT_OK(Put(std::to_string(i), "v"));
|
|
|
|
}
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
Reopen(options);
|
|
|
|
{
|
|
|
|
for (size_t i = 0; i < num_keys_per_file; ++i) {
|
|
|
|
ASSERT_EQ("v", Get(std::to_string(i)));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-11-15 20:50:42 +00:00
|
|
|
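// Write keys with a fixed user-defined timestamp, flush, and confirm that
// GetLatestSequenceForKey can return that timestamp from the in-memory write
// history alone (cache_only == true), without reading any SST files
// (GET_HIT_L0 stays at 0).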
TEST_F(DBTest2, GetLatestSeqAndTsForKey) {
|
|
|
|
Destroy(last_options_);
|
|
|
|
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.max_write_buffer_size_to_maintain = 64 << 10;
|
|
|
|
options.create_if_missing = true;
|
|
|
|
options.disable_auto_compactions = true;
|
2022-02-08 20:14:25 +00:00
|
|
|
options.comparator = test::BytewiseComparatorWithU64TsWrapper();
|
2021-11-15 20:50:42 +00:00
|
|
|
options.statistics = CreateDBStatistics();
|
|
|
|
|
|
|
|
Reopen(options);
|
|
|
|
|
|
|
|
constexpr uint64_t kTsU64Value = 12;
|
|
|
|
|
|
|
|
for (uint64_t key = 0; key < 100; ++key) {
|
Revise APIs related to user-defined timestamp (#8946)
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me defer it to another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
This set of APIs is currently not supported in `WriteBatchWithIndex` and is on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method requires that all keys already have space
allocated for timestamps and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
2022-02-02 06:17:46 +00:00
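A minimal sketch of the resulting write path, assuming a DB opened with a
comparator that carries a 64-bit timestamp (such as the
`test::BytewiseComparatorWithU64TsWrapper()` used in this test); the helper
names are illustrative, and the argument order follows the overload exercised
below:
```
// Sketch (names resolve inside ROCKSDB_NAMESPACE, as in this test file):
// write a key with an explicit user-defined timestamp via the new
// Put(write_opts, key, ts, value) overload.
std::string EncodeU64Ts(uint64_t ts_val) {
  std::string ts;
  PutFixed64(&ts, ts_val);  // same fixed-width encoding the test uses
  return ts;
}

Status PutWithU64Ts(DB* db, const Slice& key, uint64_t ts_val,
                    const Slice& value) {
  return db->Put(WriteOptions(), key, EncodeU64Ts(ts_val), value);
}
```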
|
|
|
std::string ts;
|
|
|
|
PutFixed64(&ts, kTsU64Value);
|
2021-11-15 20:50:42 +00:00
|
|
|
|
|
|
|
std::string key_str;
|
|
|
|
PutFixed64(&key_str, key);
|
|
|
|
std::reverse(key_str.begin(), key_str.end());
|
2022-02-02 06:17:46 +00:00
|
|
|
ASSERT_OK(db_->Put(WriteOptions(), key_str, ts, "value"));
|
2021-11-15 20:50:42 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
|
|
|
|
constexpr bool cache_only = true;
|
|
|
|
constexpr SequenceNumber lower_bound_seq = 0;
|
|
|
|
auto* cfhi = static_cast_with_check<ColumnFamilyHandleImpl>(
|
|
|
|
dbfull()->DefaultColumnFamily());
|
|
|
|
assert(cfhi);
|
|
|
|
assert(cfhi->cfd());
|
|
|
|
SuperVersion* sv = cfhi->cfd()->GetSuperVersion();
|
|
|
|
for (uint64_t key = 0; key < 100; ++key) {
|
|
|
|
std::string key_str;
|
|
|
|
PutFixed64(&key_str, key);
|
|
|
|
std::reverse(key_str.begin(), key_str.end());
|
|
|
|
std::string ts;
|
|
|
|
SequenceNumber seq = kMaxSequenceNumber;
|
|
|
|
bool found_record_for_key = false;
|
|
|
|
bool is_blob_index = false;
|
|
|
|
|
|
|
|
const Status s = dbfull()->GetLatestSequenceForKey(
|
|
|
|
sv, key_str, cache_only, lower_bound_seq, &seq, &ts,
|
|
|
|
&found_record_for_key, &is_blob_index);
|
|
|
|
ASSERT_OK(s);
|
|
|
|
std::string expected_ts;
|
|
|
|
PutFixed64(&expected_ts, kTsU64Value);
|
|
|
|
ASSERT_EQ(expected_ts, ts);
|
|
|
|
ASSERT_TRUE(found_record_for_key);
|
|
|
|
ASSERT_FALSE(is_blob_index);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Verify that there were no reads to SST files.
|
|
|
|
ASSERT_EQ(0, options.statistics->getTickerCount(GET_HIT_L0));
|
|
|
|
}
|
|
|
|
|
Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666)
Summary:
Optionally enable zstd checksum flag (https://github.com/facebook/zstd/blob/d857369028d997c92ff1f1861a4d7f679a125464/lib/zstd.h#L428) to detect corruption during decompression. Main changes are in compression.h:
* User can set CompressionOptions::checksum to true to enable this feature.
* We enable this feature in ZSTD by setting the checksum flag in ZSTD compression context: `ZSTD_CCtx`.
* Uses `ZSTD_compress2()` to do compression since it supports frame parameters like the checksum flag. The compression level is also set in the compression context as a flag.
* Error handling during decompression to propagate error message from ZSTD.
* Updated microbench to test read performance impact.
About compatibility, the current compression decoders should continue to work with the data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666
Test Plan:
* Existing unit tests for zstd compression
* Add unit test `DBTest2.ZSTDChecksum` to test the corruption case
* Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API.
* Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work.
* Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2.
* Perf impact: from public benchmark data (http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksumming and https://github.com/facebook/zstd#benchmarks for ZSTD), if decompression runs at 1700 MB/s and checksum computation at 70000 MB/s, checksumming adds ~2.4% to decompression time. Compression is slower than decompression, so the checksumming overhead should be even less noticeable there.
* Microbench:
```
TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100
Min out of 100 runs:
Main:
10390 10436 10456 10484 10499 10535 10544 10545 10565 10568
After this PR, checksum=false
10285 10397 10503 10508 10515 10557 10562 10635 10640 10660
After this PR, checksum=true
10827 10876 10925 10949 10971 11052 11061 11063 11100 11109
```
* db_bench:
```
Write perf
TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=..
[FillSeq checksum=0]
fillseq [AVG 10 runs] : 281635 (± 31711) ops/sec; 31.2 (± 3.5) MB/sec
fillseq [MEDIAN 10 runs] : 294027 ops/sec; 32.5 MB/sec
[FillSeq checksum=1]
fillseq [AVG 10 runs] : 286961 (± 34700) ops/sec; 31.7 (± 3.8) MB/sec
fillseq [MEDIAN 10 runs] : 283278 ops/sec; 31.3 MB/sec
Read perf
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1
[Readrandom checksum=1]
readrandom [AVG 20 runs] : 360928 (± 3579) ops/sec; 4.0 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 362468 ops/sec; 4.0 MB/sec
[Readrandom checksum=0]
readrandom [AVG 20 runs] : 380365 (± 2384) ops/sec; 4.2 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 379800 ops/sec; 4.2 MB/sec
Compression
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1
checksum=1
compress [AVG 20 runs] : 54074 (± 634) ops/sec; 211.2 (± 2.5) MB/sec
compress [MEDIAN 20 runs] : 54396 ops/sec; 212.5 MB/sec
checksum=0
compress [AVG 20 runs] : 54598 (± 393) ops/sec; 213.3 (± 1.5) MB/sec
compress [MEDIAN 20 runs] : 54592 ops/sec; 213.3 MB/sec
Decompression:
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] --compression_type=zstd --compression_checksum=1
checksum = 0
uncompress [AVG 20 runs] : 167499 (± 962) ops/sec; 654.3 (± 3.8) MB/sec
uncompress [MEDIAN 20 runs] : 167210 ops/sec; 653.2 MB/sec
checksum = 1
uncompress [AVG 20 runs] : 167980 (± 924) ops/sec; 656.2 (± 3.6) MB/sec
uncompress [MEDIAN 20 runs] : 168465 ops/sec; 658.1 MB/sec
```
Reviewed By: ajkr
Differential Revision: D48019378
Pulled By: cbi42
fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba
2023-08-18 22:01:59 +00:00
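A minimal sketch of how the feature is enabled, assuming a sufficiently recent
ZSTD build (the test below is guarded by `ZSTD_ADVANCED`); these are the same
two fields the test sets:
```
// Sketch: enable ZSTD compression with frame checksums so that corruption is
// detected at decompression time.
rocksdb::Options options;
options.compression = rocksdb::kZSTD;
options.compression_opts.checksum = true;  // flag added by this change
```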
|
|
|
#if defined(ZSTD_ADVANCED)
|
|
|
|
TEST_F(DBTest2, ZSTDChecksum) {
|
|
|
|
// Verify that corruption during decompression is caught.
|
|
|
|
Options options = CurrentOptions();
|
|
|
|
options.create_if_missing = true;
|
|
|
|
options.compression = kZSTD;
|
|
|
|
options.compression_opts.max_compressed_bytes_per_kb = 1024;
|
|
|
|
options.compression_opts.checksum = true;
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
Random rnd(33);
|
|
|
|
ASSERT_OK(Put(Key(0), rnd.RandomString(4 << 10)));
|
|
|
|
SyncPoint::GetInstance()->SetCallBack(
|
|
|
|
"BlockBasedTableBuilder::WriteBlock:TamperWithCompressedData",
|
|
|
|
[&](void* arg) {
|
|
|
|
std::string* output = static_cast<std::string*>(arg);
|
|
|
|
// https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#zstandard-frames
|
|
|
|
// Checksum is the last 4 bytes, so corrupting that part in a unit test is
|
|
|
|
// more controllable.
|
|
|
|
output->data()[output->size() - 1]++;
|
|
|
|
});
|
|
|
|
SyncPoint::GetInstance()->EnableProcessing();
|
|
|
|
ASSERT_OK(Flush());
|
|
|
|
PinnableSlice val;
|
|
|
|
Status s = Get(Key(0), &val);
|
|
|
|
ASSERT_TRUE(s.IsCorruption());
|
|
|
|
|
|
|
|
// Corruption caught during flush.
|
|
|
|
options.paranoid_file_checks = true;
|
|
|
|
DestroyAndReopen(options);
|
|
|
|
ASSERT_OK(Put(Key(0), rnd.RandomString(4 << 10)));
|
|
|
|
s = Flush();
|
|
|
|
ASSERT_TRUE(s.IsCorruption());
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|
2016-03-01 02:38:03 +00:00
|
|
|
|
|
|
|
int main(int argc, char** argv) {
|
2020-02-20 20:07:53 +00:00
|
|
|
ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();
|
2016-03-01 02:38:03 +00:00
|
|
|
::testing::InitGoogleTest(&argc, argv);
|
2019-08-09 22:08:36 +00:00
|
|
|
RegisterCustomObjects(argc, argv);
|
2016-03-01 02:38:03 +00:00
|
|
|
return RUN_ALL_TESTS();
|
|
|
|
}
|