rocksdb/db/internal_stats.cc

2202 lines
87 KiB
C++
Raw Normal View History

// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "db/internal_stats.h"
Make the benchmark scripts configurable and add tests Summary: This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests, makes the duration per-test configurable and refactors the test scripts. Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except the writer thread does Merge rather than Put. Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format. Sometimes automation and humans want different format. Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary stats in the output at test end. Adds the average ingest rate to compaction IO stats. Output now looks like: https://gist.github.com/mdcallag/2bd64d18be1b93adc494 More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1 For benchmark.sh changes default RocksDB configuration to reduce stalls: * min_level_to_compress from 2 to 3 * hard_rate_limit from 2 to 3 * max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8 * L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop) Task ID: #6596829 Blame Rev: Test Plan: run tools/run_flash_bench.sh Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin *PUBLIC* platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 18:28:25 +00:00
#include <algorithm>
#include <cinttypes>
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
#include <cstddef>
#include <limits>
Introduce a blob file reader class (#7461) Summary: The patch adds a class called `BlobFileReader` that can be used to retrieve blobs using the information available in blob references (e.g. blob file number, offset, and size). This will come in handy when implementing blob support for `Get`, `MultiGet`, and iterators, and also for compaction/garbage collection. When a `BlobFileReader` object is created (using the factory method `Create`), it first checks whether the specified file is potentially valid by comparing the file size against the combined size of the blob file header and footer (files smaller than the threshold are considered malformed). Then, it opens the file, and reads and verifies the header and footer. The verification involves magic number/CRC checks as well as checking for unexpected header/footer fields, e.g. incorrect column family ID or TTL blob files. Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression type passed by the caller (because of the presence of the header and footer, the specified offset cannot be too close to the start/end of the file; also, the compression type has to match the one in the blob file header), and retrieves and potentially verifies and uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set, `BlobFileReader` reads the blob record header as well (as opposed to just the blob itself) and verifies the key/value size, the key itself, as well as the CRC of the blob record header and the key/value pair. In addition, the patch exposes the compression type from `BlobIndex` (both using an accessor and via `DebugString`), and adds a blob file read latency histogram to `InternalStats` that can be used with `BlobFileReader`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23999219 Pulled By: ltamasi fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
2020-10-07 22:43:23 +00:00
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
#include "cache/cache_entry_roles.h"
#include "cache/cache_entry_stats.h"
#include "db/column_family.h"
#include "db/db_impl/db_impl.h"
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
#include "db/write_stall_stats.h"
#include "port/port.h"
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
#include "rocksdb/system_clock.h"
#include "rocksdb/table.h"
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
#include "table/block_based/cachable_entry.h"
Meta-internal folly integration with F14FastMap (#9546) Summary: Especially after updating to C++17, I don't see a compelling case for *requiring* any folly components in RocksDB. I was able to purge the existing hard dependencies, and it can be quite difficult to strip out non-trivial components from folly for use in RocksDB. (The prospect of doing that on F14 has changed my mind on the best approach here.) But this change creates an optional integration where we can plug in components from folly at compile time, starting here with F14FastMap to replace std::unordered_map when possible (probably no public APIs for example). I have replaced the biggest CPU users of std::unordered_map with compile-time pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set. USE_FOLLY is always set in the Meta-internal buck build, and a simulation of that is in the Makefile for public CI testing. A full folly build is not needed, but checking out the full folly repo is much simpler for getting the dependency, and anything else we might want to optionally integrate in the future. Some picky details: * I don't think the distributed mutex stuff is actually used, so it was easy to remove. * I implemented an alternative to `folly::constexpr_log2` (which is much easier in C++17 than C++11) so that I could pull out the hard dependencies on `ConstexprMath.h` * I had to add noexcept move constructors/operators to some types to make F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a macro to make that easier in some common cases. * Updated Meta-internal buck build to use folly F14Map (always) No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a production integration for open source users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546 Test Plan: CircleCI tests updated so that a couple of them use folly. Most internal unit & stress/crash tests updated to use Meta-internal latest folly. (Note: they should probably use buck but they currently use Makefile.) Example performance improvement: when filter partitions are pinned in cache, they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build a test that exercises that heavily. Build DB with ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters ``` and test with (simultaneous runs with & without folly, ~20 times each to see convergence) ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache ``` Average ops/s no folly: 26229.2 Average ops/s with folly: 26853.3 (+2.4%) Reviewed By: ajkr Differential Revision: D34181736 Pulled By: pdillinger fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
2022-04-13 14:34:01 +00:00
#include "util/hash_containers.h"
#include "util/string_util.h"
namespace ROCKSDB_NAMESPACE {
const std::map<LevelStatType, LevelStat> InternalStats::compaction_level_stats =
{
{LevelStatType::NUM_FILES, LevelStat{"NumFiles", "Files"}},
{LevelStatType::COMPACTED_FILES,
LevelStat{"CompactedFiles", "CompactedFiles"}},
{LevelStatType::SIZE_BYTES, LevelStat{"SizeBytes", "Size"}},
{LevelStatType::SCORE, LevelStat{"Score", "Score"}},
{LevelStatType::READ_GB, LevelStat{"ReadGB", "Read(GB)"}},
{LevelStatType::RN_GB, LevelStat{"RnGB", "Rn(GB)"}},
{LevelStatType::RNP1_GB, LevelStat{"Rnp1GB", "Rnp1(GB)"}},
{LevelStatType::WRITE_GB, LevelStat{"WriteGB", "Write(GB)"}},
{LevelStatType::W_NEW_GB, LevelStat{"WnewGB", "Wnew(GB)"}},
{LevelStatType::MOVED_GB, LevelStat{"MovedGB", "Moved(GB)"}},
{LevelStatType::WRITE_AMP, LevelStat{"WriteAmp", "W-Amp"}},
{LevelStatType::READ_MBPS, LevelStat{"ReadMBps", "Rd(MB/s)"}},
{LevelStatType::WRITE_MBPS, LevelStat{"WriteMBps", "Wr(MB/s)"}},
{LevelStatType::COMP_SEC, LevelStat{"CompSec", "Comp(sec)"}},
{LevelStatType::COMP_CPU_SEC,
LevelStat{"CompMergeCPU", "CompMergeCPU(sec)"}},
{LevelStatType::COMP_COUNT, LevelStat{"CompCount", "Comp(cnt)"}},
{LevelStatType::AVG_SEC, LevelStat{"AvgSec", "Avg(sec)"}},
{LevelStatType::KEY_IN, LevelStat{"KeyIn", "KeyIn"}},
{LevelStatType::KEY_DROP, LevelStat{"KeyDrop", "KeyDrop"}},
{LevelStatType::R_BLOB_GB, LevelStat{"RblobGB", "Rblob(GB)"}},
{LevelStatType::W_BLOB_GB, LevelStat{"WblobGB", "Wblob(GB)"}},
};
const std::map<InternalStats::InternalDBStatsType, DBStatInfo>
InternalStats::db_stats_type_to_info = {
{InternalStats::kIntStatsWalFileBytes,
DBStatInfo{"db.wal_bytes_written"}},
{InternalStats::kIntStatsWalFileSynced, DBStatInfo{"db.wal_syncs"}},
{InternalStats::kIntStatsBytesWritten,
DBStatInfo{"db.user_bytes_written"}},
{InternalStats::kIntStatsNumKeysWritten,
DBStatInfo{"db.user_keys_written"}},
{InternalStats::kIntStatsWriteDoneByOther,
DBStatInfo{"db.user_writes_by_other"}},
{InternalStats::kIntStatsWriteDoneBySelf,
DBStatInfo{"db.user_writes_by_self"}},
{InternalStats::kIntStatsWriteWithWal,
DBStatInfo{"db.user_writes_with_wal"}},
{InternalStats::kIntStatsWriteStallMicros,
DBStatInfo{"db.user_write_stall_micros"}},
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
{InternalStats::kIntStatsWriteBufferManagerLimitStopsCounts,
DBStatInfo{WriteStallStatsMapKeys::CauseConditionCount(
WriteStallCause::kWriteBufferManagerLimit,
WriteStallCondition::kStopped)}},
};
namespace {
const double kMB = 1048576.0;
const double kGB = kMB * 1024;
const double kMicrosInSec = 1000000.0;
void PrintLevelStatsHeader(char* buf, size_t len, const std::string& cf_name,
const std::string& group_by) {
int written_size =
snprintf(buf, len, "\n** Compaction Stats [%s] **\n", cf_name.c_str());
written_size = std::min(written_size, static_cast<int>(len));
auto hdr = [](LevelStatType t) {
return InternalStats::compaction_level_stats.at(t).header_name.c_str();
};
int line_size = snprintf(
buf + written_size, len - written_size,
"%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s "
"%s\n",
// Note that we skip COMPACTED_FILES and merge it with Files column
group_by.c_str(), hdr(LevelStatType::NUM_FILES),
hdr(LevelStatType::SIZE_BYTES), hdr(LevelStatType::SCORE),
hdr(LevelStatType::READ_GB), hdr(LevelStatType::RN_GB),
hdr(LevelStatType::RNP1_GB), hdr(LevelStatType::WRITE_GB),
hdr(LevelStatType::W_NEW_GB), hdr(LevelStatType::MOVED_GB),
hdr(LevelStatType::WRITE_AMP), hdr(LevelStatType::READ_MBPS),
hdr(LevelStatType::WRITE_MBPS), hdr(LevelStatType::COMP_SEC),
hdr(LevelStatType::COMP_CPU_SEC), hdr(LevelStatType::COMP_COUNT),
hdr(LevelStatType::AVG_SEC), hdr(LevelStatType::KEY_IN),
hdr(LevelStatType::KEY_DROP), hdr(LevelStatType::R_BLOB_GB),
hdr(LevelStatType::W_BLOB_GB));
written_size += line_size;
written_size = std::min(written_size, static_cast<int>(len));
snprintf(buf + written_size, len - written_size, "%s\n",
std::string(line_size, '-').c_str());
}
void PrepareLevelStats(std::map<LevelStatType, double>* level_stats,
int num_files, int being_compacted,
double total_file_size, double score, double w_amp,
const InternalStats::CompactionStats& stats) {
const uint64_t bytes_read = stats.bytes_read_non_output_levels +
stats.bytes_read_output_level +
stats.bytes_read_blob;
const uint64_t bytes_written = stats.bytes_written + stats.bytes_written_blob;
const int64_t bytes_new = stats.bytes_written - stats.bytes_read_output_level;
const double elapsed = (stats.micros + 1) / kMicrosInSec;
(*level_stats)[LevelStatType::NUM_FILES] = num_files;
(*level_stats)[LevelStatType::COMPACTED_FILES] = being_compacted;
(*level_stats)[LevelStatType::SIZE_BYTES] = total_file_size;
(*level_stats)[LevelStatType::SCORE] = score;
(*level_stats)[LevelStatType::READ_GB] = bytes_read / kGB;
(*level_stats)[LevelStatType::RN_GB] =
stats.bytes_read_non_output_levels / kGB;
(*level_stats)[LevelStatType::RNP1_GB] = stats.bytes_read_output_level / kGB;
(*level_stats)[LevelStatType::WRITE_GB] = stats.bytes_written / kGB;
(*level_stats)[LevelStatType::W_NEW_GB] = bytes_new / kGB;
(*level_stats)[LevelStatType::MOVED_GB] = stats.bytes_moved / kGB;
(*level_stats)[LevelStatType::WRITE_AMP] = w_amp;
(*level_stats)[LevelStatType::READ_MBPS] = bytes_read / kMB / elapsed;
(*level_stats)[LevelStatType::WRITE_MBPS] = bytes_written / kMB / elapsed;
(*level_stats)[LevelStatType::COMP_SEC] = stats.micros / kMicrosInSec;
(*level_stats)[LevelStatType::COMP_CPU_SEC] = stats.cpu_micros / kMicrosInSec;
(*level_stats)[LevelStatType::COMP_COUNT] = stats.count;
(*level_stats)[LevelStatType::AVG_SEC] =
stats.count == 0 ? 0 : stats.micros / kMicrosInSec / stats.count;
(*level_stats)[LevelStatType::KEY_IN] =
static_cast<double>(stats.num_input_records);
(*level_stats)[LevelStatType::KEY_DROP] =
static_cast<double>(stats.num_dropped_records);
(*level_stats)[LevelStatType::R_BLOB_GB] = stats.bytes_read_blob / kGB;
(*level_stats)[LevelStatType::W_BLOB_GB] = stats.bytes_written_blob / kGB;
}
void PrintLevelStats(char* buf, size_t len, const std::string& name,
const std::map<LevelStatType, double>& stat_value) {
snprintf(
buf, len,
"%4s " /* Level */
"%6d/%-3d " /* Files */
"%8s " /* Size */
"%5.1f " /* Score */
"%8.1f " /* Read(GB) */
"%7.1f " /* Rn(GB) */
"%8.1f " /* Rnp1(GB) */
"%9.1f " /* Write(GB) */
"%8.1f " /* Wnew(GB) */
"%9.1f " /* Moved(GB) */
"%5.1f " /* W-Amp */
"%8.1f " /* Rd(MB/s) */
"%8.1f " /* Wr(MB/s) */
"%9.2f " /* Comp(sec) */
"%17.2f " /* CompMergeCPU(sec) */
"%9d " /* Comp(cnt) */
"%8.3f " /* Avg(sec) */
"%7s " /* KeyIn */
"%6s " /* KeyDrop */
"%9.1f " /* Rblob(GB) */
"%9.1f\n", /* Wblob(GB) */
name.c_str(), static_cast<int>(stat_value.at(LevelStatType::NUM_FILES)),
static_cast<int>(stat_value.at(LevelStatType::COMPACTED_FILES)),
BytesToHumanString(
static_cast<uint64_t>(stat_value.at(LevelStatType::SIZE_BYTES)))
.c_str(),
stat_value.at(LevelStatType::SCORE),
stat_value.at(LevelStatType::READ_GB),
stat_value.at(LevelStatType::RN_GB),
stat_value.at(LevelStatType::RNP1_GB),
stat_value.at(LevelStatType::WRITE_GB),
stat_value.at(LevelStatType::W_NEW_GB),
stat_value.at(LevelStatType::MOVED_GB),
stat_value.at(LevelStatType::WRITE_AMP),
stat_value.at(LevelStatType::READ_MBPS),
stat_value.at(LevelStatType::WRITE_MBPS),
stat_value.at(LevelStatType::COMP_SEC),
stat_value.at(LevelStatType::COMP_CPU_SEC),
static_cast<int>(stat_value.at(LevelStatType::COMP_COUNT)),
stat_value.at(LevelStatType::AVG_SEC),
NumberToHumanString(
static_cast<std::int64_t>(stat_value.at(LevelStatType::KEY_IN)))
.c_str(),
NumberToHumanString(
static_cast<std::int64_t>(stat_value.at(LevelStatType::KEY_DROP)))
.c_str(),
stat_value.at(LevelStatType::R_BLOB_GB),
stat_value.at(LevelStatType::W_BLOB_GB));
}
void PrintLevelStats(char* buf, size_t len, const std::string& name,
int num_files, int being_compacted, double total_file_size,
double score, double w_amp,
const InternalStats::CompactionStats& stats) {
std::map<LevelStatType, double> level_stats;
PrepareLevelStats(&level_stats, num_files, being_compacted, total_file_size,
score, w_amp, stats);
PrintLevelStats(buf, len, name, level_stats);
}
// Assumes that trailing numbers represent an optional argument. This requires
// property names to not end with numbers.
std::pair<Slice, Slice> GetPropertyNameAndArg(const Slice& property) {
Slice name = property, arg = property;
size_t sfx_len = 0;
while (sfx_len < property.size() &&
isdigit(property[property.size() - sfx_len - 1])) {
++sfx_len;
}
name.remove_suffix(sfx_len);
arg.remove_prefix(property.size() - sfx_len);
return {name, arg};
}
} // anonymous namespace
static const std::string rocksdb_prefix = "rocksdb.";
static const std::string num_files_at_level_prefix = "num-files-at-level";
static const std::string compression_ratio_at_level_prefix =
"compression-ratio-at-level";
static const std::string allstats = "stats";
static const std::string sstables = "sstables";
static const std::string cfstats = "cfstats";
static const std::string cfstats_no_file_histogram =
"cfstats-no-file-histogram";
static const std::string cf_file_histogram = "cf-file-histogram";
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
static const std::string cf_write_stall_stats = "cf-write-stall-stats";
static const std::string dbstats = "dbstats";
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
static const std::string db_write_stall_stats = "db-write-stall-stats";
static const std::string levelstats = "levelstats";
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
static const std::string block_cache_entry_stats = "block-cache-entry-stats";
static const std::string fast_block_cache_entry_stats =
"fast-block-cache-entry-stats";
static const std::string num_immutable_mem_table = "num-immutable-mem-table";
Support saving history in memtable_list Summary: For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit. After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much. This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit). However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions. Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here. Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37443
2015-05-28 23:34:24 +00:00
static const std::string num_immutable_mem_table_flushed =
"num-immutable-mem-table-flushed";
static const std::string mem_table_flush_pending = "mem-table-flush-pending";
static const std::string compaction_pending = "compaction-pending";
static const std::string background_errors = "background-errors";
static const std::string cur_size_active_mem_table =
"cur-size-active-mem-table";
static const std::string cur_size_all_mem_tables = "cur-size-all-mem-tables";
static const std::string size_all_mem_tables = "size-all-mem-tables";
static const std::string num_entries_active_mem_table =
"num-entries-active-mem-table";
static const std::string num_entries_imm_mem_tables =
"num-entries-imm-mem-tables";
static const std::string num_deletes_active_mem_table =
"num-deletes-active-mem-table";
static const std::string num_deletes_imm_mem_tables =
"num-deletes-imm-mem-tables";
static const std::string estimate_num_keys = "estimate-num-keys";
static const std::string estimate_table_readers_mem =
"estimate-table-readers-mem";
static const std::string is_file_deletions_enabled =
"is-file-deletions-enabled";
static const std::string num_snapshots = "num-snapshots";
static const std::string oldest_snapshot_time = "oldest-snapshot-time";
static const std::string oldest_snapshot_sequence = "oldest-snapshot-sequence";
static const std::string num_live_versions = "num-live-versions";
static const std::string current_version_number =
"current-super-version-number";
static const std::string estimate_live_data_size = "estimate-live-data-size";
static const std::string min_log_number_to_keep_str = "min-log-number-to-keep";
static const std::string min_obsolete_sst_number_to_keep_str =
"min-obsolete-sst-number-to-keep";
static const std::string base_level_str = "base-level";
static const std::string total_sst_files_size = "total-sst-files-size";
static const std::string live_sst_files_size = "live-sst-files-size";
static const std::string obsolete_sst_files_size = "obsolete-sst-files-size";
static const std::string live_sst_files_size_at_temperature =
"live-sst-files-size-at-temperature";
static const std::string estimate_pending_comp_bytes =
"estimate-pending-compaction-bytes";
static const std::string aggregated_table_properties =
"aggregated-table-properties";
static const std::string aggregated_table_properties_at_level =
aggregated_table_properties + "-at-level";
static const std::string num_running_compactions = "num-running-compactions";
static const std::string num_running_flushes = "num-running-flushes";
static const std::string actual_delayed_write_rate =
"actual-delayed-write-rate";
static const std::string is_write_stopped = "is-write-stopped";
static const std::string estimate_oldest_key_time = "estimate-oldest-key-time";
static const std::string block_cache_capacity = "block-cache-capacity";
static const std::string block_cache_usage = "block-cache-usage";
static const std::string block_cache_pinned_usage = "block-cache-pinned-usage";
static const std::string options_statistics = "options-statistics";
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
static const std::string num_blob_files = "num-blob-files";
static const std::string blob_stats = "blob-stats";
static const std::string total_blob_file_size = "total-blob-file-size";
static const std::string live_blob_file_size = "live-blob-file-size";
static const std::string live_blob_file_garbage_size =
"live-blob-file-garbage-size";
static const std::string blob_cache_capacity = "blob-cache-capacity";
static const std::string blob_cache_usage = "blob-cache-usage";
static const std::string blob_cache_pinned_usage = "blob-cache-pinned-usage";
const std::string DB::Properties::kNumFilesAtLevelPrefix =
rocksdb_prefix + num_files_at_level_prefix;
const std::string DB::Properties::kCompressionRatioAtLevelPrefix =
rocksdb_prefix + compression_ratio_at_level_prefix;
const std::string DB::Properties::kStats = rocksdb_prefix + allstats;
const std::string DB::Properties::kSSTables = rocksdb_prefix + sstables;
const std::string DB::Properties::kCFStats = rocksdb_prefix + cfstats;
const std::string DB::Properties::kCFStatsNoFileHistogram =
rocksdb_prefix + cfstats_no_file_histogram;
const std::string DB::Properties::kCFFileHistogram =
rocksdb_prefix + cf_file_histogram;
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
const std::string DB::Properties::kCFWriteStallStats =
rocksdb_prefix + cf_write_stall_stats;
const std::string DB::Properties::kDBWriteStallStats =
rocksdb_prefix + db_write_stall_stats;
const std::string DB::Properties::kDBStats = rocksdb_prefix + dbstats;
const std::string DB::Properties::kLevelStats = rocksdb_prefix + levelstats;
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
const std::string DB::Properties::kBlockCacheEntryStats =
rocksdb_prefix + block_cache_entry_stats;
const std::string DB::Properties::kFastBlockCacheEntryStats =
rocksdb_prefix + fast_block_cache_entry_stats;
const std::string DB::Properties::kNumImmutableMemTable =
rocksdb_prefix + num_immutable_mem_table;
const std::string DB::Properties::kNumImmutableMemTableFlushed =
rocksdb_prefix + num_immutable_mem_table_flushed;
const std::string DB::Properties::kMemTableFlushPending =
rocksdb_prefix + mem_table_flush_pending;
const std::string DB::Properties::kCompactionPending =
rocksdb_prefix + compaction_pending;
const std::string DB::Properties::kNumRunningCompactions =
rocksdb_prefix + num_running_compactions;
const std::string DB::Properties::kNumRunningFlushes =
rocksdb_prefix + num_running_flushes;
const std::string DB::Properties::kBackgroundErrors =
rocksdb_prefix + background_errors;
const std::string DB::Properties::kCurSizeActiveMemTable =
rocksdb_prefix + cur_size_active_mem_table;
const std::string DB::Properties::kCurSizeAllMemTables =
rocksdb_prefix + cur_size_all_mem_tables;
const std::string DB::Properties::kSizeAllMemTables =
rocksdb_prefix + size_all_mem_tables;
const std::string DB::Properties::kNumEntriesActiveMemTable =
rocksdb_prefix + num_entries_active_mem_table;
const std::string DB::Properties::kNumEntriesImmMemTables =
rocksdb_prefix + num_entries_imm_mem_tables;
const std::string DB::Properties::kNumDeletesActiveMemTable =
rocksdb_prefix + num_deletes_active_mem_table;
const std::string DB::Properties::kNumDeletesImmMemTables =
rocksdb_prefix + num_deletes_imm_mem_tables;
const std::string DB::Properties::kEstimateNumKeys =
rocksdb_prefix + estimate_num_keys;
const std::string DB::Properties::kEstimateTableReadersMem =
rocksdb_prefix + estimate_table_readers_mem;
const std::string DB::Properties::kIsFileDeletionsEnabled =
rocksdb_prefix + is_file_deletions_enabled;
const std::string DB::Properties::kNumSnapshots =
rocksdb_prefix + num_snapshots;
const std::string DB::Properties::kOldestSnapshotTime =
rocksdb_prefix + oldest_snapshot_time;
const std::string DB::Properties::kOldestSnapshotSequence =
rocksdb_prefix + oldest_snapshot_sequence;
const std::string DB::Properties::kNumLiveVersions =
rocksdb_prefix + num_live_versions;
const std::string DB::Properties::kCurrentSuperVersionNumber =
rocksdb_prefix + current_version_number;
const std::string DB::Properties::kEstimateLiveDataSize =
rocksdb_prefix + estimate_live_data_size;
const std::string DB::Properties::kMinLogNumberToKeep =
rocksdb_prefix + min_log_number_to_keep_str;
const std::string DB::Properties::kMinObsoleteSstNumberToKeep =
rocksdb_prefix + min_obsolete_sst_number_to_keep_str;
const std::string DB::Properties::kTotalSstFilesSize =
rocksdb_prefix + total_sst_files_size;
const std::string DB::Properties::kLiveSstFilesSize =
rocksdb_prefix + live_sst_files_size;
const std::string DB::Properties::kObsoleteSstFilesSize =
rocksdb_prefix + obsolete_sst_files_size;
const std::string DB::Properties::kBaseLevel = rocksdb_prefix + base_level_str;
const std::string DB::Properties::kEstimatePendingCompactionBytes =
rocksdb_prefix + estimate_pending_comp_bytes;
const std::string DB::Properties::kAggregatedTableProperties =
rocksdb_prefix + aggregated_table_properties;
const std::string DB::Properties::kAggregatedTablePropertiesAtLevel =
rocksdb_prefix + aggregated_table_properties_at_level;
const std::string DB::Properties::kActualDelayedWriteRate =
rocksdb_prefix + actual_delayed_write_rate;
const std::string DB::Properties::kIsWriteStopped =
rocksdb_prefix + is_write_stopped;
const std::string DB::Properties::kEstimateOldestKeyTime =
rocksdb_prefix + estimate_oldest_key_time;
const std::string DB::Properties::kBlockCacheCapacity =
rocksdb_prefix + block_cache_capacity;
const std::string DB::Properties::kBlockCacheUsage =
rocksdb_prefix + block_cache_usage;
const std::string DB::Properties::kBlockCachePinnedUsage =
rocksdb_prefix + block_cache_pinned_usage;
const std::string DB::Properties::kOptionsStatistics =
rocksdb_prefix + options_statistics;
const std::string DB::Properties::kLiveSstFilesSizeAtTemperature =
rocksdb_prefix + live_sst_files_size_at_temperature;
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
const std::string DB::Properties::kNumBlobFiles =
rocksdb_prefix + num_blob_files;
const std::string DB::Properties::kBlobStats = rocksdb_prefix + blob_stats;
const std::string DB::Properties::kTotalBlobFileSize =
rocksdb_prefix + total_blob_file_size;
const std::string DB::Properties::kLiveBlobFileSize =
rocksdb_prefix + live_blob_file_size;
const std::string DB::Properties::kLiveBlobFileGarbageSize =
rocksdb_prefix + live_blob_file_garbage_size;
const std::string DB::Properties::kBlobCacheCapacity =
rocksdb_prefix + blob_cache_capacity;
const std::string DB::Properties::kBlobCacheUsage =
rocksdb_prefix + blob_cache_usage;
const std::string DB::Properties::kBlobCachePinnedUsage =
rocksdb_prefix + blob_cache_pinned_usage;
const std::string InternalStats::kPeriodicCFStats =
DB::Properties::kCFStats + ".periodic";
const int InternalStats::kMaxNoChangePeriodSinceDump = 8;
Meta-internal folly integration with F14FastMap (#9546) Summary: Especially after updating to C++17, I don't see a compelling case for *requiring* any folly components in RocksDB. I was able to purge the existing hard dependencies, and it can be quite difficult to strip out non-trivial components from folly for use in RocksDB. (The prospect of doing that on F14 has changed my mind on the best approach here.) But this change creates an optional integration where we can plug in components from folly at compile time, starting here with F14FastMap to replace std::unordered_map when possible (probably no public APIs for example). I have replaced the biggest CPU users of std::unordered_map with compile-time pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set. USE_FOLLY is always set in the Meta-internal buck build, and a simulation of that is in the Makefile for public CI testing. A full folly build is not needed, but checking out the full folly repo is much simpler for getting the dependency, and anything else we might want to optionally integrate in the future. Some picky details: * I don't think the distributed mutex stuff is actually used, so it was easy to remove. * I implemented an alternative to `folly::constexpr_log2` (which is much easier in C++17 than C++11) so that I could pull out the hard dependencies on `ConstexprMath.h` * I had to add noexcept move constructors/operators to some types to make F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a macro to make that easier in some common cases. * Updated Meta-internal buck build to use folly F14Map (always) No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a production integration for open source users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546 Test Plan: CircleCI tests updated so that a couple of them use folly. Most internal unit & stress/crash tests updated to use Meta-internal latest folly. (Note: they should probably use buck but they currently use Makefile.) Example performance improvement: when filter partitions are pinned in cache, they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build a test that exercises that heavily. Build DB with ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters ``` and test with (simultaneous runs with & without folly, ~20 times each to see convergence) ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache ``` Average ops/s no folly: 26229.2 Average ops/s with folly: 26853.3 (+2.4%) Reviewed By: ajkr Differential Revision: D34181736 Pulled By: pdillinger fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
2022-04-13 14:34:01 +00:00
const UnorderedMap<std::string, DBPropertyInfo>
InternalStats::ppt_name_to_info = {
{DB::Properties::kNumFilesAtLevelPrefix,
{false, &InternalStats::HandleNumFilesAtLevel, nullptr, nullptr,
nullptr}},
{DB::Properties::kCompressionRatioAtLevelPrefix,
{false, &InternalStats::HandleCompressionRatioAtLevelPrefix, nullptr,
nullptr, nullptr}},
{DB::Properties::kLevelStats,
{false, &InternalStats::HandleLevelStats, nullptr, nullptr, nullptr}},
{DB::Properties::kStats,
{false, &InternalStats::HandleStats, nullptr, nullptr, nullptr}},
{DB::Properties::kCFStats,
{false, &InternalStats::HandleCFStats, nullptr,
&InternalStats::HandleCFMapStats, nullptr}},
{InternalStats::kPeriodicCFStats,
{false, &InternalStats::HandleCFStatsPeriodic, nullptr, nullptr,
nullptr}},
{DB::Properties::kCFStatsNoFileHistogram,
{false, &InternalStats::HandleCFStatsNoFileHistogram, nullptr, nullptr,
nullptr}},
{DB::Properties::kCFFileHistogram,
{false, &InternalStats::HandleCFFileHistogram, nullptr, nullptr,
nullptr}},
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
{DB::Properties::kCFWriteStallStats,
{false, &InternalStats::HandleCFWriteStallStats, nullptr,
&InternalStats::HandleCFWriteStallStatsMap, nullptr}},
{DB::Properties::kDBStats,
{false, &InternalStats::HandleDBStats, nullptr,
&InternalStats::HandleDBMapStats, nullptr}},
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
{DB::Properties::kDBWriteStallStats,
{false, &InternalStats::HandleDBWriteStallStats, nullptr,
&InternalStats::HandleDBWriteStallStatsMap, nullptr}},
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
{DB::Properties::kBlockCacheEntryStats,
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
{true, &InternalStats::HandleBlockCacheEntryStats, nullptr,
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
&InternalStats::HandleBlockCacheEntryStatsMap, nullptr}},
{DB::Properties::kFastBlockCacheEntryStats,
{true, &InternalStats::HandleFastBlockCacheEntryStats, nullptr,
&InternalStats::HandleFastBlockCacheEntryStatsMap, nullptr}},
{DB::Properties::kSSTables,
{false, &InternalStats::HandleSsTables, nullptr, nullptr, nullptr}},
{DB::Properties::kAggregatedTableProperties,
{false, &InternalStats::HandleAggregatedTableProperties, nullptr,
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
&InternalStats::HandleAggregatedTablePropertiesMap, nullptr}},
{DB::Properties::kAggregatedTablePropertiesAtLevel,
{false, &InternalStats::HandleAggregatedTablePropertiesAtLevel,
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
nullptr, &InternalStats::HandleAggregatedTablePropertiesAtLevelMap,
nullptr}},
{DB::Properties::kNumImmutableMemTable,
{false, nullptr, &InternalStats::HandleNumImmutableMemTable, nullptr,
nullptr}},
{DB::Properties::kNumImmutableMemTableFlushed,
{false, nullptr, &InternalStats::HandleNumImmutableMemTableFlushed,
nullptr, nullptr}},
{DB::Properties::kMemTableFlushPending,
{false, nullptr, &InternalStats::HandleMemTableFlushPending, nullptr,
nullptr}},
{DB::Properties::kCompactionPending,
{false, nullptr, &InternalStats::HandleCompactionPending, nullptr,
nullptr}},
{DB::Properties::kBackgroundErrors,
{false, nullptr, &InternalStats::HandleBackgroundErrors, nullptr,
nullptr}},
{DB::Properties::kCurSizeActiveMemTable,
{false, nullptr, &InternalStats::HandleCurSizeActiveMemTable, nullptr,
nullptr}},
{DB::Properties::kCurSizeAllMemTables,
{false, nullptr, &InternalStats::HandleCurSizeAllMemTables, nullptr,
nullptr}},
{DB::Properties::kSizeAllMemTables,
{false, nullptr, &InternalStats::HandleSizeAllMemTables, nullptr,
nullptr}},
{DB::Properties::kNumEntriesActiveMemTable,
{false, nullptr, &InternalStats::HandleNumEntriesActiveMemTable,
nullptr, nullptr}},
{DB::Properties::kNumEntriesImmMemTables,
{false, nullptr, &InternalStats::HandleNumEntriesImmMemTables, nullptr,
nullptr}},
{DB::Properties::kNumDeletesActiveMemTable,
{false, nullptr, &InternalStats::HandleNumDeletesActiveMemTable,
nullptr, nullptr}},
{DB::Properties::kNumDeletesImmMemTables,
{false, nullptr, &InternalStats::HandleNumDeletesImmMemTables, nullptr,
nullptr}},
{DB::Properties::kEstimateNumKeys,
{false, nullptr, &InternalStats::HandleEstimateNumKeys, nullptr,
nullptr}},
{DB::Properties::kEstimateTableReadersMem,
{true, nullptr, &InternalStats::HandleEstimateTableReadersMem, nullptr,
nullptr}},
{DB::Properties::kIsFileDeletionsEnabled,
{false, nullptr, &InternalStats::HandleIsFileDeletionsEnabled, nullptr,
nullptr}},
{DB::Properties::kNumSnapshots,
{false, nullptr, &InternalStats::HandleNumSnapshots, nullptr,
nullptr}},
{DB::Properties::kOldestSnapshotTime,
{false, nullptr, &InternalStats::HandleOldestSnapshotTime, nullptr,
nullptr}},
{DB::Properties::kOldestSnapshotSequence,
{false, nullptr, &InternalStats::HandleOldestSnapshotSequence, nullptr,
nullptr}},
{DB::Properties::kNumLiveVersions,
{false, nullptr, &InternalStats::HandleNumLiveVersions, nullptr,
nullptr}},
{DB::Properties::kCurrentSuperVersionNumber,
{false, nullptr, &InternalStats::HandleCurrentSuperVersionNumber,
nullptr, nullptr}},
{DB::Properties::kEstimateLiveDataSize,
{true, nullptr, &InternalStats::HandleEstimateLiveDataSize, nullptr,
nullptr}},
{DB::Properties::kMinLogNumberToKeep,
{false, nullptr, &InternalStats::HandleMinLogNumberToKeep, nullptr,
nullptr}},
{DB::Properties::kMinObsoleteSstNumberToKeep,
{false, nullptr, &InternalStats::HandleMinObsoleteSstNumberToKeep,
nullptr, nullptr}},
{DB::Properties::kBaseLevel,
{false, nullptr, &InternalStats::HandleBaseLevel, nullptr, nullptr}},
{DB::Properties::kTotalSstFilesSize,
{false, nullptr, &InternalStats::HandleTotalSstFilesSize, nullptr,
nullptr}},
{DB::Properties::kLiveSstFilesSize,
{false, nullptr, &InternalStats::HandleLiveSstFilesSize, nullptr,
nullptr}},
{DB::Properties::kLiveSstFilesSizeAtTemperature,
Add manifest fix-up utility for file temperatures (#9683) Summary: The goal of this change is to allow changes to the "current" (in FileSystem) file temperatures to feed back into DB metadata, so that they can inform decisions and stats reporting. In part because of modular code factoring, it doesn't seem easy to do this automagically, where opening an SST file and observing current Temperature different from expected would trigger a change in metadata and DB manifest write (essentially giving the deep read path access to the write path). It is also difficult to do this while the DB is open because of the limitations of LogAndApply. This change allows updating file temperature metadata on a closed DB using an experimental utility function UpdateManifestForFilesState() or `ldb update_manifest --update_temperatures`. This should suffice for "migration" scenarios where outside tooling has placed or re-arranged DB files into a (different) tiered configuration without going through RocksDB itself (currently, only compaction can change temperature metadata). Some details: * Refactored and added unit test for `ldb unsafe_remove_sst_file` because of shared functionality * Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext move constructor (related to an older draft of this change) Possible follow-up work: * Support updating manifest with file checksums, such as when a new checksum function is used and want existing DB metadata updated for it. * It's possible that for some repair scenarios, lighter weight than full repair, we might want to support UpdateManifestForFilesState() to modify critical file details like size or checksum using same algorithm. But let's make sure these are differentiated from modifying file details in ways that don't suspect corruption (or require extreme trust). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683 Test Plan: unit tests added Reviewed By: jay-zhuang Differential Revision: D34798828 Pulled By: pdillinger fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
2022-03-18 23:35:51 +00:00
{false, &InternalStats::HandleLiveSstFilesSizeAtTemperature, nullptr,
nullptr, nullptr}},
{DB::Properties::kObsoleteSstFilesSize,
{false, nullptr, &InternalStats::HandleObsoleteSstFilesSize, nullptr,
nullptr}},
{DB::Properties::kEstimatePendingCompactionBytes,
{false, nullptr, &InternalStats::HandleEstimatePendingCompactionBytes,
nullptr, nullptr}},
{DB::Properties::kNumRunningFlushes,
{false, nullptr, &InternalStats::HandleNumRunningFlushes, nullptr,
nullptr}},
{DB::Properties::kNumRunningCompactions,
{false, nullptr, &InternalStats::HandleNumRunningCompactions, nullptr,
nullptr}},
{DB::Properties::kActualDelayedWriteRate,
{false, nullptr, &InternalStats::HandleActualDelayedWriteRate, nullptr,
nullptr}},
{DB::Properties::kIsWriteStopped,
{false, nullptr, &InternalStats::HandleIsWriteStopped, nullptr,
nullptr}},
{DB::Properties::kEstimateOldestKeyTime,
{false, nullptr, &InternalStats::HandleEstimateOldestKeyTime, nullptr,
nullptr}},
{DB::Properties::kBlockCacheCapacity,
{false, nullptr, &InternalStats::HandleBlockCacheCapacity, nullptr,
nullptr}},
{DB::Properties::kBlockCacheUsage,
{false, nullptr, &InternalStats::HandleBlockCacheUsage, nullptr,
nullptr}},
{DB::Properties::kBlockCachePinnedUsage,
{false, nullptr, &InternalStats::HandleBlockCachePinnedUsage, nullptr,
nullptr}},
{DB::Properties::kOptionsStatistics,
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
{true, nullptr, nullptr, nullptr,
&DBImpl::GetPropertyHandleOptionsStatistics}},
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
{DB::Properties::kNumBlobFiles,
{false, nullptr, &InternalStats::HandleNumBlobFiles, nullptr,
nullptr}},
{DB::Properties::kBlobStats,
{false, &InternalStats::HandleBlobStats, nullptr, nullptr, nullptr}},
{DB::Properties::kTotalBlobFileSize,
{false, nullptr, &InternalStats::HandleTotalBlobFileSize, nullptr,
nullptr}},
{DB::Properties::kLiveBlobFileSize,
{false, nullptr, &InternalStats::HandleLiveBlobFileSize, nullptr,
nullptr}},
{DB::Properties::kLiveBlobFileGarbageSize,
{false, nullptr, &InternalStats::HandleLiveBlobFileGarbageSize,
nullptr, nullptr}},
{DB::Properties::kBlobCacheCapacity,
{false, nullptr, &InternalStats::HandleBlobCacheCapacity, nullptr,
nullptr}},
{DB::Properties::kBlobCacheUsage,
{false, nullptr, &InternalStats::HandleBlobCacheUsage, nullptr,
nullptr}},
{DB::Properties::kBlobCachePinnedUsage,
{false, nullptr, &InternalStats::HandleBlobCachePinnedUsage, nullptr,
nullptr}},
};
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
InternalStats::InternalStats(int num_levels, SystemClock* clock,
ColumnFamilyData* cfd)
: db_stats_{},
cf_stats_value_{},
cf_stats_count_{},
comp_stats_(num_levels),
comp_stats_by_pri_(Env::Priority::TOTAL),
file_read_latency_(num_levels),
has_cf_change_since_dump_(true),
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
bg_error_count_(0),
number_levels_(num_levels),
clock_(clock),
cfd_(cfd),
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
started_at_(clock->NowMicros()) {
Cache* block_cache = GetBlockCacheForStats();
if (block_cache) {
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
// Extract or create stats collector. Could fail in rare cases.
Status s = CacheEntryStatsCollector<CacheEntryRoleStats>::GetShared(
block_cache, clock_, &cache_entry_stats_collector_);
if (s.ok()) {
assert(cache_entry_stats_collector_);
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
} else {
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
assert(!cache_entry_stats_collector_);
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
}
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
}
}
void InternalStats::TEST_GetCacheEntryRoleStats(CacheEntryRoleStats* stats,
bool foreground) {
CollectCacheEntryStats(foreground);
if (cache_entry_stats_collector_) {
cache_entry_stats_collector_->GetStats(stats);
}
}
void InternalStats::CollectCacheEntryStats(bool foreground) {
// This function is safe to call from any thread because
// cache_entry_stats_collector_ field is const after constructor
// and ->GetStats does its own synchronization, which also suffices for
// cache_entry_stats_.
if (!cache_entry_stats_collector_) {
return; // nothing to do (e.g. no block cache)
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
}
Pin CacheEntryStatsCollector to fix performance bug (#8385) Summary: If the block Cache is full with strict_capacity_limit=false, then our CacheEntryStatsCollector could be immediately evicted on release, so iterating through column families with shared block cache could trigger re-scan for each CF. This change fixes that problem by pinning the CacheEntryStatsCollector from InternalStats so that it's not evicted. I had originally thought that this object could participate in LRU like everything else, but even though a re-load+re-scan only touches memory, it can be orders of magnitude more expensive than other cache misses. One service in Facebook has scans that take ~20s over 100GB block cache that is mostly 4KB entries. (The up-side of this bug and https://github.com/facebook/rocksdb/issues/8369 is that we had a natural experiment on the effect on some service metrics even with block cache scans running continuously in the background--a kind of worst case scenario. Metrics like latency were not affected enough to trigger warnings.) Other smaller fixes: 20s is already a sizable portion of 600s stats dump period, or 180s default max age to force re-scan, so added logic to ensure that (for each block cache) we don't spend more than 0.2% of our background thread time scanning it. Nevertheless, "foreground" requests for cache entry stats (calls to `db->GetMapProperty(DB::Properties::kBlockCacheEntryStats)`) are permitted to consume more CPU. Renamed field to cache_entry_stats_ to match code style. This change is intended for patching in 6.21 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8385 Test Plan: unit test expanded to cover new logic (detect regression), some manual testing with db_bench Reviewed By: ajkr Differential Revision: D29042759 Pulled By: pdillinger fbshipit-source-id: 236faa902397f50038c618f50fbc8cf3f277308c
2021-06-14 15:14:21 +00:00
// For "background" collections, strictly cap the collection time by
// expanding effective cache TTL. For foreground, be more aggressive about
// getting latest data.
int min_interval_seconds = foreground ? 10 : 180;
// 1/500 = max of 0.2% of one CPU thread
int min_interval_factor = foreground ? 10 : 500;
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
cache_entry_stats_collector_->CollectStats(min_interval_seconds,
min_interval_factor);
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
}
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2023-01-11 22:20:40 +00:00
std::function<void(const Slice& key, Cache::ObjectPtr value, size_t charge,
const Cache::CacheItemHelper* helper)>
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
InternalStats::CacheEntryRoleStats::GetEntryCallback() {
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2023-01-11 22:20:40 +00:00
return [&](const Slice& /*key*/, Cache::ObjectPtr /*value*/, size_t charge,
const Cache::CacheItemHelper* helper) -> void {
size_t role_idx =
static_cast<size_t>(helper ? helper->role : CacheEntryRole::kMisc);
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
entry_counts[role_idx]++;
total_charges[role_idx] += charge;
};
}
void InternalStats::CacheEntryRoleStats::BeginCollection(
Cache* cache, SystemClock*, uint64_t start_time_micros) {
Clear();
last_start_time_micros_ = start_time_micros;
++collection_count;
std::ostringstream str;
str << cache->Name() << "@" << static_cast<void*>(cache) << "#"
<< port::GetProcessID();
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
cache_id = str.str();
cache_capacity = cache->GetCapacity();
Revamp, optimize new experimental clock cache (#10626) Summary: * Consolidates most metadata into a single word per slot so that more can be accomplished with a single atomic update. In the common case, Lookup was previously about 4 atomic updates, now just 1 atomic update. Common case Release was previously 1 atomic read + 1 atomic update, now just 1 atomic update. * Eliminate spins / waits / yields, which likely threaten some "lock free" benefits. Compare-exchange loops are only used in explicit Erase, and strict_capacity_limit=true Insert. Eviction uses opportunistic compare- exchange. * Relaxes some aggressiveness and guarantees. For example, * Duplicate Inserts will sometimes go undetected and the shadow duplicate will age out with eviction. * In many cases, the older Inserted value for a given cache key will be kept (i.e. Insert does not support overwrite). * Entries explicitly erased (rather than evicted) might not be freed immediately in some rare cases. * With strict_capacity_limit=false, capacity limit is not tracked/enforced as precisely as LRUCache, but is self-correcting and should only deviate by a very small number of extra or fewer entries. * Use smaller "computed default" number of cache shards in many cases, because benefits to larger usage tracking / eviction pools outweigh the small cost of more lock-free atomic contention. The improvement in CPU and I/O is dramatic in some limit-memory cases. * Even without the sharding change, the eviction algorithm is likely more effective than LRU overall because it's more stateful, even though the "hot path" state tracking for it is essentially free with ref counting. It is like a generalized CLOCK with aging (see code comments). I don't have performance numbers showing a specific improvement, but in theory, for a Poisson access pattern to each block, keeping some state allows better estimation of time to next access (Poisson interval) than strict LRU. The bounded randomness in CLOCK can also reduce "cliff" effect for repeated range scans approaching and exceeding cache size. ## Hot path algorithm comparison Rough descriptions, focusing on number and kind of atomic operations: * Old `Lookup()` (2-5 atomic updates per probe): ``` Loop: Increment internal ref count at slot If possible hit: Check flags atomic (and non-atomic fields) If cache hit: Three distinct updates to 'flags' atomic Increment refs for internal-to-external Return Decrement internal ref count while atomic read 'displacements' > 0 ``` * New `Lookup()` (1-2 atomic updates per probe): ``` Loop: Increment acquire counter in meta word (optimistic) If visible entry (already read meta word): If match (read non-atomic fields): Return Else: Decrement acquire counter in meta word Else if invisible entry (rare, already read meta word): Decrement acquire counter in meta word while atomic read 'displacements' > 0 ``` * Old `Release()` (1 atomic update, conditional on atomic read, rarely more): ``` Read atomic ref count If last reference and invisible (rare): Use CAS etc. to remove Return Else: Decrement ref count ``` * New `Release()` (1 unconditional atomic update, rarely more): ``` Increment release counter in meta word If last reference and invisible (rare): Use CAS etc. to remove Return ``` ## Performance test setup Build DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` Test with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics ``` Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations: base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6) folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry) gt_clock: experimental ClockCache before this change new_clock: experimental ClockCache with this change ## Performance test results First test "hot path" read performance, with block cache large enough for whole DB: 4181MB 1thread base -> kops/s: 47.761 4181MB 1thread folly -> kops/s: 45.877 4181MB 1thread gt_clock -> kops/s: 51.092 4181MB 1thread new_clock -> kops/s: 53.944 4181MB 16thread base -> kops/s: 284.567 4181MB 16thread folly -> kops/s: 249.015 4181MB 16thread gt_clock -> kops/s: 743.762 4181MB 16thread new_clock -> kops/s: 861.821 4181MB 24thread base -> kops/s: 303.415 4181MB 24thread folly -> kops/s: 266.548 4181MB 24thread gt_clock -> kops/s: 975.706 4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944) 4181MB 32thread base -> kops/s: 311.251 4181MB 32thread folly -> kops/s: 274.952 4181MB 32thread gt_clock -> kops/s: 1045.98 4181MB 32thread new_clock -> kops/s: 1370.38 4181MB 48thread base -> kops/s: 310.504 4181MB 48thread folly -> kops/s: 268.322 4181MB 48thread gt_clock -> kops/s: 1195.65 4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944) 4181MB 64thread base -> kops/s: 307.839 4181MB 64thread folly -> kops/s: 272.172 4181MB 64thread gt_clock -> kops/s: 1204.47 4181MB 64thread new_clock -> kops/s: 1615.37 4181MB 128thread base -> kops/s: 310.934 4181MB 128thread folly -> kops/s: 267.468 4181MB 128thread gt_clock -> kops/s: 1188.75 4181MB 128thread new_clock -> kops/s: 1595.46 Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x. Now test a large block cache with low miss ratio, but some eviction is required: 1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23 1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43 1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4 1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56 1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59 1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8 1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89 1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45 1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98 1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91 1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26 1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63 610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137 610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996 610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934 610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5 610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402 610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742 610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062 610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453 610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457 610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426 610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273 610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812 The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.) Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc. 233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371 233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293 233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844 233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461 233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227 233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738 233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688 233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402 233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84 233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785 233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94 233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016 89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086 89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984 89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441 89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754 89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812 89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418 89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422 89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293 89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43 89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824 89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32 89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223 ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.) Even smaller cache size: 34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914 34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281 34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523 34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125 34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48 34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531 34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465 34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793 34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484 34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457 34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41 34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52 As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn: 13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328 13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633 13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684 13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383 13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492 13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863 13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121 13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758 13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539 13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098 13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77 13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27 gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention: 13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852 13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516 13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688 13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707 13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57 13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219 13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871 13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626 Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN Reviewed By: anand1976 Differential Revision: D39368406 Pulled By: pdillinger fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2022-09-16 07:24:11 +00:00
cache_usage = cache->GetUsage();
table_size = cache->GetTableAddressCount();
occupancy = cache->GetOccupancyCount();
hash_seed = cache->GetHashSeed();
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
}
void InternalStats::CacheEntryRoleStats::EndCollection(
Cache*, SystemClock*, uint64_t end_time_micros) {
last_end_time_micros_ = end_time_micros;
}
void InternalStats::CacheEntryRoleStats::SkippedCollection() {
++copies_of_last_collection;
}
uint64_t InternalStats::CacheEntryRoleStats::GetLastDurationMicros() const {
if (last_end_time_micros_ > last_start_time_micros_) {
return last_end_time_micros_ - last_start_time_micros_;
} else {
return 0U;
}
}
std::string InternalStats::CacheEntryRoleStats::ToString(
SystemClock* clock) const {
std::ostringstream str;
str << "Block cache " << cache_id
<< " capacity: " << BytesToHumanString(cache_capacity)
<< " seed: " << hash_seed << " usage: " << BytesToHumanString(cache_usage)
Revamp, optimize new experimental clock cache (#10626) Summary: * Consolidates most metadata into a single word per slot so that more can be accomplished with a single atomic update. In the common case, Lookup was previously about 4 atomic updates, now just 1 atomic update. Common case Release was previously 1 atomic read + 1 atomic update, now just 1 atomic update. * Eliminate spins / waits / yields, which likely threaten some "lock free" benefits. Compare-exchange loops are only used in explicit Erase, and strict_capacity_limit=true Insert. Eviction uses opportunistic compare- exchange. * Relaxes some aggressiveness and guarantees. For example, * Duplicate Inserts will sometimes go undetected and the shadow duplicate will age out with eviction. * In many cases, the older Inserted value for a given cache key will be kept (i.e. Insert does not support overwrite). * Entries explicitly erased (rather than evicted) might not be freed immediately in some rare cases. * With strict_capacity_limit=false, capacity limit is not tracked/enforced as precisely as LRUCache, but is self-correcting and should only deviate by a very small number of extra or fewer entries. * Use smaller "computed default" number of cache shards in many cases, because benefits to larger usage tracking / eviction pools outweigh the small cost of more lock-free atomic contention. The improvement in CPU and I/O is dramatic in some limit-memory cases. * Even without the sharding change, the eviction algorithm is likely more effective than LRU overall because it's more stateful, even though the "hot path" state tracking for it is essentially free with ref counting. It is like a generalized CLOCK with aging (see code comments). I don't have performance numbers showing a specific improvement, but in theory, for a Poisson access pattern to each block, keeping some state allows better estimation of time to next access (Poisson interval) than strict LRU. The bounded randomness in CLOCK can also reduce "cliff" effect for repeated range scans approaching and exceeding cache size. ## Hot path algorithm comparison Rough descriptions, focusing on number and kind of atomic operations: * Old `Lookup()` (2-5 atomic updates per probe): ``` Loop: Increment internal ref count at slot If possible hit: Check flags atomic (and non-atomic fields) If cache hit: Three distinct updates to 'flags' atomic Increment refs for internal-to-external Return Decrement internal ref count while atomic read 'displacements' > 0 ``` * New `Lookup()` (1-2 atomic updates per probe): ``` Loop: Increment acquire counter in meta word (optimistic) If visible entry (already read meta word): If match (read non-atomic fields): Return Else: Decrement acquire counter in meta word Else if invisible entry (rare, already read meta word): Decrement acquire counter in meta word while atomic read 'displacements' > 0 ``` * Old `Release()` (1 atomic update, conditional on atomic read, rarely more): ``` Read atomic ref count If last reference and invisible (rare): Use CAS etc. to remove Return Else: Decrement ref count ``` * New `Release()` (1 unconditional atomic update, rarely more): ``` Increment release counter in meta word If last reference and invisible (rare): Use CAS etc. to remove Return ``` ## Performance test setup Build DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` Test with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics ``` Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations: base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6) folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry) gt_clock: experimental ClockCache before this change new_clock: experimental ClockCache with this change ## Performance test results First test "hot path" read performance, with block cache large enough for whole DB: 4181MB 1thread base -> kops/s: 47.761 4181MB 1thread folly -> kops/s: 45.877 4181MB 1thread gt_clock -> kops/s: 51.092 4181MB 1thread new_clock -> kops/s: 53.944 4181MB 16thread base -> kops/s: 284.567 4181MB 16thread folly -> kops/s: 249.015 4181MB 16thread gt_clock -> kops/s: 743.762 4181MB 16thread new_clock -> kops/s: 861.821 4181MB 24thread base -> kops/s: 303.415 4181MB 24thread folly -> kops/s: 266.548 4181MB 24thread gt_clock -> kops/s: 975.706 4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944) 4181MB 32thread base -> kops/s: 311.251 4181MB 32thread folly -> kops/s: 274.952 4181MB 32thread gt_clock -> kops/s: 1045.98 4181MB 32thread new_clock -> kops/s: 1370.38 4181MB 48thread base -> kops/s: 310.504 4181MB 48thread folly -> kops/s: 268.322 4181MB 48thread gt_clock -> kops/s: 1195.65 4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944) 4181MB 64thread base -> kops/s: 307.839 4181MB 64thread folly -> kops/s: 272.172 4181MB 64thread gt_clock -> kops/s: 1204.47 4181MB 64thread new_clock -> kops/s: 1615.37 4181MB 128thread base -> kops/s: 310.934 4181MB 128thread folly -> kops/s: 267.468 4181MB 128thread gt_clock -> kops/s: 1188.75 4181MB 128thread new_clock -> kops/s: 1595.46 Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x. Now test a large block cache with low miss ratio, but some eviction is required: 1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23 1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43 1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4 1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56 1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59 1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8 1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89 1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45 1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98 1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91 1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26 1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63 610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137 610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996 610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934 610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5 610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402 610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742 610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062 610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453 610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457 610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426 610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273 610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812 The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.) Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc. 233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371 233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293 233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844 233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461 233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227 233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738 233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688 233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402 233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84 233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785 233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94 233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016 89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086 89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984 89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441 89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754 89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812 89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418 89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422 89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293 89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43 89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824 89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32 89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223 ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.) Even smaller cache size: 34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914 34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281 34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523 34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125 34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48 34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531 34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465 34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793 34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484 34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457 34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41 34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52 As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn: 13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328 13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633 13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684 13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383 13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492 13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863 13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121 13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758 13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539 13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098 13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77 13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27 gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention: 13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852 13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516 13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688 13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707 13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57 13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219 13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871 13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626 Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN Reviewed By: anand1976 Differential Revision: D39368406 Pulled By: pdillinger fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2022-09-16 07:24:11 +00:00
<< " table_size: " << table_size << " occupancy: " << occupancy
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
<< " collections: " << collection_count
<< " last_copies: " << copies_of_last_collection
<< " last_secs: " << (GetLastDurationMicros() / 1000000.0)
<< " secs_since: "
<< ((clock->NowMicros() - last_end_time_micros_) / 1000000U) << "\n";
str << "Block cache entry stats(count,size,portion):";
for (size_t i = 0; i < kNumCacheEntryRoles; ++i) {
if (entry_counts[i] > 0) {
str << " " << kCacheEntryRoleToCamelString[i] << "(" << entry_counts[i]
<< "," << BytesToHumanString(total_charges[i]) << ","
<< (100.0 * total_charges[i] / cache_capacity) << "%)";
}
}
str << "\n";
return str.str();
}
void InternalStats::CacheEntryRoleStats::ToMap(
std::map<std::string, std::string>* values, SystemClock* clock) const {
values->clear();
auto& v = *values;
v[BlockCacheEntryStatsMapKeys::CacheId()] = cache_id;
v[BlockCacheEntryStatsMapKeys::CacheCapacityBytes()] =
std::to_string(cache_capacity);
v[BlockCacheEntryStatsMapKeys::LastCollectionDurationSeconds()] =
std::to_string(GetLastDurationMicros() / 1000000.0);
v[BlockCacheEntryStatsMapKeys::LastCollectionAgeSeconds()] =
std::to_string((clock->NowMicros() - last_end_time_micros_) / 1000000U);
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
for (size_t i = 0; i < kNumCacheEntryRoles; ++i) {
auto role = static_cast<CacheEntryRole>(i);
v[BlockCacheEntryStatsMapKeys::EntryCount(role)] =
std::to_string(entry_counts[i]);
v[BlockCacheEntryStatsMapKeys::UsedBytes(role)] =
std::to_string(total_charges[i]);
v[BlockCacheEntryStatsMapKeys::UsedPercent(role)] =
std::to_string(100.0 * total_charges[i] / cache_capacity);
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
}
}
bool InternalStats::HandleBlockCacheEntryStatsInternal(std::string* value,
bool fast) {
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
if (!cache_entry_stats_collector_) {
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
return false;
}
CollectCacheEntryStats(!fast /* foreground */);
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
CacheEntryRoleStats stats;
cache_entry_stats_collector_->GetStats(&stats);
*value = stats.ToString(clock_);
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
return true;
}
bool InternalStats::HandleBlockCacheEntryStatsMapInternal(
std::map<std::string, std::string>* values, bool fast) {
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
if (!cache_entry_stats_collector_) {
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
return false;
}
CollectCacheEntryStats(!fast /* foreground */);
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
CacheEntryRoleStats stats;
cache_entry_stats_collector_->GetStats(&stats);
stats.ToMap(values, clock_);
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
return true;
}
bool InternalStats::HandleBlockCacheEntryStats(std::string* value,
Slice /*suffix*/) {
return HandleBlockCacheEntryStatsInternal(value, false /* fast */);
}
bool InternalStats::HandleBlockCacheEntryStatsMap(
std::map<std::string, std::string>* values, Slice /*suffix*/) {
return HandleBlockCacheEntryStatsMapInternal(values, false /* fast */);
}
bool InternalStats::HandleFastBlockCacheEntryStats(std::string* value,
Slice /*suffix*/) {
return HandleBlockCacheEntryStatsInternal(value, true /* fast */);
}
bool InternalStats::HandleFastBlockCacheEntryStatsMap(
std::map<std::string, std::string>* values, Slice /*suffix*/) {
return HandleBlockCacheEntryStatsMapInternal(values, true /* fast */);
}
bool InternalStats::HandleLiveSstFilesSizeAtTemperature(std::string* value,
Slice suffix) {
uint64_t temperature;
bool ok = ConsumeDecimalNumber(&suffix, &temperature) && suffix.empty();
if (!ok) {
return false;
}
uint64_t size = 0;
const auto* vstorage = cfd_->current()->storage_info();
for (int level = 0; level < vstorage->num_levels(); level++) {
for (const auto& file_meta : vstorage->LevelFiles(level)) {
if (static_cast<uint8_t>(file_meta->temperature) == temperature) {
size += file_meta->fd.GetFileSize();
}
}
}
*value = std::to_string(size);
return true;
}
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
bool InternalStats::HandleNumBlobFiles(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
assert(value);
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
assert(cfd_);
const auto* current = cfd_->current();
assert(current);
const auto* vstorage = current->storage_info();
assert(vstorage);
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
const auto& blob_files = vstorage->GetBlobFiles();
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
*value = blob_files.size();
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
return true;
}
bool InternalStats::HandleBlobStats(std::string* value, Slice /*suffix*/) {
assert(value);
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
assert(cfd_);
const auto* current = cfd_->current();
assert(current);
const auto* vstorage = current->storage_info();
assert(vstorage);
const auto blob_st = vstorage->GetBlobStats();
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
std::ostringstream oss;
oss << "Number of blob files: " << vstorage->GetBlobFiles().size()
<< "\nTotal size of blob files: " << blob_st.total_file_size
<< "\nTotal size of garbage in blob files: " << blob_st.total_garbage_size
<< "\nBlob file space amplification: " << blob_st.space_amp << '\n';
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
value->append(oss.str());
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
return true;
}
bool InternalStats::HandleTotalBlobFileSize(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
assert(value);
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
assert(cfd_);
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
*value = cfd_->GetTotalBlobFileSize();
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
return true;
}
bool InternalStats::HandleLiveBlobFileSize(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
assert(value);
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
assert(cfd_);
const auto* current = cfd_->current();
assert(current);
const auto* vstorage = current->storage_info();
assert(vstorage);
*value = vstorage->GetBlobStats().total_file_size;
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
Add DB properties for BlobDB (#8734) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
2021-09-08 19:19:01 +00:00
return true;
}
bool InternalStats::HandleLiveBlobFileGarbageSize(uint64_t* value,
DBImpl* /*db*/,
Version* /*version*/) {
assert(value);
assert(cfd_);
const auto* current = cfd_->current();
assert(current);
const auto* vstorage = current->storage_info();
assert(vstorage);
*value = vstorage->GetBlobStats().total_garbage_size;
return true;
}
Cache* InternalStats::GetBlobCacheForStats() {
return cfd_->ioptions()->blob_cache.get();
}
bool InternalStats::HandleBlobCacheCapacity(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
Cache* blob_cache = GetBlobCacheForStats();
if (blob_cache) {
*value = static_cast<uint64_t>(blob_cache->GetCapacity());
return true;
}
return false;
}
bool InternalStats::HandleBlobCacheUsage(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
Cache* blob_cache = GetBlobCacheForStats();
if (blob_cache) {
*value = static_cast<uint64_t>(blob_cache->GetUsage());
return true;
}
return false;
}
bool InternalStats::HandleBlobCachePinnedUsage(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
Cache* blob_cache = GetBlobCacheForStats();
if (blob_cache) {
*value = static_cast<uint64_t>(blob_cache->GetPinnedUsage());
return true;
}
return false;
}
const DBPropertyInfo* GetPropertyInfo(const Slice& property) {
std::string ppt_name = GetPropertyNameAndArg(property).first.ToString();
auto ppt_info_iter = InternalStats::ppt_name_to_info.find(ppt_name);
if (ppt_info_iter == InternalStats::ppt_name_to_info.end()) {
return nullptr;
}
return &ppt_info_iter->second;
}
bool InternalStats::GetStringProperty(const DBPropertyInfo& property_info,
const Slice& property,
std::string* value) {
assert(value != nullptr);
assert(property_info.handle_string != nullptr);
Slice arg = GetPropertyNameAndArg(property).second;
return (this->*(property_info.handle_string))(value, arg);
}
bool InternalStats::GetMapProperty(const DBPropertyInfo& property_info,
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
const Slice& property,
std::map<std::string, std::string>* value) {
assert(value != nullptr);
assert(property_info.handle_map != nullptr);
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
Slice arg = GetPropertyNameAndArg(property).second;
return (this->*(property_info.handle_map))(value, arg);
}
bool InternalStats::GetIntProperty(const DBPropertyInfo& property_info,
uint64_t* value, DBImpl* db) {
assert(value != nullptr);
assert(property_info.handle_int != nullptr &&
!property_info.need_out_of_mutex);
db->mutex_.AssertHeld();
return (this->*(property_info.handle_int))(value, db, nullptr /* version */);
}
bool InternalStats::GetIntPropertyOutOfMutex(
const DBPropertyInfo& property_info, Version* version, uint64_t* value) {
assert(value != nullptr);
assert(property_info.handle_int != nullptr &&
property_info.need_out_of_mutex);
return (this->*(property_info.handle_int))(value, nullptr /* db */, version);
}
bool InternalStats::HandleNumFilesAtLevel(std::string* value, Slice suffix) {
uint64_t level;
const auto* vstorage = cfd_->current()->storage_info();
bool ok = ConsumeDecimalNumber(&suffix, &level) && suffix.empty();
if (!ok || static_cast<int>(level) >= number_levels_) {
return false;
} else {
char buf[100];
snprintf(buf, sizeof(buf), "%d",
vstorage->NumLevelFiles(static_cast<int>(level)));
*value = buf;
return true;
}
}
bool InternalStats::HandleCompressionRatioAtLevelPrefix(std::string* value,
Slice suffix) {
uint64_t level;
const auto* vstorage = cfd_->current()->storage_info();
bool ok = ConsumeDecimalNumber(&suffix, &level) && suffix.empty();
if (!ok || level >= static_cast<uint64_t>(number_levels_)) {
return false;
}
*value = std::to_string(
vstorage->GetEstimatedCompressionRatioAtLevel(static_cast<int>(level)));
return true;
}
bool InternalStats::HandleLevelStats(std::string* value, Slice /*suffix*/) {
char buf[1000];
const auto* vstorage = cfd_->current()->storage_info();
snprintf(buf, sizeof(buf),
"Level Files Size(MB)\n"
"--------------------\n");
value->append(buf);
for (int level = 0; level < number_levels_; level++) {
snprintf(buf, sizeof(buf), "%3d %8d %8.0f\n", level,
vstorage->NumLevelFiles(level),
vstorage->NumLevelBytes(level) / kMB);
value->append(buf);
}
return true;
}
bool InternalStats::HandleStats(std::string* value, Slice suffix) {
if (!HandleCFStats(value, suffix)) {
return false;
}
if (!HandleDBStats(value, suffix)) {
return false;
}
return true;
}
bool InternalStats::HandleCFMapStats(
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
std::map<std::string, std::string>* cf_stats, Slice /*suffix*/) {
DumpCFMapStats(cf_stats);
return true;
}
bool InternalStats::HandleCFStats(std::string* value, Slice /*suffix*/) {
DumpCFStats(value);
return true;
}
bool InternalStats::HandleCFStatsPeriodic(std::string* value,
Slice /*suffix*/) {
bool has_change = has_cf_change_since_dump_;
if (!has_change) {
// If file histogram changes, there is activity in this period too.
uint64_t new_histogram_num = 0;
for (int level = 0; level < number_levels_; level++) {
new_histogram_num += file_read_latency_[level].num();
}
new_histogram_num += blob_file_read_latency_.num();
if (new_histogram_num != last_histogram_num) {
has_change = true;
last_histogram_num = new_histogram_num;
}
}
if (has_change) {
no_cf_change_period_since_dump_ = 0;
has_cf_change_since_dump_ = false;
} else if (no_cf_change_period_since_dump_++ > 0) {
// Not ready to sync
if (no_cf_change_period_since_dump_ == kMaxNoChangePeriodSinceDump) {
// Next periodic, we need to dump stats even if there is no change.
no_cf_change_period_since_dump_ = 0;
}
return true;
}
DumpCFStatsNoFileHistogram(/*is_periodic=*/true, value);
DumpCFFileHistogram(value);
return true;
}
bool InternalStats::HandleCFStatsNoFileHistogram(std::string* value,
Slice /*suffix*/) {
DumpCFStatsNoFileHistogram(/*is_periodic=*/false, value);
return true;
}
bool InternalStats::HandleCFFileHistogram(std::string* value,
Slice /*suffix*/) {
DumpCFFileHistogram(value);
return true;
}
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
bool InternalStats::HandleCFWriteStallStats(std::string* value,
Slice /*suffix*/) {
DumpCFStatsWriteStall(value);
return true;
}
bool InternalStats::HandleCFWriteStallStatsMap(
std::map<std::string, std::string>* value, Slice /*suffix*/) {
DumpCFMapStatsWriteStall(value);
return true;
}
bool InternalStats::HandleDBMapStats(
std::map<std::string, std::string>* db_stats, Slice /*suffix*/) {
DumpDBMapStats(db_stats);
return true;
}
bool InternalStats::HandleDBStats(std::string* value, Slice /*suffix*/) {
DumpDBStats(value);
return true;
}
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
bool InternalStats::HandleDBWriteStallStats(std::string* value,
Slice /*suffix*/) {
DumpDBStatsWriteStall(value);
return true;
}
bool InternalStats::HandleDBWriteStallStatsMap(
std::map<std::string, std::string>* value, Slice /*suffix*/) {
DumpDBMapStatsWriteStall(value);
return true;
}
bool InternalStats::HandleSsTables(std::string* value, Slice /*suffix*/) {
auto* current = cfd_->current();
*value = current->DebugString(true, true);
return true;
}
bool InternalStats::HandleAggregatedTableProperties(std::string* value,
Slice /*suffix*/) {
std::shared_ptr<const TableProperties> tp;
Group SST write in flush, compaction and db open with new stats (#11910) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
// TODO: plumb Env::IOActivity, Env::IOPriority
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
const ReadOptions read_options;
auto s = cfd_->current()->GetAggregatedTableProperties(read_options, &tp);
if (!s.ok()) {
return false;
}
*value = tp->ToString();
return true;
}
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
static std::map<std::string, std::string> MapUint64ValuesToString(
const std::map<std::string, uint64_t>& from) {
std::map<std::string, std::string> to;
for (const auto& e : from) {
to[e.first] = std::to_string(e.second);
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
}
return to;
}
bool InternalStats::HandleAggregatedTablePropertiesMap(
std::map<std::string, std::string>* values, Slice /*suffix*/) {
std::shared_ptr<const TableProperties> tp;
Group SST write in flush, compaction and db open with new stats (#11910) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
// TODO: plumb Env::IOActivity, Env::IOPriority
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
const ReadOptions read_options;
auto s = cfd_->current()->GetAggregatedTableProperties(read_options, &tp);
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
if (!s.ok()) {
return false;
}
*values = MapUint64ValuesToString(tp->GetAggregatablePropertiesAsMap());
return true;
}
bool InternalStats::HandleAggregatedTablePropertiesAtLevel(std::string* values,
Slice suffix) {
uint64_t level;
bool ok = ConsumeDecimalNumber(&suffix, &level) && suffix.empty();
if (!ok || static_cast<int>(level) >= number_levels_) {
return false;
}
std::shared_ptr<const TableProperties> tp;
Group SST write in flush, compaction and db open with new stats (#11910) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
// TODO: plumb Env::IOActivity, Env::IOPriority
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
const ReadOptions read_options;
auto s = cfd_->current()->GetAggregatedTableProperties(
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
read_options, &tp, static_cast<int>(level));
if (!s.ok()) {
return false;
}
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
*values = tp->ToString();
return true;
}
bool InternalStats::HandleAggregatedTablePropertiesAtLevelMap(
std::map<std::string, std::string>* values, Slice suffix) {
uint64_t level;
bool ok = ConsumeDecimalNumber(&suffix, &level) && suffix.empty();
if (!ok || static_cast<int>(level) >= number_levels_) {
return false;
}
std::shared_ptr<const TableProperties> tp;
Group SST write in flush, compaction and db open with new stats (#11910) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
// TODO: plumb Env::IOActivity, Env::IOPriority
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
const ReadOptions read_options;
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
auto s = cfd_->current()->GetAggregatedTableProperties(
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
read_options, &tp, static_cast<int>(level));
aggregated-table-properties with GetMapProperty (#7779) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties Also fixed some missing/inaccurate related comments in db.h For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
2020-12-19 15:59:08 +00:00
if (!s.ok()) {
return false;
}
*values = MapUint64ValuesToString(tp->GetAggregatablePropertiesAsMap());
return true;
}
bool InternalStats::HandleNumImmutableMemTable(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
*value = cfd_->imm()->NumNotFlushed();
return true;
}
bool InternalStats::HandleNumImmutableMemTableFlushed(uint64_t* value,
DBImpl* /*db*/,
Version* /*version*/) {
*value = cfd_->imm()->NumFlushed();
return true;
}
bool InternalStats::HandleMemTableFlushPending(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
*value = (cfd_->imm()->IsFlushPending() ? 1 : 0);
return true;
}
bool InternalStats::HandleNumRunningFlushes(uint64_t* value, DBImpl* db,
Version* /*version*/) {
*value = db->num_running_flushes();
return true;
}
bool InternalStats::HandleCompactionPending(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
// 1 if the system already determines at least one compaction is needed.
// 0 otherwise,
const auto* vstorage = cfd_->current()->storage_info();
*value = (cfd_->compaction_picker()->NeedsCompaction(vstorage) ? 1 : 0);
return true;
}
bool InternalStats::HandleNumRunningCompactions(uint64_t* value, DBImpl* db,
Version* /*version*/) {
*value = db->num_running_compactions_;
return true;
}
bool InternalStats::HandleBackgroundErrors(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
// Accumulated number of errors in background flushes or compactions.
*value = GetBackgroundErrorCount();
return true;
}
bool InternalStats::HandleCurSizeActiveMemTable(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
// Current size of the active memtable
// Using ApproximateMemoryUsageFast to avoid the need for synchronization
*value = cfd_->mem()->ApproximateMemoryUsageFast();
return true;
}
bool InternalStats::HandleCurSizeAllMemTables(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
// Current size of the active memtable + immutable memtables
// Using ApproximateMemoryUsageFast to avoid the need for synchronization
*value = cfd_->mem()->ApproximateMemoryUsageFast() +
cfd_->imm()->ApproximateUnflushedMemTablesMemoryUsage();
return true;
}
bool InternalStats::HandleSizeAllMemTables(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
// Using ApproximateMemoryUsageFast to avoid the need for synchronization
*value = cfd_->mem()->ApproximateMemoryUsageFast() +
cfd_->imm()->ApproximateMemoryUsage();
return true;
}
bool InternalStats::HandleNumEntriesActiveMemTable(uint64_t* value,
DBImpl* /*db*/,
Version* /*version*/) {
// Current number of entires in the active memtable
2024-11-05 00:09:34 +00:00
*value = cfd_->mem()->NumEntries();
return true;
}
bool InternalStats::HandleNumEntriesImmMemTables(uint64_t* value,
DBImpl* /*db*/,
Version* /*version*/) {
// Current number of entries in the immutable memtables
*value = cfd_->imm()->current()->GetTotalNumEntries();
return true;
}
bool InternalStats::HandleNumDeletesActiveMemTable(uint64_t* value,
DBImpl* /*db*/,
Version* /*version*/) {
// Current number of entires in the active memtable
2024-11-05 00:09:34 +00:00
*value = cfd_->mem()->NumDeletion();
return true;
}
bool InternalStats::HandleNumDeletesImmMemTables(uint64_t* value,
DBImpl* /*db*/,
Version* /*version*/) {
// Current number of entries in the immutable memtables
*value = cfd_->imm()->current()->GetTotalNumDeletes();
return true;
}
bool InternalStats::HandleEstimateNumKeys(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
// Estimate number of entries in the column family:
// Use estimated entries in tables + total entries in memtables.
const auto* vstorage = cfd_->current()->storage_info();
2024-11-05 00:09:34 +00:00
uint64_t estimate_keys = cfd_->mem()->NumEntries() +
cfd_->imm()->current()->GetTotalNumEntries() +
vstorage->GetEstimatedActiveKeys();
uint64_t estimate_deletes =
2024-11-05 00:09:34 +00:00
cfd_->mem()->NumDeletion() + cfd_->imm()->current()->GetTotalNumDeletes();
*value = estimate_keys > estimate_deletes * 2
? estimate_keys - (estimate_deletes * 2)
: 0;
return true;
}
bool InternalStats::HandleNumSnapshots(uint64_t* value, DBImpl* db,
Version* /*version*/) {
*value = db->snapshots().count();
return true;
}
bool InternalStats::HandleOldestSnapshotTime(uint64_t* value, DBImpl* db,
Version* /*version*/) {
*value = static_cast<uint64_t>(db->snapshots().GetOldestSnapshotTime());
return true;
}
bool InternalStats::HandleOldestSnapshotSequence(uint64_t* value, DBImpl* db,
Version* /*version*/) {
*value = static_cast<uint64_t>(db->snapshots().GetOldestSnapshotSequence());
return true;
}
bool InternalStats::HandleNumLiveVersions(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
*value = cfd_->GetNumLiveVersions();
return true;
}
bool InternalStats::HandleCurrentSuperVersionNumber(uint64_t* value,
DBImpl* /*db*/,
Version* /*version*/) {
*value = cfd_->GetSuperVersionNumber();
return true;
}
bool InternalStats::HandleIsFileDeletionsEnabled(uint64_t* value, DBImpl* db,
Version* /*version*/) {
First step towards handling MANIFEST write error (#6949) Summary: This PR provides preliminary support for handling IO error during MANIFEST write. File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted. One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach. If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled. Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled. Possible future directions: - Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added. Test plan (dev server): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949 Reviewed By: anand1976 Differential Revision: D22026020 Pulled By: riversand963 fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-25 02:05:47 +00:00
*value = db->IsFileDeletionsEnabled() ? 1 : 0;
return true;
}
bool InternalStats::HandleBaseLevel(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
const auto* vstorage = cfd_->current()->storage_info();
*value = vstorage->base_level();
return true;
}
bool InternalStats::HandleTotalSstFilesSize(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
*value = cfd_->GetTotalSstFilesSize();
return true;
}
bool InternalStats::HandleLiveSstFilesSize(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
*value = cfd_->GetLiveSstFilesSize();
return true;
}
bool InternalStats::HandleObsoleteSstFilesSize(uint64_t* value, DBImpl* db,
Version* /*version*/) {
*value = db->GetObsoleteSstFilesSize();
return true;
}
bool InternalStats::HandleEstimatePendingCompactionBytes(uint64_t* value,
DBImpl* /*db*/,
Version* /*version*/) {
const auto* vstorage = cfd_->current()->storage_info();
*value = vstorage->estimated_compaction_needed_bytes();
return true;
}
bool InternalStats::HandleEstimateTableReadersMem(uint64_t* value,
DBImpl* /*db*/,
Version* version) {
Group SST write in flush, compaction and db open with new stats (#11910) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
// TODO: plumb Env::IOActivity, Env::IOPriority
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
const ReadOptions read_options;
*value = (version == nullptr)
? 0
: version->GetMemoryUsageByTableReaders(read_options);
return true;
}
bool InternalStats::HandleEstimateLiveDataSize(uint64_t* value, DBImpl* /*db*/,
Fix VersionStorageInfo::EstimateLiveDataSize seg fault Summary: `HandleEstimateLiveDataSize`'s `need_out_of_mutex` is true https://github.com/facebook/rocksdb/blob/402b7aa07f0e6da4c1f0216ff2b2e50fd2e5eaac/db/internal_stats.cc#L412-L413 so , is will ref a `SuperVersion` https://github.com/facebook/rocksdb/blob/402b7aa07f0e6da4c1f0216ff2b2e50fd2e5eaac/db/db_impl.cc#L1896-L1908 so , the param `version` of `InternalStats::HandleEstimateLiveDataSize` is safe , but `cfd_->current()` is not safe ! https://github.com/facebook/rocksdb/blob/402b7aa07f0e6da4c1f0216ff2b2e50fd2e5eaac/db/internal_stats.cc#L790-L795 the `cfd_->current()` maybe invalid ... here's mongo-rocks crash backtrace ``` mongod(mongo::printStackTrace(std::basic_ostream<char, std::char_traits<char> >&)+0x41) [0x7fe3a3137c51] mongod(+0x2152E89) [0x7fe3a3136e89] mongod(+0x21534F6) [0x7fe3a31374f6] libpthread.so.0(+0xF5E0) [0x7fe39f5e45e0] mongod(rocksdb::InternalKeyComparator::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const+0x17) [0x7fe3a22375a7] mongod(rocksdb::VersionStorageInfo::EstimateLiveDataSize() const+0x3AA) [0x7fe3a228daba] mongod(rocksdb::InternalStats::HandleEstimateLiveDataSize(unsigned long*, rocksdb::DBImpl*, rocksdb::Version*)+0x20) [0x7fe3a2250d70] mongod(rocksdb::DBImpl::GetIntPropertyInternal(rocksdb::ColumnFamilyData*, rocksdb::DBPropertyInfo const&, bool, unsigned long*)+0xEF) [0x7fe3a21e3dbf] ``` Closes https://github.com/facebook/rocksdb/pull/3912 Differential Revision: D8179944 Pulled By: yiwu-arbug fbshipit-source-id: 26f314a8f98f4c2dc4348745d759f26f0e8d95e1
2018-05-28 18:13:17 +00:00
Version* version) {
const auto* vstorage = version->storage_info();
*value = vstorage->EstimateLiveDataSize();
return true;
}
bool InternalStats::HandleMinLogNumberToKeep(uint64_t* value, DBImpl* db,
Version* /*version*/) {
*value = db->MinLogNumberToKeep();
return true;
}
bool InternalStats::HandleMinObsoleteSstNumberToKeep(uint64_t* value,
DBImpl* db,
Version* /*version*/) {
*value = db->MinObsoleteSstNumberToKeep();
return true;
}
bool InternalStats::HandleActualDelayedWriteRate(uint64_t* value, DBImpl* db,
Version* /*version*/) {
const WriteController& wc = db->write_controller();
if (!wc.NeedsDelay()) {
*value = 0;
} else {
*value = wc.delayed_write_rate();
}
return true;
}
bool InternalStats::HandleIsWriteStopped(uint64_t* value, DBImpl* db,
Version* /*version*/) {
*value = db->write_controller().IsStopped() ? 1 : 0;
return true;
}
bool InternalStats::HandleEstimateOldestKeyTime(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
// TODO(yiwu): The property is currently available for fifo compaction
// with allow_compaction = false. This is because we don't propagate
// oldest_key_time on compaction.
if (cfd_->ioptions()->compaction_style != kCompactionStyleFIFO ||
cfd_->GetCurrentMutableCFOptions()
->compaction_options_fifo.allow_compaction) {
return false;
}
Group SST write in flush, compaction and db open with new stats (#11910) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
// TODO: plumb Env::IOActivity, Env::IOPriority
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
const ReadOptions read_options;
TablePropertiesCollection collection;
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 16:07:18 +00:00
auto s = cfd_->current()->GetPropertiesOfAllTables(read_options, &collection);
if (!s.ok()) {
return false;
}
*value = std::numeric_limits<uint64_t>::max();
for (auto& p : collection) {
*value = std::min(*value, p.second->oldest_key_time);
if (*value == 0) {
break;
}
}
if (*value > 0) {
*value = std::min({cfd_->mem()->ApproximateOldestKeyTime(),
cfd_->imm()->ApproximateOldestKeyTime(), *value});
}
return *value > 0 && *value < std::numeric_limits<uint64_t>::max();
}
Cache* InternalStats::GetBlockCacheForStats() {
Refactor `table_factory` into MutableCFOptions (#13077) Summary: This is setting up for a fix to a data race in SetOptions on BlockBasedTableOptions (BBTO), https://github.com/facebook/rocksdb/issues/10079 The race will be fixed by replacing `table_factory` with a modified copy whenever we want to modify a BBTO field. An argument could be made that this change creates more entaglement between features (e.g. BlobSource <-> MutableCFOptions), rather than (conceptually) minimizing the dependencies of each feature, but * Most of these things already depended on ImmutableOptions * Historically there has been a lot of plumbing (and possible small CPU overhead) involved in adding features that need to reach a lot of places, like `block_protection_bytes_per_key`. Keeping those wrapped up in options simplifies that. * SuperVersion management generally takes care of lifetime management of MutableCFOptions, so is not that difficult. (Crash test agrees so far.) There are some FIXME places where it is known to be unsafe to replace `block_cache` unless/until we handle shared_ptr tracking properly. HOWEVER, replacing `block_cache` is generally dubious, at least while existing users of the old block cache (e.g. table readers) can continue indefinitely. The change to cf_options.cc is essentially just moving code (not changing). I'm not concerned about the performance of copying another shared_ptr with MutableCFOptions, but I left a note about considering an improvement if more shared_ptr are added to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13077 Test Plan: existing tests, crash test. Unit test DBOptionsTest.GetLatestCFOptions updated with some temporary logic. MemoryTest required some refactoring (simplification) for the change. Reviewed By: cbi42 Differential Revision: D64546903 Pulled By: pdillinger fbshipit-source-id: 69ae97ce5cf4c01b58edc4c5d4687eb1e5bf5855
2024-10-17 21:13:20 +00:00
// NOTE: called in startup before GetCurrentMutableCFOptions() is ready
auto* table_factory = cfd_->GetLatestMutableCFOptions()->table_factory.get();
assert(table_factory != nullptr);
Refactor `table_factory` into MutableCFOptions (#13077) Summary: This is setting up for a fix to a data race in SetOptions on BlockBasedTableOptions (BBTO), https://github.com/facebook/rocksdb/issues/10079 The race will be fixed by replacing `table_factory` with a modified copy whenever we want to modify a BBTO field. An argument could be made that this change creates more entaglement between features (e.g. BlobSource <-> MutableCFOptions), rather than (conceptually) minimizing the dependencies of each feature, but * Most of these things already depended on ImmutableOptions * Historically there has been a lot of plumbing (and possible small CPU overhead) involved in adding features that need to reach a lot of places, like `block_protection_bytes_per_key`. Keeping those wrapped up in options simplifies that. * SuperVersion management generally takes care of lifetime management of MutableCFOptions, so is not that difficult. (Crash test agrees so far.) There are some FIXME places where it is known to be unsafe to replace `block_cache` unless/until we handle shared_ptr tracking properly. HOWEVER, replacing `block_cache` is generally dubious, at least while existing users of the old block cache (e.g. table readers) can continue indefinitely. The change to cf_options.cc is essentially just moving code (not changing). I'm not concerned about the performance of copying another shared_ptr with MutableCFOptions, but I left a note about considering an improvement if more shared_ptr are added to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13077 Test Plan: existing tests, crash test. Unit test DBOptionsTest.GetLatestCFOptions updated with some temporary logic. MemoryTest required some refactoring (simplification) for the change. Reviewed By: cbi42 Differential Revision: D64546903 Pulled By: pdillinger fbshipit-source-id: 69ae97ce5cf4c01b58edc4c5d4687eb1e5bf5855
2024-10-17 21:13:20 +00:00
// FIXME: need to a shared_ptr if/when block_cache is going to be mutable
return table_factory->GetOptions<Cache>(TableFactory::kBlockCacheOpts());
}
bool InternalStats::HandleBlockCacheCapacity(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
Cache* block_cache = GetBlockCacheForStats();
if (block_cache) {
*value = static_cast<uint64_t>(block_cache->GetCapacity());
return true;
}
return false;
}
bool InternalStats::HandleBlockCacheUsage(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
Cache* block_cache = GetBlockCacheForStats();
if (block_cache) {
*value = static_cast<uint64_t>(block_cache->GetUsage());
return true;
}
return false;
}
bool InternalStats::HandleBlockCachePinnedUsage(uint64_t* value, DBImpl* /*db*/,
Version* /*version*/) {
Cache* block_cache = GetBlockCacheForStats();
if (block_cache) {
*value = static_cast<uint64_t>(block_cache->GetPinnedUsage());
return true;
}
return false;
}
void InternalStats::DumpDBMapStats(
std::map<std::string, std::string>* db_stats) {
for (int i = 0; i < static_cast<int>(kIntStatsNumMax); ++i) {
InternalDBStatsType type = static_cast<InternalDBStatsType>(i);
(*db_stats)[db_stats_type_to_info.at(type).property_name] =
std::to_string(GetDBStats(type));
}
double seconds_up = (clock_->NowMicros() - started_at_) / kMicrosInSec;
(*db_stats)["db.uptime"] = std::to_string(seconds_up);
}
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
void InternalStats::DumpDBStats(std::string* value) {
char buf[1000];
// DB-level stats, only available from default column family
double seconds_up = (clock_->NowMicros() - started_at_) / kMicrosInSec;
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
double interval_seconds_up = seconds_up - db_stats_snapshot_.seconds_up;
snprintf(buf, sizeof(buf),
"\n** DB Stats **\nUptime(secs): %.1f total, %.1f interval\n",
seconds_up, interval_seconds_up);
value->append(buf);
// Cumulative
uint64_t user_bytes_written =
GetDBStats(InternalStats::kIntStatsBytesWritten);
uint64_t num_keys_written =
GetDBStats(InternalStats::kIntStatsNumKeysWritten);
uint64_t write_other = GetDBStats(InternalStats::kIntStatsWriteDoneByOther);
uint64_t write_self = GetDBStats(InternalStats::kIntStatsWriteDoneBySelf);
uint64_t wal_bytes = GetDBStats(InternalStats::kIntStatsWalFileBytes);
uint64_t wal_synced = GetDBStats(InternalStats::kIntStatsWalFileSynced);
uint64_t write_with_wal = GetDBStats(InternalStats::kIntStatsWriteWithWal);
uint64_t write_stall_micros =
GetDBStats(InternalStats::kIntStatsWriteStallMicros);
const int kHumanMicrosLen = 32;
char human_micros[kHumanMicrosLen];
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
// Data
// writes: total number of write requests.
// keys: total number of key updates issued by all the write requests
// commit groups: number of group commits issued to the DB. Each group can
// contain one or more writes.
// so writes/keys is the average number of put in multi-put or put
// writes/groups is the average group commit size.
//
// The format is the same for interval stats.
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
snprintf(buf, sizeof(buf),
"Cumulative writes: %s writes, %s keys, %s commit groups, "
"%.1f writes per commit group, ingest: %.2f GB, %.2f MB/s\n",
Make the benchmark scripts configurable and add tests Summary: This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests, makes the duration per-test configurable and refactors the test scripts. Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except the writer thread does Merge rather than Put. Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format. Sometimes automation and humans want different format. Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary stats in the output at test end. Adds the average ingest rate to compaction IO stats. Output now looks like: https://gist.github.com/mdcallag/2bd64d18be1b93adc494 More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1 For benchmark.sh changes default RocksDB configuration to reduce stalls: * min_level_to_compress from 2 to 3 * hard_rate_limit from 2 to 3 * max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8 * L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop) Task ID: #6596829 Blame Rev: Test Plan: run tools/run_flash_bench.sh Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin *PUBLIC* platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 18:28:25 +00:00
NumberToHumanString(write_other + write_self).c_str(),
NumberToHumanString(num_keys_written).c_str(),
NumberToHumanString(write_self).c_str(),
(write_other + write_self) /
std::max(1.0, static_cast<double>(write_self)),
user_bytes_written / kGB,
user_bytes_written / kMB / std::max(seconds_up, 0.001));
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
value->append(buf);
// WAL
snprintf(buf, sizeof(buf),
Make the benchmark scripts configurable and add tests Summary: This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests, makes the duration per-test configurable and refactors the test scripts. Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except the writer thread does Merge rather than Put. Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format. Sometimes automation and humans want different format. Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary stats in the output at test end. Adds the average ingest rate to compaction IO stats. Output now looks like: https://gist.github.com/mdcallag/2bd64d18be1b93adc494 More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1 For benchmark.sh changes default RocksDB configuration to reduce stalls: * min_level_to_compress from 2 to 3 * hard_rate_limit from 2 to 3 * max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8 * L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop) Task ID: #6596829 Blame Rev: Test Plan: run tools/run_flash_bench.sh Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin *PUBLIC* platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 18:28:25 +00:00
"Cumulative WAL: %s writes, %s syncs, "
"%.2f writes per sync, written: %.2f GB, %.2f MB/s\n",
NumberToHumanString(write_with_wal).c_str(),
NumberToHumanString(wal_synced).c_str(),
write_with_wal / std::max(1.0, static_cast<double>(wal_synced)),
wal_bytes / kGB, wal_bytes / kMB / std::max(seconds_up, 0.001));
Make the benchmark scripts configurable and add tests Summary: This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests, makes the duration per-test configurable and refactors the test scripts. Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except the writer thread does Merge rather than Put. Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format. Sometimes automation and humans want different format. Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary stats in the output at test end. Adds the average ingest rate to compaction IO stats. Output now looks like: https://gist.github.com/mdcallag/2bd64d18be1b93adc494 More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1 For benchmark.sh changes default RocksDB configuration to reduce stalls: * min_level_to_compress from 2 to 3 * hard_rate_limit from 2 to 3 * max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8 * L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop) Task ID: #6596829 Blame Rev: Test Plan: run tools/run_flash_bench.sh Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin *PUBLIC* platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 18:28:25 +00:00
value->append(buf);
// Stall
AppendHumanMicros(write_stall_micros, human_micros, kHumanMicrosLen, true);
snprintf(buf, sizeof(buf), "Cumulative stall: %s, %.1f percent\n",
Make the benchmark scripts configurable and add tests Summary: This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests, makes the duration per-test configurable and refactors the test scripts. Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except the writer thread does Merge rather than Put. Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format. Sometimes automation and humans want different format. Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary stats in the output at test end. Adds the average ingest rate to compaction IO stats. Output now looks like: https://gist.github.com/mdcallag/2bd64d18be1b93adc494 More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1 For benchmark.sh changes default RocksDB configuration to reduce stalls: * min_level_to_compress from 2 to 3 * hard_rate_limit from 2 to 3 * max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8 * L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop) Task ID: #6596829 Blame Rev: Test Plan: run tools/run_flash_bench.sh Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin *PUBLIC* platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 18:28:25 +00:00
human_micros,
// 10000 = divide by 1M to get secs, then multiply by 100 for pct
write_stall_micros / 10000.0 / std::max(seconds_up, 0.001));
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
value->append(buf);
// Interval
uint64_t interval_write_other = write_other - db_stats_snapshot_.write_other;
uint64_t interval_write_self = write_self - db_stats_snapshot_.write_self;
uint64_t interval_num_keys_written =
num_keys_written - db_stats_snapshot_.num_keys_written;
snprintf(
buf, sizeof(buf),
"Interval writes: %s writes, %s keys, %s commit groups, "
"%.1f writes per commit group, ingest: %.2f MB, %.2f MB/s\n",
NumberToHumanString(interval_write_other + interval_write_self).c_str(),
NumberToHumanString(interval_num_keys_written).c_str(),
NumberToHumanString(interval_write_self).c_str(),
static_cast<double>(interval_write_other + interval_write_self) /
std::max(1.0, static_cast<double>(interval_write_self)),
(user_bytes_written - db_stats_snapshot_.ingest_bytes) / kMB,
(user_bytes_written - db_stats_snapshot_.ingest_bytes) / kMB /
std::max(interval_seconds_up, 0.001)),
value->append(buf);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
uint64_t interval_write_with_wal =
write_with_wal - db_stats_snapshot_.write_with_wal;
uint64_t interval_wal_synced = wal_synced - db_stats_snapshot_.wal_synced;
uint64_t interval_wal_bytes = wal_bytes - db_stats_snapshot_.wal_bytes;
snprintf(buf, sizeof(buf),
"Interval WAL: %s writes, %s syncs, "
"%.2f writes per sync, written: %.2f GB, %.2f MB/s\n",
NumberToHumanString(interval_write_with_wal).c_str(),
NumberToHumanString(interval_wal_synced).c_str(),
interval_write_with_wal /
std::max(1.0, static_cast<double>(interval_wal_synced)),
interval_wal_bytes / kGB,
interval_wal_bytes / kMB / std::max(interval_seconds_up, 0.001));
Make the benchmark scripts configurable and add tests Summary: This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests, makes the duration per-test configurable and refactors the test scripts. Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except the writer thread does Merge rather than Put. Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format. Sometimes automation and humans want different format. Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary stats in the output at test end. Adds the average ingest rate to compaction IO stats. Output now looks like: https://gist.github.com/mdcallag/2bd64d18be1b93adc494 More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1 For benchmark.sh changes default RocksDB configuration to reduce stalls: * min_level_to_compress from 2 to 3 * hard_rate_limit from 2 to 3 * max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8 * L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop) Task ID: #6596829 Blame Rev: Test Plan: run tools/run_flash_bench.sh Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin *PUBLIC* platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 18:28:25 +00:00
value->append(buf);
// Stall
AppendHumanMicros(write_stall_micros - db_stats_snapshot_.write_stall_micros,
human_micros, kHumanMicrosLen, true);
snprintf(buf, sizeof(buf), "Interval stall: %s, %.1f percent\n", human_micros,
Make the benchmark scripts configurable and add tests Summary: This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests, makes the duration per-test configurable and refactors the test scripts. Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except the writer thread does Merge rather than Put. Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format. Sometimes automation and humans want different format. Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary stats in the output at test end. Adds the average ingest rate to compaction IO stats. Output now looks like: https://gist.github.com/mdcallag/2bd64d18be1b93adc494 More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1 For benchmark.sh changes default RocksDB configuration to reduce stalls: * min_level_to_compress from 2 to 3 * hard_rate_limit from 2 to 3 * max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8 * L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop) Task ID: #6596829 Blame Rev: Test Plan: run tools/run_flash_bench.sh Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin *PUBLIC* platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 18:28:25 +00:00
// 10000 = divide by 1M to get secs, then multiply by 100 for pct
(write_stall_micros - db_stats_snapshot_.write_stall_micros) /
10000.0 / std::max(interval_seconds_up, 0.001));
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
value->append(buf);
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
std::string write_stall_stats;
DumpDBStatsWriteStall(&write_stall_stats);
value->append(write_stall_stats);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
db_stats_snapshot_.seconds_up = seconds_up;
db_stats_snapshot_.ingest_bytes = user_bytes_written;
db_stats_snapshot_.write_other = write_other;
db_stats_snapshot_.write_self = write_self;
db_stats_snapshot_.num_keys_written = num_keys_written;
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
db_stats_snapshot_.wal_bytes = wal_bytes;
db_stats_snapshot_.wal_synced = wal_synced;
db_stats_snapshot_.write_with_wal = write_with_wal;
db_stats_snapshot_.write_stall_micros = write_stall_micros;
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
}
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
void InternalStats::DumpDBMapStatsWriteStall(
std::map<std::string, std::string>* value) {
constexpr uint32_t max_db_scope_write_stall_cause =
static_cast<uint32_t>(WriteStallCause::kDBScopeWriteStallCauseEnumMax);
for (uint32_t i =
max_db_scope_write_stall_cause - kNumDBScopeWriteStallCauses;
i < max_db_scope_write_stall_cause; ++i) {
for (uint32_t j = 0;
j < static_cast<uint32_t>(WriteStallCondition::kNormal); ++j) {
WriteStallCause cause = static_cast<WriteStallCause>(i);
WriteStallCondition condition = static_cast<WriteStallCondition>(j);
InternalStats::InternalDBStatsType internal_db_stat =
InternalDBStat(cause, condition);
if (internal_db_stat == InternalStats::kIntStatsNumMax) {
continue;
}
std::string name =
WriteStallStatsMapKeys::CauseConditionCount(cause, condition);
uint64_t stat =
db_stats_[static_cast<std::size_t>(internal_db_stat)].load(
std::memory_order_relaxed);
(*value)[name] = std::to_string(stat);
}
}
}
void InternalStats::DumpDBStatsWriteStall(std::string* value) {
assert(value);
std::map<std::string, std::string> write_stall_stats_map;
DumpDBMapStatsWriteStall(&write_stall_stats_map);
std::ostringstream str;
str << "Write Stall (count): ";
Fix write stall stats dump format (#11445) Summary: I noticed in https://github.com/facebook/rocksdb/issues/11426 there is a missing line break. Before this PR the output looked like ``` $ ./db_bench -stats_per_interval=1 -stats_interval=100000 ... Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0, Block cache LRUCache@0x7f8695831b50#2766536 capacity: 32.00 MB seed: 1155354975 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 0 last_secs: 9.3e-05 secs_since: 2 ... Write Stall (count): write-buffer-manager-limit-stops: 0, num-running-compactions: 0 ... ``` After this PR it looks like ``` $ ./db_bench -stats_per_interval=1 -stats_interval=100000 ... Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0 Block cache LRUCache@0x7f8e0d231b50#2736585 capacity: 32.00 MB seed: 920433955 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 1 last_secs: 6.5e-05 secs_since: 4 ... Write Stall (count): write-buffer-manager-limit-stops: 0 num-running-compactions: 0 ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11445 Reviewed By: hx235 Differential Revision: D45844752 Pulled By: ajkr fbshipit-source-id: 1c708cb05b6e270922ac2fa95f5d011f273347eb
2023-05-15 18:47:17 +00:00
for (auto write_stall_stats_map_iter = write_stall_stats_map.begin();
write_stall_stats_map_iter != write_stall_stats_map.end();
write_stall_stats_map_iter++) {
const auto& name_and_stat = *write_stall_stats_map_iter;
str << name_and_stat.first << ": " << name_and_stat.second;
if (std::next(write_stall_stats_map_iter) == write_stall_stats_map.end()) {
str << "\n";
} else {
str << ", ";
}
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
}
*value = str.str();
}
/**
* Dump Compaction Level stats to a map of stat name with "compaction." prefix
* to value in double as string. The level in stat name is represented with
* a prefix "Lx" where "x" is the level number. A special level "Sum"
* represents the sum of a stat for all levels.
* The result also contains IO stall counters which keys start with "io_stalls."
* and values represent uint64 encoded as strings.
*/
void InternalStats::DumpCFMapStats(
std::map<std::string, std::string>* cf_stats) {
const VersionStorageInfo* vstorage = cfd_->current()->storage_info();
CompactionStats compaction_stats_sum;
std::map<int, std::map<LevelStatType, double>> levels_stats;
DumpCFMapStats(vstorage, &levels_stats, &compaction_stats_sum);
for (auto const& level_ent : levels_stats) {
auto level_str =
level_ent.first == -1 ? "Sum" : "L" + std::to_string(level_ent.first);
for (auto const& stat_ent : level_ent.second) {
auto stat_type = stat_ent.first;
auto key_str =
"compaction." + level_str + "." +
InternalStats::compaction_level_stats.at(stat_type).property_name;
(*cf_stats)[key_str] = std::to_string(stat_ent.second);
}
}
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
DumpCFMapStatsWriteStall(cf_stats);
}
void InternalStats::DumpCFMapStats(
const VersionStorageInfo* vstorage,
std::map<int, std::map<LevelStatType, double>>* levels_stats,
CompactionStats* compaction_stats_sum) {
assert(vstorage);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
int num_levels_to_check =
Log correct compaction score for Universal Compaction (#11487) Summary: currently 0 is incorrectly logged as the compaction score for L0 when num_levels > 1. This PR fixes the issue to log the correct score. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11487 Test Plan: ``` ./db_bench --benchmarks=fillrandom --max_background_jobs=8 --num=1000000 --compaction_style=1 --stats_dump_period_sec=20 --num_levels=7 --write_buffer_size=1048576 grep "L0 " /tmp/rocksdbtest-543376/dbbench/LOG before: ** Compaction Stats [default] ** Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 9.9 0.42 0.33 9 0.046 0 0 0.0 0.0 L0 3/1 1.37 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.1 0.6 9.6 3.76 3.03 76 0.050 34K 140 0.0 0.0 L0 2/0 2.26 MB 0.0 0.0 0.0 0.0 0.1 0.1 0.0 1.6 3.2 8.2 12.59 11.17 163 0.077 619K 5499 0.0 0.0 after: compaction scores are non-zero L0 0/0 0.00 KB 0.8 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 9.6 0.43 0.34 9 0.048 0 0 0.0 0.0 L0 2/1 937.08 KB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.1 0.6 9.3 3.85 3.07 75 0.051 34K 165 0.0 0.0 L0 2/2 1.82 MB 1.0 0.0 0.0 0.0 0.1 0.1 0.0 1.6 3.0 8.0 12.45 10.99 160 0.078 577K 5399 0.0 0.0 ``` Reviewed By: ajkr Differential Revision: D46293993 Pulled By: cbi42 fbshipit-source-id: 19753f7df68c5f54a84c4ed52794f83e510c9721
2023-06-01 22:36:19 +00:00
(cfd_->ioptions()->compaction_style == kCompactionStyleLevel)
? vstorage->num_levels() - 1
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
: 1;
// Compaction scores are sorted based on its value. Restore them to the
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
// level order
std::vector<double> compaction_score(number_levels_, 0);
for (int i = 0; i < num_levels_to_check; ++i) {
compaction_score[vstorage->CompactionScoreLevel(i)] =
vstorage->CompactionScore(i);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
}
// Count # of files being compacted for each level
std::vector<int> files_being_compacted(number_levels_, 0);
for (int level = 0; level < number_levels_; ++level) {
for (auto* f : vstorage->LevelFiles(level)) {
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
if (f->being_compacted) {
++files_being_compacted[level];
}
}
}
int total_files = 0;
int total_files_being_compacted = 0;
double total_file_size = 0;
uint64_t flush_ingest = cf_stats_value_[BYTES_FLUSHED];
uint64_t add_file_ingest = cf_stats_value_[BYTES_INGESTED_ADD_FILE];
uint64_t curr_ingest = flush_ingest + add_file_ingest;
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
for (int level = 0; level < number_levels_; level++) {
int files = vstorage->NumLevelFiles(level);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
total_files += files;
total_files_being_compacted += files_being_compacted[level];
if (comp_stats_[level].micros > 0 || comp_stats_[level].cpu_micros > 0 ||
files > 0) {
compaction_stats_sum->Add(comp_stats_[level]);
total_file_size += vstorage->NumLevelBytes(level);
uint64_t input_bytes;
if (level == 0) {
input_bytes = curr_ingest;
} else {
input_bytes = comp_stats_[level].bytes_read_non_output_levels +
comp_stats_[level].bytes_read_blob;
}
double w_amp =
(input_bytes == 0)
? 0.0
: static_cast<double>(comp_stats_[level].bytes_written +
comp_stats_[level].bytes_written_blob) /
input_bytes;
std::map<LevelStatType, double> level_stats;
PrepareLevelStats(&level_stats, files, files_being_compacted[level],
static_cast<double>(vstorage->NumLevelBytes(level)),
compaction_score[level], w_amp, comp_stats_[level]);
(*levels_stats)[level] = level_stats;
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
}
}
// Cumulative summary
double w_amp = (0 == curr_ingest)
? 0.0
: (compaction_stats_sum->bytes_written +
compaction_stats_sum->bytes_written_blob) /
static_cast<double>(curr_ingest);
// Stats summary across levels
std::map<LevelStatType, double> sum_stats;
PrepareLevelStats(&sum_stats, total_files, total_files_being_compacted,
total_file_size, 0, w_amp, *compaction_stats_sum);
(*levels_stats)[-1] = sum_stats; // -1 is for the Sum level
}
void InternalStats::DumpCFMapStatsByPriority(
std::map<int, std::map<LevelStatType, double>>* priorities_stats) {
for (size_t priority = 0; priority < comp_stats_by_pri_.size(); priority++) {
if (comp_stats_by_pri_[priority].micros > 0) {
std::map<LevelStatType, double> priority_stats;
PrepareLevelStats(&priority_stats, 0 /* num_files */,
0 /* being_compacted */, 0 /* total_file_size */,
0 /* compaction_score */, 0 /* w_amp */,
comp_stats_by_pri_[priority]);
(*priorities_stats)[static_cast<int>(priority)] = priority_stats;
}
}
}
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
void InternalStats::DumpCFMapStatsWriteStall(
std::map<std::string, std::string>* value) {
uint64_t total_delays = 0;
uint64_t total_stops = 0;
constexpr uint32_t max_cf_scope_write_stall_cause =
static_cast<uint32_t>(WriteStallCause::kCFScopeWriteStallCauseEnumMax);
for (uint32_t i =
max_cf_scope_write_stall_cause - kNumCFScopeWriteStallCauses;
i < max_cf_scope_write_stall_cause; ++i) {
for (uint32_t j = 0;
j < static_cast<uint32_t>(WriteStallCondition::kNormal); ++j) {
WriteStallCause cause = static_cast<WriteStallCause>(i);
WriteStallCondition condition = static_cast<WriteStallCondition>(j);
InternalStats::InternalCFStatsType internal_cf_stat =
InternalCFStat(cause, condition);
if (internal_cf_stat == InternalStats::INTERNAL_CF_STATS_ENUM_MAX) {
continue;
}
std::string name =
WriteStallStatsMapKeys::CauseConditionCount(cause, condition);
uint64_t stat =
cf_stats_count_[static_cast<std::size_t>(internal_cf_stat)];
(*value)[name] = std::to_string(stat);
if (condition == WriteStallCondition::kDelayed) {
total_delays += stat;
} else if (condition == WriteStallCondition::kStopped) {
total_stops += stat;
}
}
}
(*value)[WriteStallStatsMapKeys::
CFL0FileCountLimitDelaysWithOngoingCompaction()] =
std::to_string(
cf_stats_count_[L0_FILE_COUNT_LIMIT_DELAYS_WITH_ONGOING_COMPACTION]);
(*value)[WriteStallStatsMapKeys::
CFL0FileCountLimitStopsWithOngoingCompaction()] =
std::to_string(
cf_stats_count_[L0_FILE_COUNT_LIMIT_STOPS_WITH_ONGOING_COMPACTION]);
(*value)[WriteStallStatsMapKeys::TotalStops()] = std::to_string(total_stops);
(*value)[WriteStallStatsMapKeys::TotalDelays()] =
std::to_string(total_delays);
}
void InternalStats::DumpCFStatsWriteStall(std::string* value,
uint64_t* total_stall_count) {
assert(value);
std::map<std::string, std::string> write_stall_stats_map;
DumpCFMapStatsWriteStall(&write_stall_stats_map);
std::ostringstream str;
str << "Write Stall (count): ";
Fix write stall stats dump format (#11445) Summary: I noticed in https://github.com/facebook/rocksdb/issues/11426 there is a missing line break. Before this PR the output looked like ``` $ ./db_bench -stats_per_interval=1 -stats_interval=100000 ... Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0, Block cache LRUCache@0x7f8695831b50#2766536 capacity: 32.00 MB seed: 1155354975 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 0 last_secs: 9.3e-05 secs_since: 2 ... Write Stall (count): write-buffer-manager-limit-stops: 0, num-running-compactions: 0 ... ``` After this PR it looks like ``` $ ./db_bench -stats_per_interval=1 -stats_interval=100000 ... Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0 Block cache LRUCache@0x7f8e0d231b50#2736585 capacity: 32.00 MB seed: 920433955 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 1 last_secs: 6.5e-05 secs_since: 4 ... Write Stall (count): write-buffer-manager-limit-stops: 0 num-running-compactions: 0 ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11445 Reviewed By: hx235 Differential Revision: D45844752 Pulled By: ajkr fbshipit-source-id: 1c708cb05b6e270922ac2fa95f5d011f273347eb
2023-05-15 18:47:17 +00:00
for (auto write_stall_stats_map_iter = write_stall_stats_map.begin();
write_stall_stats_map_iter != write_stall_stats_map.end();
write_stall_stats_map_iter++) {
const auto& name_and_stat = *write_stall_stats_map_iter;
str << name_and_stat.first << ": " << name_and_stat.second;
if (std::next(write_stall_stats_map_iter) == write_stall_stats_map.end()) {
str << "\n";
} else {
str << ", ";
}
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
}
if (total_stall_count) {
*total_stall_count =
ParseUint64(
write_stall_stats_map[WriteStallStatsMapKeys::TotalStops()]) +
ParseUint64(
write_stall_stats_map[WriteStallStatsMapKeys::TotalDelays()]);
if (*total_stall_count > 0) {
str << "interval: " << *total_stall_count - cf_stats_snapshot_.stall_count
<< " total count\n";
}
}
*value = str.str();
}
void InternalStats::DumpCFStats(std::string* value) {
DumpCFStatsNoFileHistogram(/*is_periodic=*/false, value);
DumpCFFileHistogram(value);
}
void InternalStats::DumpCFStatsNoFileHistogram(bool is_periodic,
std::string* value) {
char buf[2000];
// Per-ColumnFamily stats
PrintLevelStatsHeader(buf, sizeof(buf), cfd_->GetName(), "Level");
value->append(buf);
// Print stats for each level
const VersionStorageInfo* vstorage = cfd_->current()->storage_info();
std::map<int, std::map<LevelStatType, double>> levels_stats;
CompactionStats compaction_stats_sum;
DumpCFMapStats(vstorage, &levels_stats, &compaction_stats_sum);
for (int l = 0; l < number_levels_; ++l) {
if (levels_stats.find(l) != levels_stats.end()) {
PrintLevelStats(buf, sizeof(buf), "L" + std::to_string(l),
levels_stats[l]);
value->append(buf);
}
}
// Print sum of level stats
PrintLevelStats(buf, sizeof(buf), "Sum", levels_stats[-1]);
value->append(buf);
uint64_t flush_ingest = cf_stats_value_[BYTES_FLUSHED];
uint64_t add_file_ingest = cf_stats_value_[BYTES_INGESTED_ADD_FILE];
uint64_t ingest_files_addfile = cf_stats_value_[INGESTED_NUM_FILES_TOTAL];
uint64_t ingest_l0_files_addfile =
cf_stats_value_[INGESTED_LEVEL0_NUM_FILES_TOTAL];
uint64_t ingest_keys_addfile = cf_stats_value_[INGESTED_NUM_KEYS_TOTAL];
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
// Interval summary
uint64_t interval_flush_ingest =
flush_ingest - cf_stats_snapshot_.ingest_bytes_flush;
uint64_t interval_add_file_inget =
add_file_ingest - cf_stats_snapshot_.ingest_bytes_addfile;
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
uint64_t interval_ingest =
interval_flush_ingest + interval_add_file_inget + 1;
CompactionStats interval_stats(compaction_stats_sum);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
interval_stats.Subtract(cf_stats_snapshot_.comp_stats);
double w_amp =
(interval_stats.bytes_written + interval_stats.bytes_written_blob) /
static_cast<double>(interval_ingest);
PrintLevelStats(buf, sizeof(buf), "Int", 0, 0, 0, 0, w_amp, interval_stats);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
value->append(buf);
PrintLevelStatsHeader(buf, sizeof(buf), cfd_->GetName(), "Priority");
value->append(buf);
std::map<int, std::map<LevelStatType, double>> priorities_stats;
DumpCFMapStatsByPriority(&priorities_stats);
for (size_t priority = 0; priority < comp_stats_by_pri_.size(); ++priority) {
if (priorities_stats.find(static_cast<int>(priority)) !=
priorities_stats.end()) {
PrintLevelStats(
buf, sizeof(buf),
Env::PriorityToString(static_cast<Env::Priority>(priority)),
priorities_stats[static_cast<int>(priority)]);
value->append(buf);
}
}
const auto blob_st = vstorage->GetBlobStats();
snprintf(buf, sizeof(buf),
"\nBlob file count: %" ROCKSDB_PRIszt
", total size: %.1f GB, garbage size: %.1f GB, space amp: %.1f\n\n",
vstorage->GetBlobFiles().size(), blob_st.total_file_size / kGB,
blob_st.total_garbage_size / kGB, blob_st.space_amp);
value->append(buf);
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
uint64_t now_micros = clock_->NowMicros();
double seconds_up = (now_micros - started_at_) / kMicrosInSec;
double interval_seconds_up = seconds_up - cf_stats_snapshot_.seconds_up;
snprintf(buf, sizeof(buf), "Uptime(secs): %.1f total, %.1f interval\n",
seconds_up, interval_seconds_up);
value->append(buf);
snprintf(buf, sizeof(buf), "Flush(GB): cumulative %.3f, interval %.3f\n",
flush_ingest / kGB, interval_flush_ingest / kGB);
value->append(buf);
snprintf(buf, sizeof(buf), "AddFile(GB): cumulative %.3f, interval %.3f\n",
add_file_ingest / kGB, interval_add_file_inget / kGB);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
value->append(buf);
uint64_t interval_ingest_files_addfile =
ingest_files_addfile - cf_stats_snapshot_.ingest_files_addfile;
snprintf(buf, sizeof(buf),
"AddFile(Total Files): cumulative %" PRIu64 ", interval %" PRIu64
"\n",
ingest_files_addfile, interval_ingest_files_addfile);
value->append(buf);
uint64_t interval_ingest_l0_files_addfile =
ingest_l0_files_addfile - cf_stats_snapshot_.ingest_l0_files_addfile;
snprintf(buf, sizeof(buf),
2016-09-22 03:22:09 +00:00
"AddFile(L0 Files): cumulative %" PRIu64 ", interval %" PRIu64 "\n",
ingest_l0_files_addfile, interval_ingest_l0_files_addfile);
value->append(buf);
uint64_t interval_ingest_keys_addfile =
ingest_keys_addfile - cf_stats_snapshot_.ingest_keys_addfile;
2016-09-22 03:22:09 +00:00
snprintf(buf, sizeof(buf),
"AddFile(Keys): cumulative %" PRIu64 ", interval %" PRIu64 "\n",
ingest_keys_addfile, interval_ingest_keys_addfile);
value->append(buf);
// Compact
uint64_t compact_bytes_read = 0;
uint64_t compact_bytes_write = 0;
uint64_t compact_micros = 0;
for (int level = 0; level < number_levels_; level++) {
compact_bytes_read += comp_stats_[level].bytes_read_output_level +
comp_stats_[level].bytes_read_non_output_levels +
comp_stats_[level].bytes_read_blob;
compact_bytes_write += comp_stats_[level].bytes_written +
comp_stats_[level].bytes_written_blob;
compact_micros += comp_stats_[level].micros;
}
snprintf(buf, sizeof(buf),
"Cumulative compaction: %.2f GB write, %.2f MB/s write, "
"%.2f GB read, %.2f MB/s read, %.1f seconds\n",
compact_bytes_write / kGB,
compact_bytes_write / kMB / std::max(seconds_up, 0.001),
compact_bytes_read / kGB,
compact_bytes_read / kMB / std::max(seconds_up, 0.001),
compact_micros / kMicrosInSec);
value->append(buf);
// Compaction interval
uint64_t interval_compact_bytes_write =
compact_bytes_write - cf_stats_snapshot_.compact_bytes_write;
uint64_t interval_compact_bytes_read =
compact_bytes_read - cf_stats_snapshot_.compact_bytes_read;
uint64_t interval_compact_micros =
compact_micros - cf_stats_snapshot_.compact_micros;
snprintf(
buf, sizeof(buf),
"Interval compaction: %.2f GB write, %.2f MB/s write, "
"%.2f GB read, %.2f MB/s read, %.1f seconds\n",
interval_compact_bytes_write / kGB,
interval_compact_bytes_write / kMB / std::max(interval_seconds_up, 0.001),
interval_compact_bytes_read / kGB,
interval_compact_bytes_read / kMB / std::max(interval_seconds_up, 0.001),
interval_compact_micros / kMicrosInSec);
value->append(buf);
snprintf(buf, sizeof(buf),
"Estimated pending compaction bytes: %" PRIu64 "\n",
vstorage->estimated_compaction_needed_bytes());
value->append(buf);
if (is_periodic) {
cf_stats_snapshot_.compact_bytes_write = compact_bytes_write;
cf_stats_snapshot_.compact_bytes_read = compact_bytes_read;
cf_stats_snapshot_.compact_micros = compact_micros;
}
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300) Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 16:51:58 +00:00
std::string write_stall_stats;
uint64_t total_stall_count;
DumpCFStatsWriteStall(&write_stall_stats, &total_stall_count);
value->append(write_stall_stats);
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
if (is_periodic) {
cf_stats_snapshot_.seconds_up = seconds_up;
cf_stats_snapshot_.ingest_bytes_flush = flush_ingest;
cf_stats_snapshot_.ingest_bytes_addfile = add_file_ingest;
cf_stats_snapshot_.ingest_files_addfile = ingest_files_addfile;
cf_stats_snapshot_.ingest_l0_files_addfile = ingest_l0_files_addfile;
cf_stats_snapshot_.ingest_keys_addfile = ingest_keys_addfile;
cf_stats_snapshot_.comp_stats = compaction_stats_sum;
cf_stats_snapshot_.stall_count = total_stall_count;
}
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
Don't hold DB mutex for block cache entry stat scans (#8538) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 21:12:06 +00:00
// Do not gather cache entry stats during CFStats because DB
// mutex is held. Only dump last cached collection (rely on DB
// periodic stats dump to update)
if (cache_entry_stats_collector_) {
CacheEntryRoleStats stats;
// thread safe
cache_entry_stats_collector_->GetStats(&stats);
constexpr uint64_t kDayInMicros = uint64_t{86400} * 1000000U;
// Skip if stats are extremely old (> 1 day, incl not yet populated)
if (now_micros - stats.last_end_time_micros_ < kDayInMicros) {
value->append(stats.ToString(clock_));
}
Use deleters to label cache entries and collect stats (#8297) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 23:45:51 +00:00
}
make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` ** Compaction Stats [default] ** Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown ** DB Stats ** Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 19:57:29 +00:00
}
void InternalStats::DumpCFFileHistogram(std::string* value) {
Introduce a blob file reader class (#7461) Summary: The patch adds a class called `BlobFileReader` that can be used to retrieve blobs using the information available in blob references (e.g. blob file number, offset, and size). This will come in handy when implementing blob support for `Get`, `MultiGet`, and iterators, and also for compaction/garbage collection. When a `BlobFileReader` object is created (using the factory method `Create`), it first checks whether the specified file is potentially valid by comparing the file size against the combined size of the blob file header and footer (files smaller than the threshold are considered malformed). Then, it opens the file, and reads and verifies the header and footer. The verification involves magic number/CRC checks as well as checking for unexpected header/footer fields, e.g. incorrect column family ID or TTL blob files. Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression type passed by the caller (because of the presence of the header and footer, the specified offset cannot be too close to the start/end of the file; also, the compression type has to match the one in the blob file header), and retrieves and potentially verifies and uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set, `BlobFileReader` reads the blob record header as well (as opposed to just the blob itself) and verifies the key/value size, the key itself, as well as the CRC of the blob record header and the key/value pair. In addition, the patch exposes the compression type from `BlobIndex` (both using an accessor and via `DebugString`), and adds a blob file read latency histogram to `InternalStats` that can be used with `BlobFileReader`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23999219 Pulled By: ltamasi fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
2020-10-07 22:43:23 +00:00
assert(value);
assert(cfd_);
std::ostringstream oss;
oss << "\n** File Read Latency Histogram By Level [" << cfd_->GetName()
<< "] **\n";
for (int level = 0; level < number_levels_; level++) {
if (!file_read_latency_[level].Empty()) {
Introduce a blob file reader class (#7461) Summary: The patch adds a class called `BlobFileReader` that can be used to retrieve blobs using the information available in blob references (e.g. blob file number, offset, and size). This will come in handy when implementing blob support for `Get`, `MultiGet`, and iterators, and also for compaction/garbage collection. When a `BlobFileReader` object is created (using the factory method `Create`), it first checks whether the specified file is potentially valid by comparing the file size against the combined size of the blob file header and footer (files smaller than the threshold are considered malformed). Then, it opens the file, and reads and verifies the header and footer. The verification involves magic number/CRC checks as well as checking for unexpected header/footer fields, e.g. incorrect column family ID or TTL blob files. Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression type passed by the caller (because of the presence of the header and footer, the specified offset cannot be too close to the start/end of the file; also, the compression type has to match the one in the blob file header), and retrieves and potentially verifies and uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set, `BlobFileReader` reads the blob record header as well (as opposed to just the blob itself) and verifies the key/value size, the key itself, as well as the CRC of the blob record header and the key/value pair. In addition, the patch exposes the compression type from `BlobIndex` (both using an accessor and via `DebugString`), and adds a blob file read latency histogram to `InternalStats` that can be used with `BlobFileReader`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23999219 Pulled By: ltamasi fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
2020-10-07 22:43:23 +00:00
oss << "** Level " << level << " read latency histogram (micros):\n"
<< file_read_latency_[level].ToString() << '\n';
}
}
Introduce a blob file reader class (#7461) Summary: The patch adds a class called `BlobFileReader` that can be used to retrieve blobs using the information available in blob references (e.g. blob file number, offset, and size). This will come in handy when implementing blob support for `Get`, `MultiGet`, and iterators, and also for compaction/garbage collection. When a `BlobFileReader` object is created (using the factory method `Create`), it first checks whether the specified file is potentially valid by comparing the file size against the combined size of the blob file header and footer (files smaller than the threshold are considered malformed). Then, it opens the file, and reads and verifies the header and footer. The verification involves magic number/CRC checks as well as checking for unexpected header/footer fields, e.g. incorrect column family ID or TTL blob files. Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression type passed by the caller (because of the presence of the header and footer, the specified offset cannot be too close to the start/end of the file; also, the compression type has to match the one in the blob file header), and retrieves and potentially verifies and uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set, `BlobFileReader` reads the blob record header as well (as opposed to just the blob itself) and verifies the key/value size, the key itself, as well as the CRC of the blob record header and the key/value pair. In addition, the patch exposes the compression type from `BlobIndex` (both using an accessor and via `DebugString`), and adds a blob file read latency histogram to `InternalStats` that can be used with `BlobFileReader`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23999219 Pulled By: ltamasi fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
2020-10-07 22:43:23 +00:00
if (!blob_file_read_latency_.Empty()) {
oss << "** Blob file read latency histogram (micros):\n"
<< blob_file_read_latency_.ToString() << '\n';
}
value->append(oss.str());
}
namespace {
class SumPropertyAggregator : public IntPropertyAggregator {
public:
SumPropertyAggregator() : aggregated_value_(0) {}
virtual ~SumPropertyAggregator() override = default;
void Add(ColumnFamilyData* cfd, uint64_t value) override {
(void)cfd;
aggregated_value_ += value;
}
uint64_t Aggregate() const override { return aggregated_value_; }
private:
uint64_t aggregated_value_;
};
// A block cache may be shared by multiple column families.
// BlockCachePropertyAggregator ensures that the same cache is only added once.
class BlockCachePropertyAggregator : public IntPropertyAggregator {
public:
BlockCachePropertyAggregator() = default;
virtual ~BlockCachePropertyAggregator() override = default;
void Add(ColumnFamilyData* cfd, uint64_t value) override {
Refactor `table_factory` into MutableCFOptions (#13077) Summary: This is setting up for a fix to a data race in SetOptions on BlockBasedTableOptions (BBTO), https://github.com/facebook/rocksdb/issues/10079 The race will be fixed by replacing `table_factory` with a modified copy whenever we want to modify a BBTO field. An argument could be made that this change creates more entaglement between features (e.g. BlobSource <-> MutableCFOptions), rather than (conceptually) minimizing the dependencies of each feature, but * Most of these things already depended on ImmutableOptions * Historically there has been a lot of plumbing (and possible small CPU overhead) involved in adding features that need to reach a lot of places, like `block_protection_bytes_per_key`. Keeping those wrapped up in options simplifies that. * SuperVersion management generally takes care of lifetime management of MutableCFOptions, so is not that difficult. (Crash test agrees so far.) There are some FIXME places where it is known to be unsafe to replace `block_cache` unless/until we handle shared_ptr tracking properly. HOWEVER, replacing `block_cache` is generally dubious, at least while existing users of the old block cache (e.g. table readers) can continue indefinitely. The change to cf_options.cc is essentially just moving code (not changing). I'm not concerned about the performance of copying another shared_ptr with MutableCFOptions, but I left a note about considering an improvement if more shared_ptr are added to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13077 Test Plan: existing tests, crash test. Unit test DBOptionsTest.GetLatestCFOptions updated with some temporary logic. MemoryTest required some refactoring (simplification) for the change. Reviewed By: cbi42 Differential Revision: D64546903 Pulled By: pdillinger fbshipit-source-id: 69ae97ce5cf4c01b58edc4c5d4687eb1e5bf5855
2024-10-17 21:13:20 +00:00
auto* table_factory =
cfd->GetCurrentMutableCFOptions()->table_factory.get();
assert(table_factory != nullptr);
Cache* cache =
table_factory->GetOptions<Cache>(TableFactory::kBlockCacheOpts());
if (cache != nullptr) {
block_cache_properties_.emplace(cache, value);
}
}
uint64_t Aggregate() const override {
uint64_t sum = 0;
for (const auto& p : block_cache_properties_) {
sum += p.second;
}
return sum;
}
private:
std::unordered_map<Cache*, uint64_t> block_cache_properties_;
};
} // anonymous namespace
std::unique_ptr<IntPropertyAggregator> CreateIntPropertyAggregator(
const Slice& property) {
if (property == DB::Properties::kBlockCacheCapacity ||
property == DB::Properties::kBlockCacheUsage ||
property == DB::Properties::kBlockCachePinnedUsage) {
return std::make_unique<BlockCachePropertyAggregator>();
} else {
return std::make_unique<SumPropertyAggregator>();
}
}
} // namespace ROCKSDB_NAMESPACE