Summary: As titled. Also added the kBaseLevel string, which was missing earlier.
Test Plan: built
Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53271
Summary:
There were just these two properties that didn't have any named
constant.
Test Plan:
build and below test
$ ./db_properties_test --gtest_filter=DBPropertiesTest.NumImmutableMemTable
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53103
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.
Test Plan:
1. Add a new unit test.
2. Run db_bench with
./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics
based on hard drive and see stopping is avoided with the commit.
Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52047
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.
Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51117
Summary:
Add rocksdb.num-running-compactions and rocksdb.num-running-flushes
to GetIntProperty() that reports the number of currently running
compactions / flushes.
Test Plan: augmented existing tests in db_test
Reviewers: igor, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48693
Summary:
This patch fixes#7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.
Test Plan: Add unit test DBTest.HardLimit
Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45999
Summary:
Although there are currently counters to keep track of the
stall caused by having too many L0 files, there is no distinction as
to whether when that stall occurs either (A) L0-L1 compaction is taking
place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
at the moment. This diff adds a counter for (A) so that the nature of L0
stalls can be better understood.
Test Plan: make all && make check
Reviewers: sdong, igor, anthony, noetzli, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D46749
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).
Test Plan: make clean check all
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45993
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.
Test Plan: Added tests in db_test
Reviewers: igor, sdong, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45087
Summary: As title
Test Plan: make check
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45447
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use threshold of this counter to replace soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about when should slow down and stopping than more abstract soft and hard rate limits.
Test Plan: Add unit tests
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44205
Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions
Test Plan: Unittests for the new property
Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44799
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.
This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage by all the memtables, includes those
have been flushed but pinned by iterators.
Test Plan: Added a test in db_test
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44229
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
Summary:
While working on single delete support for db_bench, I realized that
db_bench/db_stress contain a bunch of duplicate code related to
copmression and found some typos. This patch removes duplicate code,
typos and a redundant #ifndef in internal_stats.cc.
Test Plan: make db_stress && make db_bench && ./db_bench --benchmarks=compress,uncompress
Reviewers: yhchiang, sdong, rven, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43965
Summary:
Fixes T6548822. Added a new function for estimating the size of the live data
as proposed in the task. The value can be accessed through the property
rocksdb.estimate-live-data-size.
Test Plan:
There are two unit tests in version_set_test and a simple test in db_test.
make version_set_test && ./version_set_test;
make db_test && ./db_test gtest_filter=GetProperty
Reviewers: rven, igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41493
Summary:
Universal compaction can involves in multiple levels. However,
the current implementation of bytes_readn and bytes_readnp1
(and some other stats with postfix `n` and `np1`) assumes compaction
can only have two levels.
This patch fixes this bug and redefines bytes_readn and bytes_readnp1:
* bytes_readnp1: the number of bytes read in the compaction output level.
* bytes_readn: the total number of bytes read minus bytes_readnp1
Test Plan: Add a test in compaction_job_stats_test
Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40239
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.
After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.
This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).
However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.
Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here.
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37443
Summary: We have a bug where we don't report the last level's files as being compacted. This fixes it.
Test Plan: See the fix in action here: https://phabricator.fb.com/P19845738
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38727
Summary:
See https://gist.github.com/mdcallag/89ebb2b8cbd331854865 for the IO stats.
I added "Cumulative compaction:" and "Interval compaction:" lines. The IO rates
can be confusing. Rates fro per-level stats lines, Wr(MB/s) & Rd(MB/s), are computed
using the duration of the compaction job. If the job reads 10MB, writes 9MB and the job
(IO & merging) takes 1 second then the rates are 10MB/s for read and 9MB/s for writes.
The IO rates in the Cumulative compaction line uses the total uptime. The IO rates in the
Interval compaction line uses the interval uptime. So these Cumalative & Interval
compaction IO rates cannot be compared to the per-level IO rates. But both forms of
the rates are useful for debugging perf.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D38667
Summary:
Fix the make unity build. The local stats variable name was shadowing a
global stats variable.
Test Plan:
Run the build
OPT=-DTRAVIS V=1 make unity
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D36285
Summary:
This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests
ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests,
makes the duration per-test configurable and refactors the test scripts.
Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except
the writer thread does Merge rather than Put.
Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes
it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format.
Sometimes automation and humans want different format.
Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary
stats in the output at test end.
Adds the average ingest rate to compaction IO stats. Output now looks like:
https://gist.github.com/mdcallag/2bd64d18be1b93adc494
More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1
For benchmark.sh changes default RocksDB configuration to reduce stalls:
* min_level_to_compress from 2 to 3
* hard_rate_limit from 2 to 3
* max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8
* L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop)
Task ID: #6596829
Blame Rev:
Test Plan:
run tools/run_flash_bench.sh
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D36075
Summary:
Assign the string properties to const string variables under the
DB::Properties namespace. This helps catch typos during compilation and
also consolidates the property definition in one place.
Test Plan: Run rocksdb unit tests
Reviewers: sdong, yoshinorim, igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D35991
Summary: Add a DB property for number of deletions in memtables. It can sometimes help people debug slowness because of too many deletes.
Test Plan: Add test cases.
Reviewers: rven, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D35247
Summary:
The output did not have space for 6-digit file counts or for 3-digit
counts of files being compacted. This adds space for that while preserving
existing alignment. See https://gist.github.com/mdcallag/0a61c6a18dd467224c11
Task ID: #
Blame Rev:
Test Plan:
run db_bench, look at output
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D35091
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases.
Test Plan: make all check
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33327
Summary:
Add a counter in SnapshotList to show number of snapshots. Also a unix timestamp in every snapshot.
Add two DB Properties to return number of snapshots and timestamp of the oldest one.
Test Plan: Add unit test checking
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D29919
Summary: Block internal_stats in ROCKSDB_LITE.
Test Plan: make OPT=-DROCKSDB_LITE shared_lib
Reviewers: ljin, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29541
Summary:
In some environment such as android, the c++ library does not have
std::to_string. This path adds rocksdb::ToString(), which wraps std::to_string
when std::to_string is not available, and implements std::to_string
in the other case.
Test Plan:
make dbg -j32
./db_test
make clean
make dbg OPT=-DOS_ANDROID -j32
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29181
Summary:
The very last reference happens in DBImpl::GetOptions()
I built with both DBImpl::GetOptions() and ColumnFamilyData::options() commented out
Test Plan: make all check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29073
Summary: As a short-term fix, let's go back to previous way of calculating NeedsCompaction(). SIGSEGV happens because NeedsCompaction() can happen before super_version (and thus MutableCFOptions) is initialized.
Test Plan: make check
Reviewers: ljin, sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28875
Summary:
Move NeedsCompaction() from VersionStorageInfo to CompactionPicker
to allow different compaction strategy to have their own way to
determine whether doing compaction is necessary.
When compaction style is set to kCompactionStyleNone, then
NeedsCompaction() will always return false.
Test Plan:
export ROCKSDB_TESTS=Compact
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28719
Summary:
1. fix possible overflow of the two stats by using uint64_t
2. use a similar source of data to calculate RecordDrop. Previous one is not correct.
Test Plan: See outputs of db_bench settings, and the results look reasonable
Reviewers: MarkCallaghan, ljin, igor
Reviewed By: igor
Subscribers: rven, leveldb, yhchiang, dhruba
Differential Revision: https://reviews.facebook.net/D28155
Summary: Enforce the accessier naming convention in functions in version_set.h
Test Plan: make all check
Reviewers: ljin, yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D28143
Summary: as title
Test Plan: make release
Reviewers: sdong, yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28053
Summary: now we use %8d for RecordIn and %10d for RecordDrop, which is far too small for some use cases. Extend both of them to %12d.
Test Plan: run one test in db_test and see the LOG file.
Reviewers: igor, MarkCallaghan, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D28041
Summary:
Rename Version::Builder to VersionBuilder and expose its definition to a header.
Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo.
Add version_builder_test which has a simple test.
Test Plan: make all check
Reviewers: rven, yhchiang, igor, ljin
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D27969
Summary:
Make compaction picker easier to test.
The basic idea is to separate a minimum subcomponent of Version to VersionStorageInfo, which just responsible to LSM tree. A stub VersionStorageInfo can then be easily created and passed into compaction picker so that we can check the outputs.
It now passes most tests. Still two things need to be done:
(1) deal with the FIFO compaction's file size.
(2) write an example test to make sure the interface can do the job.
Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested.
Test Plan:
Pass all unit tests and compaction_picker_test
Reviewers: yhchiang, rven, igor, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D27639