Commit graph

73 commits

Author SHA1 Message Date
Yi Wu bf937cf15b Add "rocksdb.live-sst-files-size" DB property
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548

Differential Revision: D7116939

Pulled By: yiwu-arbug

fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
2018-03-01 18:01:10 -08:00
Yi Wu 66a2c44ef4 Add DB::Properties::kEstimateOldestKeyTime
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
2017-10-23 15:27:27 -07:00
Andrew Kryczka 3cd7ea2e8a rename stall-related internal stats
Summary:
Some of these names, like `MEMTABLE_COMPACTION`, did not mean anything. Tried to give them descriptive names.
Closes https://github.com/facebook/rocksdb/pull/2852

Differential Revision: D5782822

Pulled By: ajkr

fbshipit-source-id: f2695c4124af4073da4492d7135bae2411220f3a
2017-09-07 18:26:18 -07:00
Artem Danilov 8a6708f5f2 Extend property map with compaction stats
Summary:
This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch.
Closes https://github.com/facebook/rocksdb/pull/2794

Differential Revision: D5717676

Pulled By: Tema

fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
2017-08-30 15:26:55 -07:00
Siying Dong 3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Maysam Yabandeh 7604b463b5 Update the AddDBStats in LITE
Summary: Closes https://github.com/facebook/rocksdb/pull/2525

Differential Revision: D5356859

Pulled By: maysamyabandeh

fbshipit-source-id: f593adad2a8aab12dcd6ab25db076eca51d30d34
2017-06-30 10:56:50 -07:00
Maysam Yabandeh e9f91a5176 Add a fetch_add variation to AddDBStats
Summary:
AddDBStats is in two steps of load and store, which is more efficient than fetch_add. This is however not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex which is less efficient that fetch_add.

This patch adds the option to do fetch_add when AddDBStats. The results for my 2pc benchmark on sysbench is:
- vanilla: 68618 tps
- removing mutex on AddDBStats (unsafe): 69767 tps
- fetch_add for all AddDBStats: 69200 tps
- fetch_add only for concurrently access AddDBStats (this patch): 69579 tps
Closes https://github.com/facebook/rocksdb/pull/2505

Differential Revision: D5330656

Pulled By: maysamyabandeh

fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383
2017-06-29 17:12:19 -07:00
Siying Dong d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Tomas Kolda 04d58970cb AIX and Solaris Sparc Support
Summary:
Replacement of #2147

The change was squashed due to a lot of conflicts.
Closes https://github.com/facebook/rocksdb/pull/2194

Differential Revision: D4929799

Pulled By: siying

fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581
2017-04-21 20:48:04 -07:00
Siying Dong c49d704656 Add DB:ResetStats()
Summary:
Add a function to allow users to reset internal stats without restarting the DB.
Closes https://github.com/facebook/rocksdb/pull/2167

Differential Revision: D4907939

Pulled By: siying

fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68
2017-04-18 16:56:48 -07:00
Siying Dong 8f47a97512 File level histogram should be printed per CF, not per DB
Summary:
Currently level histogram is only printed out for DB stats and for default CF. This is confusing. Change to print for every CF instead.
Closes https://github.com/facebook/rocksdb/pull/2126

Differential Revision: D4865373

Pulled By: siying

fbshipit-source-id: 1c853e0ac66e00120ee931cabc9daf69ccc2d577
2017-04-11 08:42:03 -07:00
Islam AbdelRahman c50e3750dc Use a human readable size for level report
Summary:
Current
```
** Compaction Stats [default] **
Level    Files   Size(MB} Score Read(GB}  Rn(GB} Rnp1(GB} Write(GB} Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt} Avg(sec} KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      2/0      49.02   0.5      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0     76.1         1         2    0.322       0      0
 Sum      2/0      49.02   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     76.1         1         2    0.322       0      0
 Int      0/0       0.00   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     76.1         1         2    0.322       0      0
```

New
```
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn Key
Closes https://github.com/facebook/rocksdb/pull/2055

Differential Revision: D4804576

Pulled By: IslamAbdelRahman

fbshipit-source-id: 719be6a
2017-04-05 17:24:19 -07:00
Siying Dong 67d7623794 Expose the stalling information through DB::GetProperty()
Summary:
Add two DB properties: rocksdb.actual_delayed_write_rate and rocksdb.is_write_stooped, for people to know whether current writes are being throttled.
Closes https://github.com/facebook/rocksdb/pull/2043

Differential Revision: D4782975

Pulled By: siying

fbshipit-source-id: 6b2f5cf
2017-03-29 11:54:20 -07:00
Maysam Yabandeh c4a37dcb44 Print the missed last layer in cfstats
Summary:
Printing compaction stats used to operate on two variable:
number_levels_: for printing the layer
num_levels_to_check: for updating the compaction score

After this commit: 361010d447 these two are mixed up and as a result the last layer might not be printed out: https://fb.facebook.com/groups/rocksdb.internal/permalink/1315716625143616/

number_levels_ was used to decide which layers to print: 672300f47f/db/internal_stats.cc (L753) but after the patch it is based on the return value of DumpCFMapStats 361010d447/db/internal_stats.cc (L929) which returns num_levels_to_check: 361010d447/db/internal_stats.cc (L917)
Closes https://github.com/facebook/rocksdb/pull/1853

Differential Revision: D4529280

Pulled By: maysamyabandeh

fbshipit-source-id: 3fd9448
2017-02-08 10:39:15 -08:00
Siying Dong 438f22bc56 Fix bug of Checkpoint loses recent transactions with 2PC
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724

Differential Revision: D4368319

Pulled By: siying

fbshipit-source-id: cc2c746
2016-12-28 12:24:16 -08:00
Maysam Yabandeh 361010d447 Exporting compaction stats in the form of a map
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477

Differential Revision: D4149836

Pulled By: maysamyabandeh

fbshipit-source-id: b3df19f
2016-11-11 20:54:14 -08:00
Islam AbdelRahman abc0ae462b Add AddFile() InternalStats for Total files/L0 files/total keys ingested
Summary:
Report more information about the ingested files in CF InternalStats
- Total files
- Total L0 files
- Total keys

There was also noticed that we were reporting files that failed to ingest, fix this bug

Test Plan: print stats in tests

Reviewers: sdong, andrewkr, lightmark

Reviewed By: lightmark

Subscribers: jkedgar, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D63039
2016-09-21 14:24:08 -07:00
Islam AbdelRahman 30a24f2d3d Add InternalStats and logging for AddFile()
Summary:
We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect
Update InternalStats and add logging for AddFile()

Test Plan: Make sure the code compile and existing tests pass

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59763
2016-06-16 16:21:41 -07:00
sdong 7b78d623f7 Shouldn't report default column family's compaction stats as DB compaction stats
Summary:
Now we collect compaction stats per column family, but report default colum family's stat as compaction stats for DB.
Fix it by reporting compaction stats per column family instead.

Test Plan: Run db_bench with --num_column_families=4 and see the number fixed.

Reviewers: IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57063
2016-04-25 13:56:59 -07:00
Andrew Kryczka 73a847ef89 Add per-level compression ratio property
Summary:
This is needed so we can measure compression ratio improvements
achieved by D52287.

The property compares raw data size against the total file size for a given
level. If the level is empty it should return 0.0.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56967
2016-04-20 18:46:54 -07:00
sdong 294bdf9ee2 Change Property name from "rocksdb.current_version_number" to "rocksdb.current-super-version-number"
Summary: I realized I again is wrong about the naming convention. Let me change it to the correct one.

Test Plan: Run unit tests.

Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55041
2016-03-04 18:15:29 -08:00
sdong 432f3adf2c Add DB Property "rocksdb.current_version_number"
Summary: Add a DB Property "rocksdb.current_version_number" for users to monitor version changes and stale iterators.

Test Plan: Add a unit test.

Reviewers: andrewkr, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54927
2016-03-01 10:55:40 -08:00
Baraa Hamodi 21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Andrew Kryczka 284aa613a7 Eliminate duplicated property constants
Summary:
Before this diff, there were duplicated constants to refer to properties (user-
facing API had strings and InternalStats had an enum). I noticed these were
inconsistent in terms of which constants are provided, names of constants, and
documentation of constants. Overall it seemed annoying/error-prone to maintain
these duplicated constants.

So, this diff gets rid of InternalStats's constants and replaces them with a map
keyed on the user-facing constant. The value in that map contains a function
pointer to get the property value, so we don't need to do string matching while
holding db->mutex_. This approach has a side benefit of making many small
handler functions rather than a giant switch-statement.

Test Plan: db_properties_test passes, running "make commit-prereq -j32"

Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53253
2016-02-02 19:14:56 -08:00
Nathan Bronson 7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
sdong d72b31774e Slowdown when writing to the last write buffer
Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.

Test Plan:
1. Add a new unit test.
2. Run db_bench with

./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics

based on hard drive and see stopping is avoided with the commit.

Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52047
2015-12-17 10:49:08 -08:00
sdong 56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
Yueh-Hsuan Chiang ad471453e8 Allow GetProperty to report the number of currently running flushes / compactions.
Summary:
Add rocksdb.num-running-compactions and rocksdb.num-running-flushes
to GetIntProperty() that reports the number of currently running
compactions / flushes.

Test Plan: augmented existing tests in db_test

Reviewers: igor, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48693
2015-10-17 00:16:36 -07:00
Yueh-Hsuan Chiang 2b925ccb43 Correct the comments in db/internal_stats.h
Summary: Correct the comments in db/internal_stats.h

Test Plan: no code change

Reviewers: igor, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48675
2015-10-13 17:23:40 -07:00
sdong 5de807ac16 Add options.hard_pending_compaction_bytes_limit to stop writes if compaction lagging behind
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.

Test Plan: Add unit test DBTest.HardLimit

Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45999
2015-09-14 12:51:16 -07:00
Ari Ekmekji 03ddce9a01 Add counters for L0 stall while L0-L1 compaction is taking place
Summary:
Although there are currently counters to keep track of the
stall caused by having too many L0 files, there is no distinction as
to whether when that stall occurs either (A) L0-L1 compaction is taking
place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
at the moment. This diff adds a counter for (A) so that the nature of L0
stalls can be better understood.

Test Plan: make all && make check

Reviewers: sdong, igor, anthony, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D46749
2015-09-14 11:03:37 -07:00
Yueh-Hsuan Chiang 6996de87af Expose per-level aggregated table properties via GetProperty()
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.

Test Plan: Added tests in db_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45087
2015-08-25 12:03:54 -07:00
sdong 07d2d34160 Add a counter about estimated pending compaction bytes
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use threshold of this counter to replace soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about when should slow down and stopping than more abstract soft and hard rate limits.

Test Plan: Add unit tests

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44205
2015-08-20 22:17:10 -07:00
Islam AbdelRahman 027ca5b2cd Total SST files size DB Property
Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions

Test Plan: Unittests for the new property

Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44799
2015-08-20 11:47:19 -07:00
Yueh-Hsuan Chiang df79eafcb3 Introduce GetIntProperty("rocksdb.size-all-mem-tables")
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.

This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage by all the memtables, includes those
have been flushed but pinned by iterators.

Test Plan: Added a test in db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44229
2015-08-19 13:32:09 -07:00
sdong 72613657f0 Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.

Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected

Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44193
2015-08-14 17:32:42 -07:00
Andres Notzli 06aebca592 Report live data size estimate
Summary:
Fixes T6548822. Added a new function for estimating the size of the live data
as proposed in the task. The value can be accessed through the property
rocksdb.estimate-live-data-size.

Test Plan:
There are two unit tests in version_set_test and a simple test in db_test.
make version_set_test && ./version_set_test;
make db_test && ./db_test gtest_filter=GetProperty

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41493
2015-07-21 21:33:20 -07:00
Yueh-Hsuan Chiang bb1c74ce18 Fixed a bug of CompactionStats in multi-level universal compaction case
Summary:
Universal compaction can involves in multiple levels.  However,
the current implementation of bytes_readn and bytes_readnp1
(and some other stats with postfix `n` and `np1`) assumes compaction
can only have two levels.

This patch fixes this bug and redefines bytes_readn and bytes_readnp1:
* bytes_readnp1: the number of bytes read in the compaction output level.
* bytes_readn: the total number of bytes read minus bytes_readnp1

Test Plan: Add a test in compaction_job_stats_test

Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40239
2015-06-17 23:40:34 -07:00
agiardullo c815351038 Support saving history in memtable_list
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts.  But after flushing, we don't have any memtables, and transactions could fail to commit.  So we want to someone keep around some extra history to use for conflict checking.  In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.

After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure).  It seems like the best place for this is abstracted inside the memtable_list.  I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.

This diff adds a new parameter to control how much memtable history to keep around after flushing.  However, it sounds like people aren't too fond of adding new parameters.  So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers.  This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit.  (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached).  So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).

However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.

Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests.  Added testing in memtablelist_test and planning on adding more testing here.

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37443
2015-05-28 16:34:24 -07:00
Mark Callaghan 944043d683 Add --wal_bytes_per_sync for db_bench and more IO stats
Summary:
See https://gist.github.com/mdcallag/89ebb2b8cbd331854865 for the IO stats.
I added "Cumulative compaction:" and "Interval compaction:" lines. The IO rates
can be confusing. Rates fro per-level stats lines, Wr(MB/s) & Rd(MB/s), are computed
using the duration of the compaction job. If the job reads 10MB, writes 9MB and the job
(IO & merging) takes 1 second then the rates are 10MB/s for read and 9MB/s for writes.
The IO rates in the Cumulative compaction line uses the total uptime. The IO rates in the
Interval compaction line uses the interval uptime. So these Cumalative & Interval
compaction IO rates cannot be compared to the per-level IO rates. But both forms of
the rates are useful for debugging perf.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D38667
2015-05-19 16:19:30 -07:00
Yueh-Hsuan Chiang d2a056241a Fix a compilation error in ROCKSDB_LITE in db/internal_stats.h
Summary:
Fix a compilation error in ROCKSDB_LITE in db/internal_stats.h

Other compilation errors will be fixed in a separate diff.

Test Plan: make OPT=-DROCKSDB_LITE

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36807
2015-04-09 16:43:54 -07:00
sdong 0831a35994 Add a DB Property For Number of Deletions in Memtables
Summary: Add a DB property for number of deletions in memtables. It can sometimes help people debug slowness because of too many deletes.

Test Plan: Add test cases.

Reviewers: rven, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D35247
2015-03-18 17:03:59 -07:00
Mark Callaghan c8da670325 Stop printing per-level stall times.
Summary:
Per-level stall times are the suggested stall time, not the actual stall time so this change stops printing them
both in the per-level output lines and in the summary. Also changed output for total stall time to include units
in all cases. The new output looks like:
Level   Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt)    RecordIn   RecordDrop
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     4/1          7   0.8      0.0     0.0      0.0       0.6      0.6       0.0   0.0      0.0     12.9        50       352    0.141        882            0            0
  L1     5/0          9   0.9      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000          0            0            0
  L2    54/0         99   1.0      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000          0            0            0
  L3   289/0        527   0.5      0.0     0.0      0.0       0.0      0.0       0.5   0.0      0.0      0.0         0         0    0.000          0            0            0
 Sum   352/1        642   0.0      0.0     0.0      0.0       0.6      0.6       1.7   1.0      0.0     12.9        50       352    0.141        882            0            0
 Int     0/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     15.5         0         3    0.118          7            0            0
Flush(GB): accumulative 0.627, interval 0.005
Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 882 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

Task ID: #6493861

Blame Rev:

Test Plan:
run db_bench, look at output

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D35085
2015-03-14 15:01:43 -07:00
Igor Canadi db03739340 options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.

In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.

Test Plan: New unit tests and pass tests suites including valgrind.

Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo

Reviewed By: ikabiljo

Subscribers: yoshinorim, ikabiljo, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31437
2015-03-02 22:40:41 -08:00
sdong d45a6a4002 Add rocksdb.num-live-versions: number of live versions
Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases.

Test Plan: make all check

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33327
2015-02-19 13:10:37 -08:00
sdong 9132e52ea4 DB Stats Dump to print total stall time
Summary:
Add printing of stall time in DB Stats:

Sample outputs:

** DB Stats **
Uptime(secs): 53.2 total, 1.7 interval
Cumulative writes: 625940 writes, 625939 keys, 625940 batches, 1.0 writes per batch, 0.49 GB user ingest, stall micros: 50691070
Cumulative WAL: 625940 writes, 625939 syncs, 1.00 writes per sync, 0.49 GB written
Interval writes: 10859 writes, 10859 keys, 10859 batches, 1.0 writes per batch, 8.7 MB user ingest, stall micros: 1692319
Interval WAL: 10859 writes, 10859 syncs, 1.00 writes per sync, 0.01 MB written

Test Plan:
make all check
verify printing using db_bench

Reviewers: igor, yhchiang, rven, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D31239
2015-01-09 11:44:19 -08:00
sdong 73ee4febab Add comments about properties supported by DB::GetProperty() and DB::GetIntProperty()
Summary: Add comments in db.h to help users discover their options.

Test Plan: Compile

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31077
2015-01-07 15:09:35 -08:00
sdong 1f04066cab Add DBProperty to return number of snapshots and time for oldest snapshot
Summary:
Add a counter in SnapshotList to show number of snapshots. Also a unix timestamp in every snapshot.
Add two DB Properties to return number of snapshots and timestamp of the oldest one.

Test Plan: Add unit test checking

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D29919
2014-12-05 17:07:49 -08:00
Mark Callaghan 32a0a03844 Add Moved(GB) to Compaction IO stats
Summary:
Adds counter for bytes moved (files pushed down a level rather than compacted) to compaction
IO stats as Moved(GB). From the output removed these infrequently used columns: RW-Amp, Rn(cnt), Rnp1(cnt),
Wnp1(cnt), Wnew(cnt).
Example old output:
Level   Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s)  Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt)  Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) RecordIn RecordDrop
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     0/0          0   0.0      0.0     0.0      0.0    2130.8   2130.8    0.0   0.0      0.0    109.1        0         0         0         0      20002     25068    0.798      28.75     182059    0.16       0          0
  L1   142/0        509   1.0   4618.5  2036.5   2582.0    4602.1   2020.2    4.5   2.3     88.5     88.1    24220    701246   1215528    514282      53466      4229   12.643       0.00          0    0.002032745988  300688729

Example new output:
Level   Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)     RecordIn   RecordDrop
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     7/0         13   1.8      0.0     0.0      0.0       0.6      0.6       0.0   0.0      0.0     14.7        44       353    0.124       0.03        626    0.05            0            0
  L1     9/0         16   1.6      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000       0.00          0    0.00            0            0

Task ID: #

Blame Rev:

Test Plan:
make check, run db_bench --fillseq --stats_per_interval --stats_interval and look at output

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D29787
2014-12-03 18:28:39 -08:00
Yueh-Hsuan Chiang 274ba62707 Block internal_stats in ROCKSDB_LITE
Summary: Block internal_stats in ROCKSDB_LITE.

Test Plan: make OPT=-DROCKSDB_LITE shared_lib

Reviewers: ljin, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29541
2014-11-25 12:01:27 -08:00