Commit Graph

311 Commits

Author SHA1 Message Date
Mark Callaghan ed229a0dee Fixes for readcache-flashcache
Summary:
This fixes two problems:
1) the env should not be created twice when use_existing_db is false
2) the env dtor should run before cachedev_fd_ is closed.

Task ID: #

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36795
2015-04-09 15:51:34 -07:00
Yoshinori Matsunobu f12614070f Fix TSAN build error of D36447
Summary:
D36447 caused build error when using COMPILE_WITH_TSAN=1.
This diff fixes the error.

Test Plan: jenkins

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36579
2015-04-06 17:37:36 -07:00
Yoshinori Matsunobu 824e646341 Adding another NewFlashcacheAwareEnv function to support pre-opened fd
Summary:
There are some cases when flachcache file descriptor was
already allocated (i.e. fb-MySQL). Then NewFlashcacheAwareEnv returns an
error at open() because fd was already assigned. This diff adds another
function to instantiate FlashcacheAwareEnv, with pre-allocated fd cachedev_fd.

Test Plan: Tested with MyRocks using this function, then worked

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, MarkCallaghan, rven

Differential Revision: https://reviews.facebook.net/D36447
2015-04-06 16:50:36 -07:00
Mark Callaghan 1bd70fb54a Add --stats_interval_seconds to db_bench
Summary:
The --stats_interval_seconds determines interval for stats reporting
and overrides --stats_interval when set. I also changed tools/benchmark.sh
to report stats every 60 seconds so I can avoid trying to figure out a
good value for --stats_interval per test and per storage device.

Task ID: #6631621

Blame Rev:

Test Plan:
run tools/run_flash_bench, look at output

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36189
2015-03-30 12:58:32 -07:00
Mark Callaghan 99ec2412e5 Make the benchmark scripts configurable and add tests
Summary:
This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests
ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests,
makes the duration per-test configurable and refactors the test scripts.

Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except
the writer thread does Merge rather than Put.

Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes
it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format.
Sometimes automation and humans want different format.

Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary
stats in the output at test end.

Adds the average ingest rate to compaction IO stats. Output now looks like:
https://gist.github.com/mdcallag/2bd64d18be1b93adc494

More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1

For benchmark.sh changes default RocksDB configuration to reduce stalls:
* min_level_to_compress from 2 to 3
* hard_rate_limit from 2 to 3
* max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8
* L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop)

Task ID: #6596829

Blame Rev:

Test Plan:
run tools/run_flash_bench.sh

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 11:28:25 -07:00
Igor Canadi d61cb0b9de db_bench can now disable flashcache for background threads
Summary: Most of the approach is copied from WebSQL's MySQL branch. It's nice that we can do this without touching core RocksDB code.

Test Plan: Compiles and runs. Didn't test flashback code, as I don't have flashback device and most if it is c/p

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: rven, lgalanis, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35391
2015-03-30 09:51:11 -07:00
Alexander.Mikhaylov a3e4b32483 fix compilation error (same as fix #284)
[maa@srv2-nskb-devg2 rocksdb-master]$ CXX=/usr/local/CC/gcc-4.7.4/bin/g++ EXTRA_CXXFLAGS=-std=c++11 DISABLE_WARNING_AS_ERROR=1  make db_bench
  CC       db/db_bench.o
db/db_bench.cc: In member function 'rocksdb::Slice rocksdb::Benchmark::AllocateKey(std::unique_ptr<const char []>*)':
db/db_bench.cc:1434:41: error: use of deleted function 'void std::unique_ptr<_Tp [], _Dp>::reset(_Up) [with _Up = char*; _Tp = const char; _Dp = std::default_delete<const char []>]'
In file included from /usr/local/CC/gcc-4.7.4/lib/gcc/x86_64-unknown-linux-gnu/4.7.4/../../../../include/c++/4.7.4/memory:86:0,
                 from ./include/rocksdb/db.h:14,
                 from ./db/dbformat.h:14,
                 from ./db/db_impl.h:21,
                 from db/db_bench.cc:33:
2015-03-26 14:53:42 +06:00
Yueh-Hsuan Chiang 248c063ba1 Report elapsed time in micros in ThreadStatus instead of start time.
Summary:
Report elapsed time of a thread operation in micros in ThreadStatus
instead of start time of a thread operation in seconds since the
Epoch, 1970-01-01 00:00:00 (UTC).

Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
--max_background_compactions=10 --max_background_flushes=3 \
--thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
--num_column_families=10

Sample Output:
            ThreadID ThreadType                    cfName    Operation  ElapsedTime                                         Stage        State
     140667724562496   High Pri column_family_name_000002        Flush   772.419 ms                    FlushJob::WriteLevel0Table
     140667728756800   High Pri                   default        Flush   617.845 ms                    FlushJob::WriteLevel0Table
     140667732951104   High Pri column_family_name_000005        Flush   772.078 ms                    FlushJob::WriteLevel0Table
     140667875557440    Low Pri column_family_name_000008   Compaction  1409.216 ms                        CompactionJob::Install
     140667737145408    Low Pri
     140667749728320    Low Pri
     140667816837184    Low Pri column_family_name_000007   Compaction  1071.815 ms      CompactionJob::ProcessKeyValueCompaction
     140667787477056    Low Pri column_family_name_000009   Compaction   772.516 ms      CompactionJob::ProcessKeyValueCompaction
     140667741339712    Low Pri
     140667758116928    Low Pri column_family_name_000004   Compaction   620.739 ms      CompactionJob::ProcessKeyValueCompaction
     140667753922624    Low Pri
     140667842003008    Low Pri column_family_name_000006   Compaction  1260.079 ms      CompactionJob::ProcessKeyValueCompaction
     140667745534016    Low Pri

Reviewers: sdong, igor, rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35769
2015-03-24 11:32:25 -07:00
Mark Callaghan dfccc7b4e2 Add readwhilemerging benchmark
Summary:
This is like readwhilewriting but uses Merge rather than Put in the writer thread.
I am using it for in-progress benchmarks. I don't think the other benchmarks for Merge
cover this behavior. The purpose for this test is to measure read performance when
readers might have to merge results. This will also benefit from work-in-progress
to add skewed key generation.

Task ID: #

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D35115
2015-03-18 13:50:52 -07:00
Igor Canadi c88ff4ca76 Deprecate removeScanCountLimit in NewLRUCache
Summary: It is no longer used by the implementation, so we should also remove it from the public API.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34971
2015-03-17 15:04:37 -07:00
Igor Canadi 417367c42d Fix SIGSEGV when not using cache 2015-03-13 16:41:00 -07:00
Igor Canadi f690712652 Speed up db_bench shutdown
Summary: See t6489044

Test Plan: compiles

Reviewers: MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34977
2015-03-13 14:45:15 -07:00
Yueh-Hsuan Chiang c594b0e89d Allow GetThreadList() to report operation stage.
Summary: Allow GetThreadList() to report operation stage.

Test Plan:
  ./thread_list_test
  ./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
    --max_background_compactions=10 --max_background_flushes=3 \
    --thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
    --num_column_families=10

  export ROCKSDB_TESTS=ThreadStatus
  ./db_test

Sample output
          ThreadID ThreadType                    cfName    Operation        OP_StartTime    ElapsedTime                                         Stage        State
   140116265861184    Low Pri
   140116270055488    Low Pri
   140116274249792   High Pri column_family_name_000005        Flush 2015/03/10-14:58:11           0 us                    FlushJob::WriteLevel0Table
   140116400078912    Low Pri column_family_name_000004   Compaction 2015/03/10-14:58:11           0 us     CompactionJob::FinishCompactionOutputFile
   140116358135872    Low Pri column_family_name_000006   Compaction 2015/03/10-14:58:10           1 us     CompactionJob::FinishCompactionOutputFile
   140116341358656    Low Pri
   140116295221312   High Pri                   default        Flush 2015/03/10-14:58:11           0 us                    FlushJob::WriteLevel0Table
   140116324581440    Low Pri column_family_name_000009   Compaction 2015/03/10-14:58:11           0 us      CompactionJob::ProcessKeyValueCompaction
   140116278444096    Low Pri
   140116299415616    Low Pri column_family_name_000008   Compaction 2015/03/10-14:58:11           0 us     CompactionJob::FinishCompactionOutputFile
   140116291027008   High Pri column_family_name_000001        Flush 2015/03/10-14:58:11           0 us                    FlushJob::WriteLevel0Table
   140116286832704    Low Pri column_family_name_000002   Compaction 2015/03/10-14:58:11           0 us     CompactionJob::FinishCompactionOutputFile
   140116282638400    Low Pri

Reviewers: rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34683
2015-03-13 10:45:40 -07:00
Islam AbdelRahman 1d43bc41fb Fixing segmentation fault in db_bench
Summary:
Fixing segmentation fault when running db_bench

This seg fault happens because num_created is used without being initialized

Test Plan:
running db_bench using these arguments
bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=0; sync=0; r=1000000; t=1; vs=800; bs=65536; cs=1048576; of=500000; si=1000000; ./db_bench --benchmarks=overwrite --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=$r --threads=$t --value_size=$vs --block_size=$bs --cache_size=$cs --bloom_bits=10 --cache_numshardbits=4 --open_files=$of --verify_checksum=1 --db=/home/tec/koko/ --sync=$sync --disable_wal=1 --compression_type=zlib --stats_interval=$si --compression_ratio=0.5 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --max_write_buffer_number=$wbn --max_background_compactions=$mbc --level0_file_num_compaction_trigger=$ctrig --level0_slowdown_writes_trigger=$delay --level0_stop_writes_trigger=$stop --num_levels=$levels --delete_obsolete_files_period_micros=$del --min_level_to_compress=$mcz --max_grandparent_overlap_factor=$overlap --stats_per_interval=1 --max_bytes_for_level_base=$bpl --use_existing_db=1

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34881
2015-03-11 17:57:16 -07:00
sdong 2884b100ba db_bench: Better way to randomize repeated read keys in -read_random_exp_range
Summary: Use a better way to map from a key with locality to a random location. Now with the same -read_random_exp_range setting, hit rate drops, which it is expected.

Test Plan: ./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=<multiple_values>

Reviewers: MarkCallaghan, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D34761
2015-03-11 11:46:14 -07:00
Yueh-Hsuan Chiang 89597bb66b Allow GetThreadList() to report the start time of the current operation.
Summary: Allow GetThreadList() to report the start time of the current operation.

Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
  --max_background_compactions=10 --max_background_flushes=3 \
  --thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
  --num_column_families=10

Sample output:
          ThreadID ThreadType                    cfName    Operation        OP_StartTime         State
   140338840797248   High Pri column_family_name_000003        Flush 2015/03/09-17:49:59
   140338844991552   High Pri column_family_name_000004        Flush 2015/03/09-17:49:59
   140338849185856    Low Pri
   140338983403584    Low Pri
   140339008569408    Low Pri
   140338861768768    Low Pri
   140338924683328    Low Pri
   140338899517504    Low Pri
   140338853380160    Low Pri
   140338882740288    Low Pri
   140338865963072   High Pri column_family_name_000006        Flush 2015/03/09-17:49:59
   140338954043456    Low Pri
   140338857574464    Low Pri

Reviewers: igor, rven, sdong

Reviewed By: sdong

Subscribers: lgalanis, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34689
2015-03-10 14:51:28 -07:00
sdong 37921b4997 db_bench: Add Option -read_random_exp_range to allow read skewness.
Summary: Introduce parameter -read_random_exp_range in db_bench to provide some key skewness in readrandom and multireadrandom benchmarks. It will helpful to cover block cache better.

Test Plan:
Run benchmarks with this new parameter. I can clearly see block cache hit rate change while I increase this value (DB size is about 66MB):

./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=0.0
rocksdb.block.cache.data.miss COUNT : 958418
rocksdb.block.cache.data.hit COUNT : 41582

./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=5.0
rocksdb.block.cache.data.miss COUNT : 819518
rocksdb.block.cache.data.hit COUNT : 180482

./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=10.0
rocksdb.block.cache.data.miss COUNT : 450479
rocksdb.block.cache.data.hit COUNT : 549521

./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=20.0
rocksdb.block.cache.data.miss COUNT : 223192
rocksdb.block.cache.data.hit COUNT : 776808

Reviewers: MarkCallaghan, kradhakrishnan, yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D34629
2015-03-09 11:34:52 -07:00
Yueh-Hsuan Chiang dc4532c497 Add --thread_status_per_interval to db_bench
Summary:
Add --thread_status_per_interval to db_bench, which allows
db_bench to optionally enable print the current thread status
periodically.

Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 --max_background_compactions=10 --max_background_flushes=3 --thread_status_per_interval=1000 --key_size=16 --value_size=1000 --num_column_families=10

Sample output:
          ThreadID ThreadType                         dbName                     cfName       Operation           State
   140281571770432    Low Pri
   140281575964736   High Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000001           Flush
   140281710182464    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000008      Compaction
   140281638879296    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000007      Compaction
   140281592741952    Low Pri
   140281580159040   High Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000002           Flush
   140281676628032    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000006      Compaction
   140281584353344    Low Pri
   140281622102080    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000009      Compaction
   140281605324864    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000004      Compaction
   140281601130560   High Pri  /tmp/rocksdbtest-5297/dbbench                    default           Flush
   140281596936256    Low Pri
   140281588547648    Low Pri

Reviewers: igor, rven, sdong

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34515
2015-03-06 11:22:06 -08:00
Igor Canadi db03739340 options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.

In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.

Test Plan: New unit tests and pass tests suites including valgrind.

Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo

Reviewed By: ikabiljo

Subscribers: yoshinorim, ikabiljo, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31437
2015-03-02 22:40:41 -08:00
Igor Sugak 62247ffa3b rocksdb: Add missing override
Summary:
When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes.

Prerequisites: bear and clang 3.5 build with extra tools

```lang=bash
% USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html
% clang-modernize -p . -include . -add-override
% make format
```

Test Plan:
Make sure all tests are passing.
```lang=bash
% #Use default fb code clang.
% make check
```
Verify less error and no missing override errors.
```lang=bash
% # Have trunk clang present in path.
% ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make
```

Reviewers: igor, kradhakrishnan, rven, meyering, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34077
2015-02-26 11:28:41 -08:00
Mark Callaghan 182b4ceacd Limit key range to number of keys, not number of writes
Summary:
An old commit (482401) changed DoWrite to use the value of --writes rather
than --num to determine the range for keys. This restores the old and correct
behavior which is to limit it using --num.

Task ID: #6353043

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34065
2015-02-25 15:53:45 -08:00
Igor Sugak 73711f956c rocksdb: Fix scan-build bug 'Memory leak' in db/db_bench.cc
Summary:
The bug is detected by scan-build.

In `void WriteSeqSeekSeq(ThreadState* thread)` memory is allocated in line 3118 `Slice key = AllocateKey();` but `Slice` is not responsible deleting `Slice::data()`.

Added `std::unique_ptr<const char[]>*` parameter to ` AllocateKey()`, so that it requires caller to not forget about Slice::data() management.

scan-build bug report: http://home.fburl.com/~sugak/latest6/report-6e9754.html#EndPath

Test Plan:
Make sure scan-build does not report 'Memory leak' in db/db_bench.cc and all tests are passing.
```lang=bash
% make analyze
% make check
```

Reviewers: lgalanis, igor, meyering, sdong

Reviewed By: meyering, sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33501
2015-02-19 14:27:48 -08:00
Igor Canadi ae82849bc9 Fix build failure 2015-01-21 18:23:12 -08:00
Igor Canadi 423dee8418 Abort db_bench if Get() returns error
Summary:
I saw this when running readrandom benchmark with corrupted database -- benchmark worked!

If a Get() returns corruption we should probably abort.

Test Plan: compiles

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31701
2015-01-21 18:18:15 -08:00
Igor Canadi 9ab5adfc59 New BlockBasedTable version -- better compressed block format
Summary:
This diff adds BlockBasedTable format_version = 2. New format version brings better compressed block format for these compressions:
1) Zlib -- encode decompressed size in compressed block header
2) BZip2 -- encode decompressed size in compressed block header
3) LZ4 and LZ4HC -- instead of doing memcpy of size_t encode size as varint32. memcpy is very bad because the DB is not portable accross big/little endian machines or even platforms where size_t might be 8 or 4 bytes.

It does not affect format for snappy.

If you write a new database with format_version = 2, it will not be readable by RocksDB versions before 3.10. DB::Open() will return corruption in that case.

Test Plan:
Added a new test in db_test.
I will also run db_bench and verify VSIZE when block_cache == 1GB

Reviewers: yhchiang, rven, MarkCallaghan, dhruba, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31461
2015-01-14 16:24:24 -08:00
Igor Canadi 15d2abbec3 Fix build issues 2015-01-09 13:04:06 -08:00
Leonidas Galanis 9d5bd411be benchmark.sh won't run through all tests properly if one specifies wal_dir to be different than db directory.
Summary:
A command line like this to run all the tests:
source benchmark.config.sh && nohup ./benchmark.sh 'bulkload,fillseq,overwrite,filluniquerandom,readrandom,readwhilewriting'
where
benchmark.config.sh is:
export DB_DIR=/data/mysql/rocksdata
export WAL_DIR=/txlogs/rockswal
export OUTPUT_DIR=/root/rocks_benchmarking/output

Will fail for the tests that need a new DB .

Also 1) set disable_data_sync=0 and 2) add debug mode to run through all the tests more quickly

Test Plan: run ./benchmark.sh 'debug,bulkload,fillseq,overwrite,filluniquerandom,readrandom,readwhilewriting' and verify that there are no complaints about WAL dir not being empty.

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D30909
2015-01-05 15:36:47 -08:00
sdong e9ca358157 Fix CLANG build for db_bench
Summary: CLANG was broken for a recent change in db_ench. Fix it.

Test Plan: Build db_bench using CLANG.

Reviewers: rven, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30801
2014-12-30 18:33:49 -08:00
sdong a801c1fb09 db_bench --num_hot_column_families to be default off
Summary: Having --num_hot_column_families default on fails some existing regression tests. By default turn it off

Test Plan: Run db_bench to make sure it is default off.

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D30705
2014-12-24 09:00:23 -08:00
sdong ddc81440d5 db_bench to add an option as number of hot column families to add to
Summary:
Add option --num_hot_column_families in db_bench. If it is set, write options will first write to that number of column families, and then move on to next set of hot column families. The working set of column families can be smaller than total number of CFs.

It is to test how RocksDB can handle cold column families

Test Plan: Run db_bench with  --num_hot_column_families set and not set.

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30663
2014-12-23 16:52:05 -08:00
Igor Canadi 0acc738810 Speed up FindObsoleteFiles()
Summary:
There are two versions of FindObsoleteFiles():
* full scan, which is executed every 6 hours (and it's terribly slow)
* no full scan, which is executed every time a background process finishes and iterator is deleted

This diff is optimizing the second case (no full scan). Here's what we do before the diff:
* Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live.
* Get the list of live files to avoid deleting files that are live.
* Delete files that are in obsolete_files and not in live_files.

After this diff:
* The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files.
* Get the list of obsolete files (which exclude moved files).
* No need to get the list of live files, since all files in obsolete_files need to be deleted.

I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123

This depends on D30123.

P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff.

Test Plan:
One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version.

make check

Big number of compactions and flushes:

  ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30249
2014-12-22 12:04:45 +01:00
Leonidas Galanis 635c61fd3b Fix problem with create_if_missing option when wal_dir is used
Summary: When wal_dir is used, DestroyDB is not passed the wal_dir option and so we get a Corruption exception.

Test Plan:
Verified manually that the following command line works now:
./db_bench --db=/mnt/db/rocksdb ... --disable_wal=0 --wal_dir=/data/users/rocksdb/WAL... --benchmarks=filluniquerandom --use_existing_db=0...

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29859
2014-12-08 12:53:24 -08:00
Jonah Cohen a14b7873ee Enforce write buffer memory limit across column families
Summary:
Introduces a new class for managing write buffer memory across column
families.  We supplement ColumnFamilyOptions::write_buffer_size with
ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
instance that enforces memory limits before flushing out to disk.

Test Plan: Added SharedWriteBuffer unit test to db_test.cc

Reviewers: sdong, rven, ljin, igor

Reviewed By: igor

Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim

Differential Revision: https://reviews.facebook.net/D22581
2014-12-02 12:09:20 -08:00
Yueh-Hsuan Chiang 13de000f07 Add rocksdb::ToString() to address cases where std::to_string is not available.
Summary:
In some environment such as android, the c++ library does not have
std::to_string.  This path adds rocksdb::ToString(), which wraps std::to_string
when std::to_string is not available, and implements std::to_string
in the other case.

Test Plan:
make dbg -j32
./db_test
make clean
make dbg OPT=-DOS_ANDROID -j32
./db_test

Reviewers: ljin, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29181
2014-11-24 20:44:49 -08:00
sdong 3a40c427b9 Fix db_bench on CLANG mode
Summary: "build all" breaks in Clang mode with db_bench. Fix it.

Test Plan: USE_CLANG=1 make all

Reviewers: ljin, rven, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D29379
2014-11-21 11:30:22 -08:00
Igor Canadi cd278584c9 Clean up StringSplit
Summary: stringSplit is not how we name our functions. Also, we had two StringSplit's in the codebase

Test Plan: make check

Reviewers: yhchiang, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29361
2014-11-21 11:05:28 -05:00
sdong a177742a9b Make db_stress built for ROCKSDB_LITE
Summary:
Make db_stress built for ROCKSDB_LITE.
The test doesn't pass tough. It seg fault quickly. But I took a look and it doesn't seem to be related to lite version. Likely to be a bug inside RocksDB.

Test Plan: make db_stress

Reviewers: yhchiang, rven, ljin, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D28797
2014-11-14 10:20:51 -08:00
Igor Canadi 767777c2bd Turn on -Wshorten-64-to-32 and fix all the errors
Summary:
We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details.

This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables.

Test Plan: compiles

Reviewers: ljin, rven, yhchiang, sdong

Reviewed By: yhchiang

Subscribers: bobbaldwin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28689
2014-11-11 16:47:22 -05:00
Tomislav Novak 35c8c814e8 Make ForwardIterator::status() more efficient
Summary:
In D19581 I made `ForwardIterator::status()` check all child iterators,
including immutable ones. It's, however, not necessary to do it every
time -- it'll suffice to check only when they're used and their status
could change.

This diff:
* introduces `immutable_status_` which is updated by `Seek()` and `Next()`
* removes special handling of `kIncomplete` status in those methods

Test Plan:
* `db_test`
* hacked ReadSequential in db_bench.cc to check `status()` in addition to
  validity:

```
   $ ./db_bench -use_existing_db -benchmarks readseq -disable_auto_compactions \
      -use_tailing_iterator  # without this patch
   Keys:       16 bytes each
   Values:     100 bytes each (50 bytes after compression)
   Entries:    1000000
   [...]
   DB path: [/dev/shm/rocksdbtest/dbbench]
   readseq      :       0.562 micros/op 1778103 ops/sec;   98.4 MB/s
   $ ./db_bench -use_existing_db -benchmarks readseq -disable_auto_compactions \
      -use_tailing_iterator  # with the patch
   readseq      :       0.433 micros/op 2311363 ops/sec;  127.8 MB/s
```

Reviewers: igor, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb, march, lovro

Differential Revision: https://reviews.facebook.net/D24063
2014-11-10 15:44:20 -08:00
Shi Feng ea18b944a7 Add db_bench option --report_file_operations
Summary: Add db_bench option --report_file_operations

Test Plan:
./db_bench --report_file_operations
Observe outputs on # of file operations

Reviewers: ljin, MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: yhchiang, rven, igor, dhruba

Differential Revision: https://reviews.facebook.net/D27945
2014-11-05 18:40:18 -08:00
Igor Canadi b680033e63 Include atomic in env_test 2014-10-27 15:41:05 -07:00
Igor Canadi 48842ab316 Deprecate AtomicPointer
Summary: RocksDB already depends on C++11, so we might as well all the goodness that C++11 provides. This means that we don't need AtomicPointer anymore. The less things in port/, the easier it will be to port to other platforms.

Test Plan: make check + careful visual review verifying that NoBarried got memory_order_relaxed, while Acquire/Release methods got memory_order_acquire and memory_order_release

Reviewers: rven, yhchiang, ljin, sdong

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D27543
2014-10-27 14:50:21 -07:00
sdong 2a8e5203d8 db_bench: --batch_size used for write benchmarks too
Summary: Now --bench_size is only used in multireadrandom tests, although the codes allow it to run in all write tests. I don't see a reason why we can't enable it.

Test Plan:
Run
   ./db_bench -benchmarks multirandomwrite --threads=5 -batch_size=16
and see the stats printed out in LOG to make sure batching really happened.

Reviewers: ljin, yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D25509
2014-10-22 18:54:12 -07:00
sdong 5bfb7f5d0b db_bench: seekrandom can specify --seek_nexts to read specific keys after seek.
Summary:
Add a function as tittle.
Also use the same parameter to fillseekseq too.

Test Plan: Run seekrandom using the new parameter

Reviewers: ljin, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: rven, igor, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D25035
2014-10-20 11:55:33 -07:00
sdong b7d3d6ebc5 db_bench: set thread pool size according to max_background_flushes
Summary: option max_background_flushes doesn't make sense if thread pool size is not set accordingly. Set the thread pool size as what we do for max_background_compactions.

Test Plan: Run db_bench with max_background_flushes > 1

Reviewers: yhchiang, igor, rven, ljin

Reviewed By: ljin

Subscribers: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D24717
2014-10-09 20:38:15 -07:00
Tomislav Novak 88edfd90ae SkipListRep::LookaheadIterator
Summary:
This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an
optimization for the tailing use case which includes many seeks. E.g. consider
the following operations on a skip list iterator:

   Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ...

If `lookahead` is positive, `SkipListRep` will return an iterator which also
keeps track of the previously visited node. Seek() then first does a linear
search starting from that node (up to `lookahead` steps). As in the tailing
example above, this may require fewer than ~log(n) comparisons as with regular
skip list search.

Test Plan:
Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It
first writes N records (with consecutive keys), then measures how much time it
takes to read them by calling `Seek()` and `Next()`.

   $ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \
      -key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \
      -seekseq_next 2 -skip_list_lookahead=0
   [...]
   DB path: [/dev/shm/rocksdbtest/dbbench]
   fillseekseq  :       0.389 micros/op 2569047 ops/sec;

   real    0m21.806s
   user    0m12.106s
   sys     0m9.672s

   $ time ./db_bench [...] -skip_list_lookahead=2
   [...]
   DB path: [/dev/shm/rocksdbtest/dbbench]
   fillseekseq  :       0.153 micros/op 6540684 ops/sec;

   real    0m19.469s
   user    0m10.192s
   sys     0m9.252s

Reviewers: ljin, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb, march, lovro

Differential Revision: https://reviews.facebook.net/D23997
2014-10-07 11:48:23 -07:00
mike@arpaia.co f0f7955497 Fixing comile errors on OS X
Summary: Building master on OS X has some compile errors due to implicit type conversions which generate warnings which RocksDB's build settings raise as errors.

Test Plan: It compiles!

Reviewers: ljin, igor

Reviewed By: ljin

Differential Revision: https://reviews.facebook.net/D24135
2014-09-29 16:05:25 -07:00
Mark Callaghan 747523d241 Print per column family metrics in db_bench
Summary: see above

Test Plan:
make check, ran db_bench and looked at output

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Differential Revision: https://reviews.facebook.net/D24189
2014-09-29 15:47:05 -07:00
Lei Jin fbd2dafc9f CompactedDBImpl::MultiGet() for better CuckooTable performance
Summary:
Add the MultiGet API to allow prefetching.
With file size of 1.5G, I configured it to have 0.9 hash ratio that can
fill With 115M keys and result in 2 hash functions, the lookup QPS is
~4.9M/s  vs. 3M/s for Get().
It is tricky to set the parameters right. Since files size is determined
by power-of-two factor, that means # of keys is fixed in each file. With
big file size (thus smaller # of files), we will have more chance to
waste lot of space in the last file - lower space utilization as a
result. Using smaller file size can improve the situation, but that
harms lookup speed.

Test Plan: db_bench

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23673
2014-09-25 13:34:51 -07:00
Lei Jin 3c68006109 CompactedDBImpl
Summary:
Add a CompactedDBImpl that will enabled when calling OpenForReadOnly()
and the DB only has one level (>0) of files. As a performan comparison,
CuckooTable performs 2.1M/s with CompactedDBImpl vs. 1.78M/s with
ReadOnlyDBImpl.

Test Plan: db_bench

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23553
2014-09-25 11:14:01 -07:00
Lei Jin 57a32f147f change target_file_size_base to uint64_t
Summary: It contrains the file size to be 4G max with int

Test Plan:
tried to grep instance and made sure other related variables are also
uint64

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23697
2014-09-22 11:15:03 -07:00
Lei Jin 51af7c326c CuckooTable: add one option to allow identity function for the first hash function
Summary:
MurmurHash becomes expensive when we do millions Get() a second in one
thread. Add this option to allow the first hash function to use identity
function as hash function. It results in QPS increase from 3.7M/s to
~4.3M/s. I did not observe improvement for end to end RocksDB
performance. This may be caused by other bottlenecks that I will address
in a separate diff.

Test Plan:
```
[ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=0
==== Test CuckooReaderTest.WhenKeyExists
==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator
==== Test CuckooReaderTest.CheckIterator
==== Test CuckooReaderTest.CheckIteratorUint64
==== Test CuckooReaderTest.WhenKeyNotFound
==== Test CuckooReaderTest.TestReadPerformance
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.272us (3.7 Mqps) with batch size of 0, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.138us (7.2 Mqps) with batch size of 10, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.142us (7.1 Mqps) with batch size of 25, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.142us (7.0 Mqps) with batch size of 50, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.144us (6.9 Mqps) with batch size of 100, # of found keys 125829120

With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.201us (5.0 Mqps) with batch size of 0, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.121us (8.3 Mqps) with batch size of 10, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.123us (8.1 Mqps) with batch size of 25, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.121us (8.3 Mqps) with batch size of 50, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.112us (8.9 Mqps) with batch size of 100, # of found keys 104857600

With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.251us (4.0 Mqps) with batch size of 0, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.107us (9.4 Mqps) with batch size of 10, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.099us (10.1 Mqps) with batch size of 25, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.100us (10.0 Mqps) with batch size of 50, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.116us (8.6 Mqps) with batch size of 100, # of found keys 83886080

With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.189us (5.3 Mqps) with batch size of 0, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.095us (10.5 Mqps) with batch size of 10, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.096us (10.4 Mqps) with batch size of 25, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.098us (10.2 Mqps) with batch size of 50, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.105us (9.5 Mqps) with batch size of 100, # of found keys 73400320

[ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=1
==== Test CuckooReaderTest.WhenKeyExists
==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator
==== Test CuckooReaderTest.CheckIterator
==== Test CuckooReaderTest.CheckIteratorUint64
==== Test CuckooReaderTest.WhenKeyNotFound
==== Test CuckooReaderTest.TestReadPerformance
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.230us (4.3 Mqps) with batch size of 0, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.086us (11.7 Mqps) with batch size of 10, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.088us (11.3 Mqps) with batch size of 25, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.083us (12.1 Mqps) with batch size of 50, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.083us (12.1 Mqps) with batch size of 100, # of found keys 125829120

With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.159us (6.3 Mqps) with batch size of 0, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 10, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.080us (12.6 Mqps) with batch size of 25, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.080us (12.5 Mqps) with batch size of 50, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.082us (12.2 Mqps) with batch size of 100, # of found keys 104857600

With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.154us (6.5 Mqps) with batch size of 0, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.077us (13.0 Mqps) with batch size of 10, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.077us (12.9 Mqps) with batch size of 25, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 50, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.079us (12.6 Mqps) with batch size of 100, # of found keys 83886080

With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.218us (4.6 Mqps) with batch size of 0, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.083us (12.0 Mqps) with batch size of 10, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.085us (11.7 Mqps) with batch size of 25, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.086us (11.6 Mqps) with batch size of 50, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 100, # of found keys 73400320
```

Reviewers: sdong, igor, yhchiang

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23451
2014-09-18 11:00:48 -07:00
Feng Zhu 0af157f9bf Implement full filter for block based table.
Summary:
1. Make filter_block.h a base class. Derive block_based_filter_block and full_filter_block. The previous one is the traditional filter block. The full_filter_block is newly added. It would generate a filter block that contain all the keys in SST file.

2. When querying a key, table would first check if full_filter is available. If not, it would go to the exact data block and check using block_based filter.

3. User could choose to use full_filter or tradional(block_based_filter). They would be stored in SST file with different meta index name. "filter.filter_policy" or "full_filter.filter_policy". Then, Table reader is able to know the fllter block type.

4. Some optimizations have been done for full_filter_block, thus it requires a different interface compared to the original one in filter_policy.h.

5. Actual implementation of filter bits coding/decoding is placed in util/bloom_impl.cc

Benchmark: base commit 1d23b5c470
Command:
db_bench --db=/dev/shm/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --write_buffer_size=134217728 --max_write_buffer_number=2 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --verify_checksum=false --max_background_compactions=4 --use_plain_table=0 --memtablerep=prefix_hash --open_files=-1 --mmap_read=1 --mmap_write=0 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_bits=500000 --compression_type=lz4 --num=393216000 --use_hash_search=1 --block_size=1024 --block_restart_interval=16 --use_existing_db=1 --threads=1 --benchmarks=readrandom —disable_auto_compactions=1
Read QPS increase for about 30% from 2230002 to 2991411.

Test Plan:
make all check
valgrind db_test
db_stress --use_block_based_filter = 0
./auto_sanity_test.sh

Reviewers: igor, yhchiang, ljin, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20979
2014-09-08 10:37:05 -07:00
Igor Canadi 8de151bb99 Add db_bench with lots of column families to regression tests
Summary:
That way we can see when this graph goes up and be happy.

Couple of changes:
1. title
2. fix db_bench to delete column families before deleting the DB. this was asserting when compiled in debug mode
3. don't sync manifest when disableDataSync. We discussed this offline. I can move it to separate diff if you'd like

Test Plan: ran it

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22815
2014-09-05 14:20:18 -07:00
liuhuahang bb6ae0f80c fix more compile warnings
N/A

Change-Id: I5b6f9c70aea7d3f3489328834fed323d41106d9f
Signed-off-by: liuhuahang <liuhuahang@zerus.co>
2014-09-05 14:14:37 +08:00
Radheshyam Balasundaram 4142a3e783 Adding a user comparator for comparing Uint64 slices.
Summary:
- New Uint64 comparator
- Modify Reader and Builder to take custom user comparators instead of bytewise comparator
- Modify logic for choosing unused user key in builder
- Modify iterator logic in reader
- test changes

Test Plan:
cuckoo_table_{builder,reader,db}_test
make check all

Reviewers: ljin, sdong

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D22377
2014-08-27 10:39:31 -07:00
Lei Jin 384400128f move block based table related options BlockBasedTableOptions
Summary:
I will move compression related options in a separate diff since this
diff is already pretty lengthy.
I guess I will also need to change JNI accordingly :(

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21915
2014-08-25 14:22:05 -07:00
Radheshyam Balasundaram 162b8151f1 Adding Column Family support in db_bench.
Summary:
Adding num_column_families flag. Adding support for column families in DoWrite and ReadRandom methods.
[Igor, please let me know if this approach sounds good. I shall add it to other methods too.]

Test Plan: Ran fillseq on 1M keys and 10 Column families and ran readrandom.

Reviewers: sdong, yhchiang, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21387
2014-08-18 18:15:01 -07:00
Radheshyam Balasundaram 36e759d199 Adding Cuckoo Table SST option to db_bench
Summary: Adding flags to use cuckoo table SST in db_bench.cc

Test Plan: Ran benchmark with fillseq and readrandom

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21729
2014-08-18 11:59:38 -07:00
Lei Jin 58c49466d2 Allow env_posix to lower background thread IO priority
Summary: This is a linux-specific system call.

Test Plan: ran db_bench

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: haobo, leveldb

Differential Revision: https://reviews.facebook.net/D21183
2014-08-13 20:49:58 -07:00
Feng Zhu d3f2ec694f check prefix_size when using hash search in db_bench
Summary:
1. Check prefix_size when enable use_hash_search in db_bench
2. Remove include/statistics.h in db_bench

Test Plan: ./db_bench --use_hash_search=1

Reviewers: ljin, yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21375
2014-08-11 10:47:52 -07:00
Feng Zhu 50c2dcb78f add options.block_restart_interval in db_bench
Summary:
  Add block_restart_interval in db_bench, default value 16

Test Plan:
  make

Reviewers: sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D20331
2014-07-21 12:01:40 -07:00
Stanislau Hlebik 92d73cbe78 Add PlainTableOptions
Summary:
Since we have a lot of options for PlainTable, add a struct PlainTableOptions
to manage them

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20175
2014-07-18 00:08:38 -07:00
Feng Zhu bc6b2ab401 enable kHashSearch for blocktable in db_bench
Summary:
  add a flag called use_hash_search in db_bench

Test Plan:
  make all check
  ./db_bench --use_hash_search=1

Reviewers: ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, dhruba

Differential Revision: https://reviews.facebook.net/D20067
2014-07-16 17:32:30 -07:00
Radheshyam Balasundaram f0660d5253 Adding NUMA support to db_bench tests
Summary:
Changes:
- Adding numa_aware flag to db_bench.cc
- Using numa.h library to bind memory and cpu of threads to a fixed NUMA node
Result: There seems to be no significant change in the micros/op time with numa_aware enabled. I also tried this with other implementations, including a combination of pthread_setaffinity_np, sched_setaffinity and set_mempolicy methods. It'd be great if someone could point out where I'm going wrong and if we can achieve a better micors/op.

Test Plan:
Ran db_bench tests using following command:
./db_bench --db=/mnt/tmp --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=/mnt/tmp --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --duration=300 --benchmarks=readwhilewriting --use_existing_db=1 --num=157286400 --threads=24 --writes_per_second=10240 --numa_aware=[False/True]

The tests were run in private devserver with 24 cores and the db was prepopulated using filluniquerandom test. The tests resulted in 0.145 us/op with numa_aware=False and 0.161 us/op with numa_aware=True.

Reviewers: sdong, yhchiang, ljin, igor

Reviewed By: ljin, igor

Subscribers: igor, leveldb

Differential Revision: https://reviews.facebook.net/D19353
2014-07-07 10:53:31 -07:00
Yueh-Hsuan Chiang faa8d21922 Improve an assertion in RandomGenerator::Generate() in db_bench.
Summary:
RandomGenerator::Generate() currently has an assertion len < data_.size().
However, it is actually fine to have len == data_.size().
This diff change the assertion to len <= data_.size().

Test Plan:
make db_bench
./db_bench

Reviewers: haobo, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19269
2014-06-24 15:29:28 -06:00
Lei Jin 3b0dc76699 db_bench: measure the real latency of write/delete
Summary: as title

Test Plan: make release

Reviewers: haobo, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19227
2014-06-23 13:23:02 -07:00
Lei Jin a1b5650a75 db_bench: sanity check on compression ratio
Summary: as requested by mark

Test Plan: make release

Reviewers: sdong, haobo

Reviewed By: haobo

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19221
2014-06-23 10:46:16 -07:00
Igor Canadi d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
Lei Jin 388d2054c7 forward iterator
Summary:
Forward iterator puts everything together in a flat structure instead of
a hierarchy of nested iterators. this should simplify the code and
provide better performance. It also enables more optimization since all
information are accessiable in one place.
Init evaluation shows about 6% improvement

Test Plan: db_test and db_bench

Reviewers: dhruba, igor, tnovak, sdong, haobo

Reviewed By: haobo

Subscribers: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D18795
2014-05-30 14:31:55 -07:00
Lei Jin f29c62fc6f add an iterator refresh option for SeekRandom
Summary: One more option to allow iterator refreshing when using normal iterator

Test Plan: ran db_bench

Reviewers: haobo, sdong, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18849
2014-05-30 14:09:22 -07:00
sdong 4e0602f941 Remove maximum key_size check in db_bench
Summary: Key size limit doesn't seem to be applicable anymore. Remove it.

Test Plan: run a couple of tests in db_bench

Reviewers: haobo, igor, yhchiang, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18723
2014-05-15 11:06:37 -07:00
Igor Canadi a1068c91a1 Make RocksDB work with newer gflags
Summary:
Newer gflags switched from `google` namespace to `gflags` namespace. See: https://github.com/facebook/rocksdb/issues/139 and https://github.com/facebook/rocksdb/issues/102

Unfortunately, they don't define any macro with their namespace, so we need to actually try to compile gflags with two different namespace to figure out which one is the correct one.

Test Plan: works in fbcode environemnt. I'll also try in ubutnu with newer gflags

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18537
2014-05-08 17:25:13 -07:00
Igor Canadi 0afc8bc29a xxHash
Summary:
Originally: https://github.com/facebook/rocksdb/pull/87/files

I'm taking over to apply some finishing touches

Test Plan: will add tests

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18315
2014-05-01 14:09:32 -04:00
Yueh-Hsuan Chiang 9d9d2965cb Add a new mem-table representation based on cuckoo hash.
Summary:
= Major Changes =
* Add a new mem-table representation, HashCuckooRep, which is based cuckoo hash.
  Cuckoo hash uses multiple hash functions.  This allows each key to have multiple
  possible locations in the mem-table.

  - Put: When insert a key, it will try to find whether one of its possible
    locations is vacant and store the key.  If none of its possible
    locations are available, then it will kick out a victim key and
    store at that location.  The kicked-out victim key will then be
    stored at a vacant space of its possible locations or kick-out
    another victim.  In this diff, the kick-out path (known as
    cuckoo-path) is found using BFS, which guarantees to be the shortest.

 - Get: Simply tries all possible locations of a key --- this guarantees
   worst-case constant time complexity.

 - Time complexity: O(1) for Get, and average O(1) for Put if the
   fullness of the mem-table is below 80%.

 - Default using two hash functions, the number of hash functions used
   by the cuckoo-hash may dynamically increase if it fails to find a
   short-enough kick-out path.

 - Currently, HashCuckooRep does not support iteration and snapshots,
   as our current main purpose of this is to optimize point access.

= Minor Changes =
* Add IsSnapshotSupported() to DB to indicate whether the current DB
  supports snapshots.  If it returns false, then DB::GetSnapshot() will
  always return nullptr.

Test Plan:
Run existing tests.  Will develop a test specifically for cuckoo hash in
the next diff.

Reviewers: sdong, haobo

Reviewed By: sdong

CC: leveldb, dhruba, igor

Differential Revision: https://reviews.facebook.net/D16155
2014-04-29 17:13:46 -07:00
Igor Canadi 38693d99c4 Fix more signed/unsigned comparsions 2014-04-29 12:40:18 -07:00
Lei Jin 3995e801ab kill ReadOptions.prefix and .prefix_seek
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17805
2014-04-25 12:21:34 -07:00
Igor Canadi f9f8965e96 Print out stack trace in mac, too
Summary: While debugging Mac-only issue with ThreadLocalPtr, this was very useful. Let's print out stack trace in MAC OS, too.

Test Plan: Verified that somewhat useful stack trace was generated on mac. Will run PrintStack() on linux, too.

Reviewers: ljin, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18189
2014-04-23 09:11:35 -04:00
Igor Canadi 8dc34364d2 Rename "benchmark" back to "bench".
Also, make `benchharness.cc` not compiled into rocksdb library.
2014-04-21 13:12:15 -07:00
Pratyush Seth ff1b5df4c6 Added benchmark functionality on the lines of folly/Benchmark.h
Summary: Added benchmark functionality on the lines of folly/Benchmark.h

Test Plan: Added unit tests

Reviewers: igor, haobo, sdong, ljin, yhchiang, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17973
2014-04-21 12:29:55 -07:00
sdong c87ed0942c Fix db_bench's multireadrandom
Summary: multireadrandom is broken. Fix it

Test Plan: run it and see segfault has gone.

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17781
2014-04-14 15:43:34 -07:00
sdong d5e087b6df db_bench: add a mode to operate multiple DBs
Summary: This patch introduces a new parameter num_multi_db in db_bench. When this parameter is larger than 1, multiple DBs will be created. In all benchmarks, any operation applies to a random DB among them. This is to benchmark the performance of similar applications.

Test Plan: run db_bench on both of num_multi_db=0 and more.

Reviewers: haobo, ljin, igor

Reviewed By: igor

CC: igor, yhchiang, dhruba, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17769
2014-04-11 16:59:08 -07:00
Lei Jin 0af36d6aa6 SeekRandomWhileWriting
Summary: as title

Test Plan: ran it

Reviewers: igor, haobo, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17751
2014-04-11 09:47:20 -07:00
Lei Jin 7a92537fc4 db_bench: add IteratorCreationWhileWriting mode and allow prefix_seek
Summary: as title

Test Plan: ran it

Reviewers: igor, haobo, yhchiang

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17655
2014-04-10 10:15:59 -07:00
Igor Canadi 4daea66343 Turn on -Wmissing-prototypes
Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes.

Test Plan: compiles

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17649
2014-04-09 21:17:14 -07:00
Lei Jin 4824014e3b speed up db_bench filluniquerandom mode
Summary:
filluniquerandom is painfully slow due to the naive bitmap check to find
out if a key has been seen before. Majority of time is spent on searching
the last few keys. Split a giant BitSet to smaller ones so that we can
quickly check if a BitSet is full and thus can skip quickly.

It used to take over one hour to filluniquerandom for 100M keys, now it
takes about 10 mins.

Test Plan:
unit test
also verified correctness in db_bench and make sure all keys are
generated

Reviewers: igor, haobo, yhchiang

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17607
2014-04-09 11:25:21 -07:00
Igor Canadi 34455deb06 Fix Mac OS compile issues 2014-04-08 14:05:53 -07:00
Lei Jin 0c1126d4cf db_bench cleanup
Summary:
clean up the db_bench a little bit. also avoid allocating memory for key
in the loop

Test Plan:
I verified a run with filluniquerandom & readrandom. Iterator seek will be used lot
to measure performance. Will fix whatever comes up

Reviewers: haobo, igor, yhchiang

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17559
2014-04-08 11:21:09 -07:00
Lei Jin c90d446ee7 make hash_link_list Node's key space consecutively followed at the end
Summary: per sdong's request, this will help processor prefetch on n->key case.

Test Plan: make all check

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17415
2014-04-04 15:37:28 -07:00
Lei Jin 0d755fff14 cache friendly blocked bloomfilter
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.

Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study

Test Plan: benchmarked this change substantially. Will run make all check as well

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17133
2014-03-28 09:21:20 -07:00
Lei Jin 8d007b4aaf Consolidate SliceTransform object ownership
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.

Test Plan: db_bench && make asan_check

Reviewers: haobo, igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16587
2014-03-10 12:56:46 -07:00
Lei Jin 04298f8c33 output perf_context in db_bench readrandom
Summary:
Add helper function to print perf context data in db_bench if enabled.
I didn't find any code that actually exports perf context data. Not sure
if I missed anything

Test Plan: ran db_bench

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16575
2014-03-05 10:32:54 -08:00
Lei Jin 64138b5d9c fix db_bench to use HashSkipList for real
Summary:
For HashSkipList case, DBImpl has sanity check to see if prefix_extractor in
options is the same as the one in memtable factory. If not, it falls
back to SkipList. As result, I was experimenting with SkipList
performance. No wonder it is much worse than LinkedList

Test Plan: ran benchmark

Reviewers: haobo, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16569
2014-03-05 10:28:53 -08:00
Lei Jin 51560ba755 config max_background_flushes in db_bench
Summary: as title

Test Plan: make release

Reviewers: haobo, sdong, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16437
2014-03-05 10:27:17 -08:00
Lei Jin a5b1d2f146 make key evenly distributed between 0 and FLAGS_num
Summary:
The issue is that when FLAGS_num is small, the leading bytes of the key
are padded with 0s. This makes all keys have the same prefix 00000000

Most of the changes are just to make lint happy

Test Plan: ran db_bench

Reviewers: sdong, haobo, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16317
2014-03-04 17:08:05 -08:00
Lei Jin dea894ef8d expose wal_dir in db_bench
Summary: as title

Test Plan: ran db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16269
2014-02-25 10:43:46 -08:00
Lei Jin 28b7f7faa8 enable plain table in db_bench
Summary: as title

Test Plan: ran db_bench to gather stats

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16059
2014-02-12 10:41:55 -08:00
Albert Strasheim df2f92214a Support for LZ4 compression. 2014-02-08 14:15:51 -08:00
Igor Canadi 1560bb913e Readrandom with tailing iterator
Summary:
Added an option for readrandom benchmark to run with tailing iterator instead of Get. Benefit of tailing iterator is that it doesn't require locking DB mutex on access.

I also have some results when running on my machine. The results highly depend on number of cache shards. With our current benchmark setting of 4 table cache shards and 6 block cache shards, I don't see much improvements of using tailing iterator. In that case, we're probably seeing cache mutex contention.

Here are the results for different number of shards

    cache shards       tailing iterator        get
       6                      1.38M           1.16M
      10                      1.58M           1.15M

As soon as we get rid of cache mutex contention, we're seeing big improvements in using tailing iterator vs. ordinary get.

Test Plan: ran regression test

Reviewers: dhruba, haobo, ljin, kailiu, sding

Reviewed By: haobo

CC: tnovak

Differential Revision: https://reviews.facebook.net/D15867
2014-02-07 09:47:47 -08:00
kailiu 84f8185fc0 Merge branch 'master' into performance
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/memtable.cc
2014-02-05 21:21:00 -08:00