Commit graph

504 commits

Author SHA1 Message Date
sdong 0abaed2e08 Support multiple DB directories in universal compaction style
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.

Test Plan: Add new unit tests

Reviewers: ljin, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D19869
2014-07-15 12:06:28 -07:00
Igor Canadi 20c056306b Remove stats logger
Summary: Browsing through the code, looks like StatsLogger is not used at all!

Test Plan: compiles

Reviewers: ljin, sdong, yhchiang, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19827
2014-07-15 09:16:32 -04:00
sdong dd6c444822 Improve Put()'s comment to indicate that the key is overwritten if existing
Summary: As title

Test Plan: Not needed for comment only.

Reviewers: yhchiang, ljin, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: xjin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19887
2014-07-14 14:43:59 -07:00
Reed Allman 1fc71a4b16 C API: create missing cf's, cleanup 2014-07-10 12:55:53 -07:00
sdong 01700b6911 Update master to version 3.3
Summary: As tittle

Test Plan: no need

Reviewers: igor, yhchiang, ljin

Reviewed By: ljin

Subscribers: haobo, dhruba, xjin, leveldb

Differential Revision: https://reviews.facebook.net/D19629
2014-07-10 11:59:35 -07:00
sdong 36de0e5359 Add a function to return current perf level
Summary: Add a function to return the perf level. It is to allow a wrapper of DB to increase the perf level and restore the original perf level after finishing the function call.

Test Plan: Add a verification in db_test

Reviewers: yhchiang, igor, ljin

Reviewed By: ljin

Subscribers: xjin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19551
2014-07-10 11:35:48 -07:00
Stanislau Hlebik 30c81e7717 Removing NewTotalOrderPlainTableFactory
Summary:
Seems like NewTotalOrderPlainTableFactory is useless and is semantically incorrect.
Total order mode indicator is prefix_extractor == nullptr,
but NewTotalOrderPlainTableFactory doesn't set it to be nullptr. That's why some tests
in plain_table_db_tests is incorrect.

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19587
2014-07-10 11:32:04 -07:00
Igor Canadi f0a8be253e JSON (Document) API sketch
Summary:
This is a rough sketch of our new document API. Would like to get some thoughts and comments about the high-level architecture and API.

I didn't optimize for performance at all. Leaving some low-hanging fruit so that we can be happy when we fix them! :)

Currently, bunch of features are not supported at all. Indexes can be only specified when creating database. There is no query planner whatsoever. This will all be added in due time.

Test Plan: Added a simple unit test

Reviewers: haobo, yhchiang, dhruba, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18747
2014-07-10 09:31:42 -07:00
Lei Jin 534357ca3a integrate rate limiter into rocksdb
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.

Test Plan: db_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19425
2014-07-08 12:31:49 -07:00
Lei Jin 5ef1ba7ff5 generic rate limiter
Summary:
A generic rate limiter that can be shared by threads and rocksdb
instances. Will use this to smooth out write traffic generated by
compaction and flush. This will help us get better p99 behavior on flash
storage.

Test Plan:
unit test output
==== Test RateLimiterTest.Rate
request size [1 - 1023], limit 10 KB/sec, actual rate: 10.374969 KB/sec, elapsed 2002265
request size [1 - 2047], limit 20 KB/sec, actual rate: 20.771242 KB/sec, elapsed 2002139
request size [1 - 4095], limit 40 KB/sec, actual rate: 41.285299 KB/sec, elapsed 2202424
request size [1 - 8191], limit 80 KB/sec, actual rate: 81.371605 KB/sec, elapsed 2402558
request size [1 - 16383], limit 160 KB/sec, actual rate: 162.541268 KB/sec, elapsed 3303500

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19359
2014-07-08 11:41:57 -07:00
Reed Allman fd3fb4b0bf C API: update options w/ convenience funcs & fifo compaction 2014-07-08 10:57:45 -07:00
sdong 01159aa802 stackable_db: add default function for GetLiveFilesMetaData()
Summary: stackable_db doesn't have GetLiveFilesMetaData() implemented. Add it

Test Plan: make all check

Reviewers: yhchiang, ljin, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19491
2014-07-07 17:02:57 -07:00
Igor Canadi 8a03935f8c Fix valgrind error in c_test
Summary:
External contribution caused some valgrind errors: 1a34aaaef0

This diff fixes them

Test Plan: ran valgrind

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19485
2014-07-07 14:41:54 -07:00
Evan Shaw 3f7104d7c5 C API: Allow setting compaction filter factory 2014-07-08 08:12:36 +12:00
Evan Shaw 91bede79cc C API: Add support for compaction filter factories (v1) 2014-07-08 08:12:36 +12:00
Evan Shaw 0bf5589c74 Fix compaction_filter.h typos 2014-07-08 08:11:01 +12:00
Igor Canadi 0bc5fa9f40 Merge pull request #190 from edsrzf/c-api-writebatch-serialized
C API: support constructing write batch from serialized representation
2014-07-07 10:17:43 -07:00
Reed Allman 1a34aaaef0 C API: column family support 2014-07-07 01:41:01 -07:00
Evan Shaw 9fc23d0c56 C API: support constructing write batch from serialized representation 2014-07-06 10:36:33 +12:00
Yueh-Hsuan Chiang 90a6aca48e Finer report I/O stats about Flush and Compaction.
Summary:
This diff allows the I/O stats about Flush and Compaction to be reported
in a more accurate way.  Instead of measuring the size of a file, it
measure I/O cost in per read / write basis.

Test Plan: make all check

Reviewers: sdong, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19383
2014-07-03 16:28:03 -07:00
Yueh-Hsuan Chiang d4d338de33 Add timeout_hint_us to WriteOptions and introduce Status::TimeOut.
Summary:
This diff adds timeout_hint_us to WriteOptions.  If it's non-zero, then
1) writes associated with this options MAY be aborted when it has been
  waiting for longer than the specified time.  If an abortion happens,
  associated writes will return Status::TimeOut.
2) the stall time of the associated write caused by flush or compaction
  will be limited by timeout_hint_us.

The default value of timeout_hint_us is 0 (i.e., OFF.)

The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.

Test Plan:
export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
make db_test
./db_test

Reviewers: igor, ljin, haobo, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18837
2014-07-03 15:47:02 -07:00
sdong 2459f7ec4e Support Multiple DB paths (without having an interface to expose to users)
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.

Test Plan: make all check

Reviewers: igor, haobo, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18921
2014-07-02 21:14:44 -07:00
sdong 9c332aa11a HashLinkList memtable switches a bucket to a skip list to reduce performance outliers
Summary:
In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search.

Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list.

The new data structure is documented in comments in the codes.

Test Plan:
make all check
set threshold_use_skiplist in several tests

Reviewers: yhchiang, haobo, ljin

Reviewed By: yhchiang, ljin

Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D19299
2014-07-01 17:14:15 -07:00
sdong 19de6a7aad Remove MemTableRep::GetIterator(const Slice& slice)
Summary: It seems to me that when ever function MemTableRep::GetIterator(const Slice& slice) is used, we can use MemTableRep::GetDynamicPrefixIterator() instead. Just delete it to simplify the codes.

Test Plan: make all check

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, dhruba, haobo, leveldb

Differential Revision: https://reviews.facebook.net/D19281
2014-06-25 14:09:29 -07:00
Haobo Xu dfb31d152d [RocksDB] allow LDB tool to have customized key formatter
Summary: Currently ldb tool dump keys either in ascii format or hex format - neither is ideal if the key has a binary structure and is not readable in ascii. This diff also allows LDB tool to be customized in ways beyond DB options.

Test Plan: verify that key formatter works with some simple db with binary key.

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19209
2014-06-23 15:35:40 -07:00
Igor Canadi 00b26c3a83 JSONDocument
Summary:
After evaluating options for JSON storage, I decided to implement our own. The reason is that we'll be able to optimize it better and we get to reduce unnecessary dependencies (which is what we'd get with folly).

I also plan to write a serializer/deserializer for JSONDocument with our own binary format similar to BSON. That way we'll store binary JSON format in RocksDB instead of the plain-text JSON. This means less storage and faster deserialization.

There are still some inefficiencies left here. I plan to optimize them after we develop a functioning DocumentDB. That way we can move and iterate faster.

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: haobo

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18831
2014-06-20 11:14:14 +02:00
Igor Canadi d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
Igor Canadi bae495740d Merge pull request #179 from edsrzf/c-api-compaction-filter
Support for compaction filters in the C API
2014-06-19 21:22:46 +02:00
Evan Shaw d72313a7fa Add a way to set compaction filter in the C API 2014-06-19 16:31:24 +12:00
Evan Shaw df2701373d Support for compaction filters in the C API 2014-06-19 16:31:17 +12:00
sdong edd47c5104 PlainTable to encode to avoid to rewrite prefix when it is the same as the previous key
Summary:
Add a encoding feature of PlainTable to encode PlainTable's keys to save some bytes for the same prefixes.
The data format is documented in table/plain_table_factory.h

Test Plan: Add unit test coverage in plain_table_db_test

Reviewers: yhchiang, igor, dhruba, ljin, haobo

Reviewed By: haobo

Subscribers: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D18735
2014-06-18 20:41:52 -07:00
Haobo Xu 0f0076ed5a [RocksDB] Reduce memory footprint of the blockbased table hash index.
Summary:
Currently, the in-memory hash index of blockbased table uses a precise hash map to track the prefix to block range mapping. In some use cases, especially when prefix itself is big, the memory overhead becomes a problem. This diff introduces a fixed hash bucket array that does not store the prefix and allows prefix collision, which is similar to the plaintable hash index, in order to reduce the memory consumption.
Just a quick draft, still testing and refining.

Test Plan: unit test and shadow testing

Reviewers: dhruba, kailiu, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19047
2014-06-18 18:16:07 -07:00
Igor Canadi 3525aac9e5 Change order of parameters in adaptive table factory
Summary:
This is minor, but if we put the writing talbe factory as the third parameter, when we add a new table format, we'll have a situation:
1) block based factory
2) plain table factory
3) output factory
4) new format factory

I think it makes more sense to have output as the first parameter.

Also, fixed a NewAdaptiveTableFactory() call in unit test

Test Plan: unit test

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19119
2014-06-18 07:04:37 +02:00
sdong 8c265c08f1 HashLinkList to log distribution of number of entries aross buckets
Summary: Add two parameters of hash linked list to log distribution of number of entries across all buckets, and a sample row when there are too many entries in one single bucket.

Test Plan: Turn it on in plain_table_db_test and see the logs.

Reviewers: haobo, ljin

Reviewed By: ljin

Subscribers: leveldb, nkg-, dhruba, yhchiang

Differential Revision: https://reviews.facebook.net/D19095
2014-06-17 17:55:36 -07:00
sdong 200e4b4a72 Add a table factory that can read DB with both of PlainTable and BlockBasedTable in it
Summary: The new table factory is used if users want to convert a DB from one table format to the other. A user can use this table to open a DB written using one table format and write new files to another table format.

Test Plan: add a unit test

Reviewers: haobo, igor

Reviewed By: igor

Subscribers: dhruba, ljin, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D19017
2014-06-17 11:49:22 -07:00
Yueh-Hsuan Chiang e6e259b8ab Include max_write_buffer_number >= 2 to SanitizeOptions.
Summary: Include max_write_buffer_number >= 2 to SanitizeOptions.

Test Plan: make all check

Reviewers: haobo, sdong, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19077
2014-06-16 16:26:46 -07:00
Igor Canadi f43c8262c2 Don't compress block bigger than 2GB
Summary: This is a temporary solution to a issue that we have with compression libraries. See task #4453446.

Test Plan: make check doesn't complain :)

Reviewers: haobo, ljin, yhchiang, dhruba, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18975
2014-06-09 12:26:09 -07:00
Igor Canadi a0191c9dfe Create Missing Column Families
Summary: Provide an convenience option to create column families if they are missing from the DB. Task #4460490

Test Plan: added unit test. also, stress test for some time

Reviewers: sdong, haobo, dhruba, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D18951
2014-06-06 18:04:56 -07:00
sdong df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00
sdong 462796697c dynamic_bloom: replace some divide (remainder) operations with shifts in locality mode, and other improvements
Summary:
This patch changes meaning of options.bloom_locality: 0 means disable cache line optimization and any positive number means use CACHE_LINE_SIZE as block size (the previous behavior is the block size will be CACHE_LINE_SIZE*options.bloom_locality). By doing it, the divide operations inside a block can be replaced by a shift.

Performance is improved:
https://reviews.facebook.net/P471

Also, improve the basic algorithm in two ways:
(1) make sure num of blocks is an odd number
(2) rotate bytes after every probe in locality mode. Since the divider is 2^n, unless doing it, we are never able to use all the bits.
Improvements of false positive: https://reviews.facebook.net/P459

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: dhruba, yhchiang, igor, leveldb

Differential Revision: https://reviews.facebook.net/D18843
2014-06-02 17:36:38 -07:00
Igor Canadi f068d2a94d Move master version to 3.2 2014-05-23 10:27:56 -07:00
Igor Canadi 6de6a06631 FIFO compaction style
Summary:
Introducing new compaction style -- FIFO.

FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.

FIFO compaction style is suited for storing high-frequency event logs.

Test Plan: Added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

Subscribers: alberts, leveldb

Differential Revision: https://reviews.facebook.net/D18765
2014-05-21 11:43:35 -07:00
Igor Canadi b2cf95fe38 Call EnableFileDeletions with false as argument 2014-05-20 14:28:51 -07:00
Igor Canadi c07c9606ed Expose Status::code() 2014-05-14 16:23:40 -07:00
Igor Canadi 26f5dd9a5a TablePropertiesCollectorFactory
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.

Here's description of task #4296714:
       I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.

       For example, this table has 3155 entries and 1391828 deleted keys.

The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.

Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.

In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.

Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests

Reviewers: sdong, dhruba, haobo, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18579
2014-05-13 12:30:55 -07:00
Igor Canadi 3edc056f6d comment 2014-05-10 11:25:56 -07:00
Igor Canadi 038a477b53 Make it easier to start using RocksDB
Summary:
This diff is addressing multiple things with a single goal -- to make RocksDB easier to use:
* Add some functions to Options that make RocksDB easier to tune.
* Add example code for both simple RocksDB and RocksDB with Column Families.
* Rewrite our README.md

Regarding Options, I took a stab at something we talked about for a long time:
* https://www.facebook.com/groups/rocksdb.dev/permalink/563169950448190/

I added functions:
* IncreaseParallelism() -- easy, increases the thread pool and max_background_compactions
* OptimizeLevelStyleCompaction(memtable_memory_budget) -- the easiest way to optimize rocksdb for less stalls with level style compaction. This is very likely not ideal configuration. Feel free to suggest improvements. I used some of Mark's suggestions from here: https://github.com/facebook/rocksdb/issues/54
* OptimizeUniversalStyleCompaction(memtable_memory_budget) -- optimize for universal compaction.

Test Plan: compiled rocksdb. ran examples.

Reviewers: dhruba, MarkCallaghan, haobo, sdong, yhchiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18621
2014-05-10 10:49:33 -07:00
sdong 3a171dcb51 Pass logger to memtable rep and TLB page allocation error logged to info logs
Summary:
TLB page allocation errors are now logged to info logs, instead of stderr.
In order to do that, mem table rep's factory functions take a info logger now.

Test Plan: make all check

Reviewers: haobo, igor, yhchiang

Reviewed By: yhchiang

CC: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D18471
2014-05-05 16:43:37 -07:00
sdong 4a7c747064 Revert "Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB""
And make the default 0 for hash linked list memtable

This reverts commit d69dc64be7.
2014-05-04 13:56:29 -07:00
Igor Canadi db1854d78c Declare all DB methods virtual so that StackableDB can override them 2014-05-04 11:39:49 -07:00