Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
There is one test case that relied on didIO information. I'll try to find another way to implement it.
Test Plan: make check
Reviewers: sdong, haobo, yhchiang, ljin, dhruba
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19161
Summary:
We decided that one of the long term goals is to unify level and universal compaction.
As a small first step, I'm unifying level 0 sorting methods.
Previously, we used to sort level 0 files in level compaction by file number and in universal compaction by sequence number.
But it turns out that in level compaction, sorting by file number is exactly the same as sorting by sequence number.
Test Plan:
Ran make check with bunch of asserts to verify the sorting order is exactly the same.
Also, make check with this patch
Reviewers: haobo, yhchiang, ljin, dhruba, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19131
Summary: Currently, RocksDB returns error if a db written with prefix hash index, is later opened without providing a prefix extractor. This is uncessarily harsh. Without a prefix extractor, we could always fallback to the normal binary index.
Test Plan: unit test, also manually veried LOG that fallback did occur.
Reviewers: sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19191
Summary:
Currently, when something badly happen in the DB::Write() while the write-queue
contains more than one element, the current design seems to forget to clean up
the queue as well as wake-up all the writers, this potentially makes rocksdb
hang on writes.
Test Plan: make all check
Reviewers: sdong, ljin, igor, haobo
Reviewed By: haobo
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19167
Summary: asan_crash_test is failing on segfault
Test Plan: running asan_crash_test
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19149
Summary:
Add a encoding feature of PlainTable to encode PlainTable's keys to save some bytes for the same prefixes.
The data format is documented in table/plain_table_factory.h
Test Plan: Add unit test coverage in plain_table_db_test
Reviewers: yhchiang, igor, dhruba, ljin, haobo
Reviewed By: haobo
Subscribers: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D18735
Summary:
This is minor, but if we put the writing talbe factory as the third parameter, when we add a new table format, we'll have a situation:
1) block based factory
2) plain table factory
3) output factory
4) new format factory
I think it makes more sense to have output as the first parameter.
Also, fixed a NewAdaptiveTableFactory() call in unit test
Test Plan: unit test
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19119
Summary: Add two parameters of hash linked list to log distribution of number of entries across all buckets, and a sample row when there are too many entries in one single bucket.
Test Plan: Turn it on in plain_table_db_test and see the logs.
Reviewers: haobo, ljin
Reviewed By: ljin
Subscribers: leveldb, nkg-, dhruba, yhchiang
Differential Revision: https://reviews.facebook.net/D19095
Summary: The new table factory is used if users want to convert a DB from one table format to the other. A user can use this table to open a DB written using one table format and write new files to another table format.
Test Plan: add a unit test
Reviewers: haobo, igor
Reviewed By: igor
Subscribers: dhruba, ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D19017
Summary: Include max_write_buffer_number >= 2 to SanitizeOptions.
Test Plan: make all check
Reviewers: haobo, sdong, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19077
Summary:
We added multiple fields to FileMetaData recently and are planning to add more.
This refactoring separate the minimum information for accessing the file. This object is copyable (FileMetaData is not copyable since the ref counter). I hope this refactoring can enable further improvements:
(1) use it to design a more efficient data structure to speed up read queries.
(2) in the future, when we add information of storage level, we can easily do the encoding, instead of enlarge this structure, which might expand memory work set for file meta data.
The definition is same as current EncodedFileMetaData used in two level iterator, so now the logic in two level iterator is easier to understand.
Test Plan: make all check
Reviewers: haobo, igor, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba, yhchiang
Differential Revision: https://reviews.facebook.net/D18933
Some platforms, particularly Windows, do not have a single method that can
release both a held reader lock and a held writer lock; instead, a
separate method (ReleaseSRWLockShared or ReleaseSRWLockExclusive) must be
called in each case.
This may also be necessary to back MutexRW with a shared_mutex in C++14;
the current language proposal includes both an unlock() and a
shared_unlock() method.
Summary:
https://reviews.facebook.net/D17205 removed the logic of skipping file key range check when there are less than 3 level 0 files. This patch brings it back.
Other than that, add another small optimization to avoid to check all the levels if most higher levels don't have any file.
Test Plan: make all check
Reviewers: ljin
Reviewed By: ljin
Subscribers: yhchiang, igor, haobo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19035
Summary: as title
Test Plan:
db_bench
the initial result is very promising. I will post results of complete
runs
Reviewers: dhruba, haobo, sdong, igor
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18867
Summary:
Clean PlainTableReader's data structures:
(1) inline bloom_ (in order to do this, change DynamicBloom to allow lazy initialization)
(2) remove some variables only used when initialization from the class
(3) put variables not used in normal read code paths to the end of the class and reference prefix_extractor directly
(4) make Options a reference.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: ljin
Subscribers: igor, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18891
Summary: Provide an convenience option to create column families if they are missing from the DB. Task #4460490
Test Plan: added unit test. also, stress test for some time
Reviewers: sdong, haobo, dhruba, ljin, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D18951
Summary: We have a perf regression of Write() even with one column family. Make fast path for single column family to avoid the perf regression. See task #4455480
Test Plan: make check
Reviewers: sdong, ljin
Reviewed By: sdong, ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18963
Summary: Now sst_dump fails in block based tables if binary search index is used, as it requires a prefix extractor. Add it.
Test Plan: Run it against such a file to make sure it fixes the problem.
Reviewers: yhchiang, kailiu
Reviewed By: kailiu
Subscribers: ljin, igor, dhruba, haobo, leveldb
Differential Revision: https://reviews.facebook.net/D18927
Summary: In universal compaction, MaxFileSizeForLevel is ULLONG_MAX. We've been preallocation files to UULONG_MAX size all these time :)
Test Plan: make check
Reviewers: dhruba, haobo, ljin, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18915
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
Summary:
At the end of BackgroundCallCompaction(), we call SignalAll(), even though we don't need to. If compaction hasn't done anything and there's another compaction running, there is no need to signal on the condition variable. Doing so creates a tight feedback loop which results in log files like:
wait for memtable flush
compaction nothing to do
wait for memtable flush
compaction nothing to do
This change eliminates that
Test Plan:
make check
Also:
icanadi@dev1440 ~ $ grep "nothing to do" /fast-rocksdb-tmp/rocksdb_test/column_family_test/LOG | wc -l
7435
icanadi@dev1440 ~ $ grep "nothing to do" /fast-rocksdb-tmp/rocksdb_test/column_family_test/LOG | wc -l
372
First version is before the change, second version is after the change.
Reviewers: dhruba, ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18855
Summary:
We've seen some production issues where column family is detected as stale, although there is only one column family in the system. This is a quick fix that:
1) doesn't flush stale column families if there's only one of them
2) Use 4 as a coefficient instead of 2 for determening when a column family is stale. This will make flushing less aggressive, while still keep a nice dynamic flushing of very stale CFs.
Test Plan: make check
Reviewers: dhruba, haobo, ljin, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18861
Summary:
Forward iterator puts everything together in a flat structure instead of
a hierarchy of nested iterators. this should simplify the code and
provide better performance. It also enables more optimization since all
information are accessiable in one place.
Init evaluation shows about 6% improvement
Test Plan: db_test and db_bench
Reviewers: dhruba, igor, tnovak, sdong, haobo
Reviewed By: haobo
Subscribers: sdong, leveldb
Differential Revision: https://reviews.facebook.net/D18795
Summary: One more option to allow iterator refreshing when using normal iterator
Test Plan: ran db_bench
Reviewers: haobo, sdong, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18849
Summary:
Introducing new compaction style -- FIFO.
FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.
FIFO compaction style is suited for storing high-frequency event logs.
Test Plan: Added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
Subscribers: alberts, leveldb
Differential Revision: https://reviews.facebook.net/D18765
Summary: Key size limit doesn't seem to be applicable anymore. Remove it.
Test Plan: run a couple of tests in db_bench
Reviewers: haobo, igor, yhchiang, dhruba
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18723
Summary: Cleaned up compaction logging a little bit. Now file sizes are easier to read. Also, removed the trailing space.
Test Plan:
verified that i'm happy with logging output:
files_size[#33(seq=101,sz=98KB,0) #31(seq=81,sz=159KB,0) #26(seq=0,sz=637KB,0)]
Reviewers: sdong, haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18549
Summary:
In order to use arena to a use case that the total allocation size might be small (LogBuffer is already such a case), inline 1KB of data in it, so that it can be mostly in stack or inline in another class.
If always inlining 2KB is a concern, I could make it a template to determine what to inline. However, dependents need to changes. Doesn't go with it for now
Test Plan: make all check.
Reviewers: haobo, igor, yhchiang, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18609
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.
Here's description of task #4296714:
I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.
For example, this table has 3155 entries and 1391828 deleted keys.
The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.
Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.
In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.
Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests
Reviewers: sdong, dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18579
Summary:
Fixed a file-not-found issue when a log file is moved to archive
by doing a missing retry.
Test Plan:
make db_test
export ROCKSDB_TEST=TransactionLogIteratorRace
./db_test
Reviewers: sdong, haobo
Reviewed By: sdong
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D18669
Summary: as title
Test Plan: Still compile
Reviewers: haobo, igor, yhchiang
Reviewed By: igor
CC: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18603
Summary:
Newer gflags switched from `google` namespace to `gflags` namespace. See: https://github.com/facebook/rocksdb/issues/139 and https://github.com/facebook/rocksdb/issues/102
Unfortunately, they don't define any macro with their namespace, so we need to actually try to compile gflags with two different namespace to figure out which one is the correct one.
Test Plan: works in fbcode environemnt. I'll also try in ubutnu with newer gflags
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18537
Summary: We had a hypothesis in https://reviews.facebook.net/D18507 that empty-string internal keys might have been caused by compaction filter deleting all the entries. I added a unit test for that case. Unforutnately, everything works as expected.
Test Plan: this is a test
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18519
Summary:
This was reported by our customers in task #4295529.
Cause:
* MANIFEST file contains a VersionEdit, which contains file entries whose 'smallest' and 'largest' internal keys are empty. String with zero characters. Root cause of corruption was not investigated. We should report corruption when this happens. However, we currently SIGSEGV.
Here's what happens:
* VersionEdit encodes zero-strings happily and stores them in smallest and largest InternalKeys. InternalKey::Encode() does assert when `rep_.empty()`, but we don't assert in production environemnts. Also, we should never assert as a result of DB corruption.
* As part of our ConsistencyCheck, we call GetLiveFilesMetaData()
* GetLiveFilesMetadata() calls `file->largest.user_key().ToString()`
* user_key() function does: 1. assert(size > 8) (ooops, no assert), 2. returns `Slice(internal_key.data(), internal_key.size() - 8)`
* since `internal_key.size()` is unsigned int, this call translates to `Slice(whatever, 1298471928561892576182756)`. Bazinga.
Fix:
* VersionEdit checks if InternalKey is valid in `VersionEdit::GetInternalKey()`. If it's invalid, returns corruption.
Lessons learned:
* Always keep in mind that even if you `assert()`, production code will continue execution even if assert fails.
* Never `assert` based on DB corruption. Assert only if the code should guarantee that assert can't fail.
Test Plan: dumped offending manifest. Before: assert. Now: corruption
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18507
Summary: One of our users reported current file corruption. The machine was rebooted during the time. This is the only think I can think of which could cause current file corruption. Just add this paranoid check.
Test Plan: make all check
Reviewers: haobo, igor
Reviewed By: haobo
CC: yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18495
Summary:
TLB page allocation errors are now logged to info logs, instead of stderr.
In order to do that, mem table rep's factory functions take a info logger now.
Test Plan: make all check
Reviewers: haobo, igor, yhchiang
Reviewed By: yhchiang
CC: leveldb, yhchiang, dhruba
Differential Revision: https://reviews.facebook.net/D18471
Summary:
db_test includes Benchmark for LogAndApply. This diff removes it from db_test and puts it into a separate log_and_apply bench. I just wanted to play around with our new benchmark framework and figure out how it works.
I would also like to show you how great it is! I believe right set of microbenchmarks can speed up our productivity a lot and help catch early regressions.
Test Plan: no
Reviewers: dhruba, haobo, sdong, ljin, yhchiang
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18261
Summary:
Added a new option `max_total_wal_size`. Once the total WAL size goes over that, we make an attempt to flush all column families that still have data in the earliest WAL file.
By default, I calculate `max_total_wal_size` dynamically, that should be good-enough for non-advanced customers.
Test Plan: Added a test
Reviewers: dhruba, haobo, sdong, ljin, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18345
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: nkg-, dhruba, leveldb, igor, yhchiang
Differential Revision: https://reviews.facebook.net/D18357
Summary:
= Major Changes =
* Add a new mem-table representation, HashCuckooRep, which is based cuckoo hash.
Cuckoo hash uses multiple hash functions. This allows each key to have multiple
possible locations in the mem-table.
- Put: When insert a key, it will try to find whether one of its possible
locations is vacant and store the key. If none of its possible
locations are available, then it will kick out a victim key and
store at that location. The kicked-out victim key will then be
stored at a vacant space of its possible locations or kick-out
another victim. In this diff, the kick-out path (known as
cuckoo-path) is found using BFS, which guarantees to be the shortest.
- Get: Simply tries all possible locations of a key --- this guarantees
worst-case constant time complexity.
- Time complexity: O(1) for Get, and average O(1) for Put if the
fullness of the mem-table is below 80%.
- Default using two hash functions, the number of hash functions used
by the cuckoo-hash may dynamically increase if it fails to find a
short-enough kick-out path.
- Currently, HashCuckooRep does not support iteration and snapshots,
as our current main purpose of this is to optimize point access.
= Minor Changes =
* Add IsSnapshotSupported() to DB to indicate whether the current DB
supports snapshots. If it returns false, then DB::GetSnapshot() will
always return nullptr.
Test Plan:
Run existing tests. Will develop a test specifically for cuckoo hash in
the next diff.
Reviewers: sdong, haobo
Reviewed By: sdong
CC: leveldb, dhruba, igor
Differential Revision: https://reviews.facebook.net/D16155
Summary:
ReadFirstRecord() reads the actual log file from disk on every call. This diff introduces a cache layer on top of ReadFirstRecord(), which should significantly speed up repeated calls to GetUpdatesSince().
I also cleaned up some stuff, but the whole TransactionLogIterator could use some refactoring, especially if we see increased usage.
Test Plan: make check
Reviewers: haobo, sdong, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18387
Summary:
When TransactionLogIterator comes to EOF, it calls UnmarkEOF and continues reading. However, if glibc cached the EOF status of the file, it will get EOF again, even though the new data might have been written to it.
This has been causing errors in Mac OS.
Test Plan: test passes, was failing before
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18381
Summary:
As a follow-up diff for https://reviews.facebook.net/D17805, add
optimization to check PrefixMayMatch on Seek()
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17853
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D17805
Summary: As summary. Add two autovectors that get filled up in MakeRoomForWrite and they get deleted outside of mutex
Test Plan: make check
Reviewers: dhruba, haobo, ljin, sdong
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18249
Summary:
Now that we have column families involved, we need to add extra context to every log message. They now start with "[column family name] log message"
Also added some logging that I think would be useful, like level summary after every flush (I often needed that when going through the logs).
Test Plan: make check + ran db_bench to confirm I'm happy with log output
Reviewers: dhruba, haobo, ljin, yhchiang, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18303
Summary:
I'm getting lots of e-mails with CompactionInputErrorParanoid failing. Most recent example early morning today was: http://ci-builds.fb.com/job/rocksdb_valgrind/562/consoleFull
I'm putting a stop to these e-mails. I investigated why the test is flakey and it turns out it's because of non-determinsim of compaction scheduling. If there is a compaction after the last flush, CorruptFile will corrupt the compacted file instead of file at level 0 (as it assumes). That makes `Check(9, 9)` fail big time.
I also saw some errors with table file getting outputed to >= 1 levels instead of 0. Also fixed that.
Test Plan: Ran corruption_test 100 times without a failure. Previously it usually failed at 10th occurrence.
Reviewers: dhruba, haobo, ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18285
Summary: IterKey set buffer_size_ to a wrong initial value, causing it to always allocate values from heap instead of stack if the key size is smaller. Fix it.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: igor, dhruba, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D18279
Summary: While debugging Mac-only issue with ThreadLocalPtr, this was very useful. Let's print out stack trace in MAC OS, too.
Test Plan: Verified that somewhat useful stack trace was generated on mac. Will run PrintStack() on linux, too.
Reviewers: ljin, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18189
Summary: In this patch, two new DB properties are defined: rocksdb.num-immutable-mem-table and rocksdb.num-entries-imm-mem-tables, from where number of entries in mem tables can be exposed to users
Test Plan:
Cover the codes in db_test
make all check
Reviewers: haobo, ljin, igor
Reviewed By: igor
CC: nkg-, igor, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18207
Summary: Get rid of the devil. Probably won't impact anything on the perf side.
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D18153
Summary:
This is a temp solution to expose index sizes to users from PlainTableReader before we persistent them to files.
In this patch, the memory consumption of indexes used by PlainTableReader will be reported as two user defined properties, so that users can monitor them.
Test Plan:
Add a unit test.
make all check`
Reviewers: haobo, ljin
Reviewed By: haobo
CC: nkg-, yhchiang, igor, ljin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18195
Summary:
Using ThreadLocalPtr as a flag to determine if a mutex is locked or not enables us to implement AssertNotHeld(). It also makes AssertHeld() actually correct.
I had to remove port::Mutex as a dependency for util/thread_local.h, but that's fine since we can just use std::mutex :)
Test Plan: make check
Reviewers: ljin, dhruba, haobo, sdong, yhchiang
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18171
Summary:
This will enable people using TTL DB to do so with multiple column families. They can also specify different TTLs for each one.
TODO: Implement CreateColumnFamily() in TTL world.
Test Plan: Added a very simple sanity test.
Reviewers: dhruba, haobo, ljin, sdong, yhchiang
Reviewed By: haobo
CC: leveldb, alberts
Differential Revision: https://reviews.facebook.net/D17859
Summary: Added benchmark functionality on the lines of folly/Benchmark.h
Test Plan: Added unit tests
Reviewers: igor, haobo, sdong, ljin, yhchiang, dhruba
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17973
Summary:
The file tree structure in Version is prebuilt and the range of each file is known.
On the Get() code path, we do binary search in FindFile() by comparing
target key with each file's largest key and also check the range for each L0 file.
With some pre-calculated knowledge, each key comparision that has been done can serve
as a hint to narrow down further searches:
(1) If a key falls within a L0 file's range, we can safely skip the next
file if its range does not overlap with the current one.
(2) If a key falls within a file's range in level L0 - Ln-1, we should only
need to binary search in the next level for files that overlap with the current one.
(1) will be able to skip some files depending one the key distribution.
(2) can greatly reduce the range of binary search, especially for bottom
levels, given that one file most likely only overlaps with N files from
the level below (where N is max_bytes_for_level_multiplier). So on level
L, we will only look at ~N files instead of N^L files.
Some inital results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this
improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s =
9.6M/s), it gives us ~13% improvement.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, sdong, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17205
Summary:
D17961 has two bugs:
(1) two level iterator fails to populate FileMetaData.table_reader, causing performance regression.
(2) table cache handle the !status.ok() case in the wrong place, causing seg fault which shouldn't happen.
Test Plan: make all check
Reviewers: ljin, igor, haobo
Reviewed By: ljin
CC: yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D17991
Summary:
One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
(1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects
(2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it
(3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: igor, yhchiang, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17739
Summary: Current behavior of creating new DB is, if there is existing log files, we will go ahead and replay them on top of empty DB. This is a behavior that no user would expect. With this patch, we will fail the creation if a user creates a DB with existing log files.
Test Plan: make all check
Reviewers: haobo, igor, ljin
Reviewed By: haobo
CC: nkg-, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D17817
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.
Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)
Test Plan: compiles :)
Reviewers: dhruba, haobo, ljin, sdong, yhchiang
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17835
Summary:
With multiple column families, especially when manual Flush is executed, we might roll the log file, although the current log file is empty (no data has been written to the log).
After the diff, we won't create new log file if current is empty.
Next, I will write an algorithm that will flush column families that reference old log files (i.e., that weren't flushed in a while)
Test Plan: Added an unit test. Confirmed that unit test failes in master
Reviewers: dhruba, haobo, ljin, sdong
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17631
Summary: multireadrandom is broken. Fix it
Test Plan: run it and see segfault has gone.
Reviewers: ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17781
Summary:
Fix the following compile error
./db/tailing_iter.h:17:1: error: class 'SuperVersion' was previously declared as a struct [-Werror,-Wmismatched-tags]
class SuperVersion;
^
./db/column_family.h:77:8: note: previous use is here
struct SuperVersion {
^
./db/tailing_iter.h:17:1: note: did you mean struct here?
class SuperVersion;
^~~~~
struct
1 error generated.
Test Plan: make
Reviewers: ljin, igor, haobo, sdong
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17799
Summary:
replace the super version acquisision in tailing itrator with thread
local
Test Plan: will post results
Reviewers: igor, haobo, sdong, yhchiang, dhruba
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17757
Summary:
Similar to GetImp(), use SuperVersion from thread local instead of acquriing mutex.
I don't expect this change will make a dent on NewIterator() performance
because the bottleneck seems to be on the rest part of the API
Test Plan:
make asan_check
will post perf numbers
Reviewers: haobo, igor, sdong, dhruba, yhchiang
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17643
Summary: This patch introduces a new parameter num_multi_db in db_bench. When this parameter is larger than 1, multiple DBs will be created. In all benchmarks, any operation applies to a random DB among them. This is to benchmark the performance of similar applications.
Test Plan: run db_bench on both of num_multi_db=0 and more.
Reviewers: haobo, ljin, igor
Reviewed By: igor
CC: igor, yhchiang, dhruba, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17769
Summary:
it writes ~10M data, default L0 compaction trigger is 4, plus 2 writer
buffer, so that can accommodate ~6M data before compaction happens for
sure. I guess encoding is doing a good job to shrink the data so that
sometime, compaction does not get triggered. I get test failure quite
often.
Test Plan: ran it multiple times and all got pass
Reviewers: igor, sdong
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17775
Summary: as title
Test Plan: ran it
Reviewers: igor, haobo, yhchiang
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17751
Summary: XCode for some reason injects `#define DEBUG 1` into our code, which makes compile fail because we use `DEBUG` keyword for other stuff. This diff fixes the issue by renaming `DEBUG` to `DEBUG_LEVEL`.
Test Plan: compiles
Reviewers: dhruba, haobo, sdong, yhchiang, ljin
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17709
Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index.
Test Plan: Wrote several new unit tests.
Reviewers: sdong, haobo, dhruba
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16539
Summary: as title
Test Plan: ran it
Reviewers: igor, haobo, yhchiang
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17655
Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes.
Test Plan: compiles
Reviewers: dhruba, haobo, ljin, sdong
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17649
Summary:
1. Polish IterKey a little bit.
2. Turn to use it in local parameter of current_user_key in DBImpl::ProcessKeyValueCompaction(). Our profile showing that DBImpl::ProcessKeyValueCompaction() has about 14% costs in std::string (the base including reading and writing data but excluding compaction filtering), which is higher than it should be. There are two std::string used in DBImpl::ProcessKeyValueCompaction(), compaction_filter_value and current_user_key and it's hard to distinguish the two.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: igor, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D17613
Summary:
This is testing behavior that was reported in https://github.com/facebook/rocksdb/issues/111
No issue was found, but it still good to commit this and make CompressedCache more robust.
Test Plan: this is a plan
Reviewers: ljin, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17625
Summary:
filluniquerandom is painfully slow due to the naive bitmap check to find
out if a key has been seen before. Majority of time is spent on searching
the last few keys. Split a giant BitSet to smaller ones so that we can
quickly check if a BitSet is full and thus can skip quickly.
It used to take over one hour to filluniquerandom for 100M keys, now it
takes about 10 mins.
Test Plan:
unit test
also verified correctness in db_bench and make sure all keys are
generated
Reviewers: igor, haobo, yhchiang
Reviewed By: igor
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D17607
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)
Test Plan: added a unit test in column_family_test
Reviewers: haobo, sdong, ljin, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17565
Summary:
GetProperty test is flakey.
Before this diff: P8635927
After: P8635945
We need to make sure the thread is done before we destruct sleeping tasks. Otherwise, bad things happen.
Test Plan: See summary
Reviewers: ljin, sdong, haobo, dhruba
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17595
Summary:
clean up the db_bench a little bit. also avoid allocating memory for key
in the loop
Test Plan:
I verified a run with filluniquerandom & readrandom. Iterator seek will be used lot
to measure performance. Will fix whatever comes up
Reviewers: haobo, igor, yhchiang
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17559
Summary: ToString() is expensive. Profiling shows that most compaction threads are stuck in jemalloc, allocating a new string. This will help out a litte.
Test Plan: make check
Reviewers: haobo, danguo
Reviewed By: danguo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17583
Summary: This will allow us to disable them completely for iOS or for better performance
Test Plan: will run make all check
Reviewers: igor, haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17511
Summary:
Move PlainTableIterator's copied key from std::string local buffer to avoid paying the extra costs in std::string related to sharing. Reuse the same buffer class in DbIter. Move the class to dbformat.h.
This patch improves iterator performance significantly. Running this benchmark:
./table_reader_bench --num_keys2=17 --iterator --plain_table --time_unit=nanosecond
The average latency is improved to about 750 nanoseconds from 1100 nanoseconds.
Test Plan:
Add a unit test.
make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: igor, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D17547
Summary: If a client specifies wal_dir with trailing '/', we will fail in deleting obsolete log files. See task #4083746
Test Plan: make check
Reviewers: haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17535
Summary: Our measurement shows that sometimes new log::Write's constructor can take hundreds of milliseconds. It's unclear why but just simply move it out of DB mutex.
Test Plan: make all check
Reviewers: haobo, ljin, igor
Reviewed By: haobo
CC: nkg-, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17487
Summary: per sdong's request, this will help processor prefetch on n->key case.
Test Plan: make all check
Reviewers: sdong, haobo, igor
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17415
Summary: Flushing log buffer earlier to avoid confusion of time holding the locks.
Test Plan: Should be safe as long as several related db test passes
Reviewers: haobo, igor, ljin
Reviewed By: igor
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17493
Summary: Move several some common logging still in DB mutex to log buffer.
Test Plan: make all check
Reviewers: haobo, igor, ljin, nkg-
Reviewed By: nkg-
CC: nkg-, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17439
Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.
Test Plan: TransactionLogIteratorRace; make check
Reviewers: dhruba, ljin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17121
Summary: As we know, logging can be slow, or even hang for some file systems. Move one more logging out of DB mutex.
Test Plan: make all check
Reviewers: haobo, igor, ljin
Reviewed By: igor
CC: yhchiang, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D17427
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.
Test Plan: make all check
Reviewers: haobo, yhchiang, igor
CC: danguo, dhruba, ljin, igor, leveldb
Differential Revision: https://reviews.facebook.net/D17223
Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead
Test Plan:
Run the failing valgrind test (Harness.RandomizedLongDB)
make all check
Reviewers: igor, haobo, ljin
Reviewed By: igor
CC: yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17409
Summary: DBIter now uses a std::string for saved_key. Based on some profiling, it could be more expensive than we though. Optimize it with the same technique as LookupKey -- if it is short, we copy it to a static allocated char. Otherwise, dynamically allocate memory for it.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: dhruba, igor, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D17289
Summary:
In total order mode, iterator's seek() shouldn't check total order.
Also some cleaning up about checking null for shared pointers. I don't know the behavior before it.
This bug was reported by @igor.
Test Plan: test plain_table_db_test
Reviewers: ljin, haobo, igor
Reviewed By: igor
CC: yhchiang, dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D17391
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.
Test Plan: make check
Reviewers: haobo, ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17343
Summary: Fix some more CompactionFilterV2 valgrind issues. Maybe it would make sense for CompactionFilterV2 to delete its prefix_extractor?
Test Plan: ran CompactionFilterV2* tests with valgrind. issues before patch -> no issues after
Reviewers: haobo, sdong, ljin, dhruba
Reviewed By: dhruba
CC: leveldb, danguo
Differential Revision: https://reviews.facebook.net/D17337
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.
Test Plan: make all check
Reviewers: dhruba, haobo, ljin, igor, yhchiang
Reviewed By: igor
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D16995