Commit graph

213 commits

Author SHA1 Message Date
Feng Zhu f697cad15c create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
    Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
    Thus increase the file meta data locality, speed up "Get" and "FindFile"

    benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"

benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920

Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.

Test Plan:
    make all check

Reviewers: ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, igor

Differential Revision: https://reviews.facebook.net/D19419
2014-07-09 22:14:39 -07:00
Yueh-Hsuan Chiang 70828557ef Some fixes on size compensation logic for deletion entry in compaction
Summary:
This patch include two fixes:
1. newly created Version will now takes the aggregated stats for average-value-size from the latest Version.
2. compensated size of a file is now computed only for newly created / loaded file, this addresses the issue where files are already sorted by their compensated file size but might sometimes observe some out-of-order due to later update on compensated file size.

Test Plan:
export ROCKSDB_TESTS=CompactionDele
./db_test

Reviewers: ljin, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19557
2014-07-09 12:46:08 -07:00
sdong 2459f7ec4e Support Multiple DB paths (without having an interface to expose to users)
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.

Test Plan: make all check

Reviewers: igor, haobo, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18921
2014-07-02 21:14:44 -07:00
Igor Canadi a2e0d890ed No need for files_by_size_ in universal compaction
Summary: files_by_size_ is sorted by time in case of universal compaction. However, Version::files_ is also sorted by time. So no need for files_by_size_

Test Plan:
1) make check with the change
2) make check with `assert(last_index == c->input_version_->files_[level].size() - 1);` in compaction picker

Reviewers: dhruba, haobo, yhchiang, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19125
2014-07-01 08:55:04 +02:00
Yueh-Hsuan Chiang e813f5b6d9 Allow compaction to reclaim storage more effectively.
Summary:
This diff allows compaction to reclaim storage more effectively.
In the current design, compactions are mainly triggered based on
the file sizes.  However, since deletion entries does not have
value, files which have many deletion entries are less likely
to be compacted.  As a result, it may took a while to make
deletion entries to be compacted.

This diff address issue by compensating the size of deletion
entries during compaction process: the size of each deletion
entry in the compaction process is augmented by 2x average
value size.  The diff applies to both leveled and universal
compacitons.

Test Plan:
develop CompactionDeletionTrigger
make db_test
./db_test

Reviewers: haobo, igor, ljin, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19029
2014-06-24 16:37:06 -06:00
Igor Canadi d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
sdong 983c93d731 VersionSet::Get(): Bring back the logic of skipping key range check when there are <=3 level 0 files
Summary:
https://reviews.facebook.net/D17205 removed the logic of skipping file key range check when there are less than 3 level 0 files. This patch brings it back.

Other than that, add another small optimization to avoid to check all the levels if most higher levels don't have any file.

Test Plan: make all check

Reviewers: ljin

Reviewed By: ljin

Subscribers: yhchiang, igor, haobo, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19035
2014-06-13 15:51:44 -07:00
Lei Jin c83b085770 prefetch bloom filter data block for L0 files
Summary: as title

Test Plan:
db_bench
the initial result is very promising. I will post results of complete
runs

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18867
2014-06-12 10:06:18 -07:00
sdong df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00
Lei Jin 388d2054c7 forward iterator
Summary:
Forward iterator puts everything together in a flat structure instead of
a hierarchy of nested iterators. this should simplify the code and
provide better performance. It also enables more optimization since all
information are accessiable in one place.
Init evaluation shows about 6% improvement

Test Plan: db_test and db_bench

Reviewers: dhruba, igor, tnovak, sdong, haobo

Reviewed By: haobo

Subscribers: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D18795
2014-05-30 14:31:55 -07:00
Igor Canadi 6de6a06631 FIFO compaction style
Summary:
Introducing new compaction style -- FIFO.

FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.

FIFO compaction style is suited for storing high-frequency event logs.

Test Plan: Added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

Subscribers: alberts, leveldb

Differential Revision: https://reviews.facebook.net/D18765
2014-05-21 11:43:35 -07:00
Igor Canadi c489499a2b Fix OSX compile 2014-04-26 17:15:43 -04:00
Lei Jin d642c60bdc Check PrefixMayMatch on Seek()
Summary:
As a follow-up diff for https://reviews.facebook.net/D17805, add
optimization to check PrefixMayMatch on Seek()

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17853
2014-04-25 12:22:23 -07:00
Lei Jin 0f2d768191 hints for narrowing down FindFile range and avoiding checking unrelevant L0 files
Summary:
The file tree structure in Version is prebuilt and the range of each file is known.
On the Get() code path, we do binary search in FindFile() by comparing
target key with each file's largest key and also check the range for each L0 file.
With some pre-calculated knowledge, each key comparision that has been done can serve
as a hint to narrow down further searches:
(1) If a key falls within a L0 file's range, we can safely skip the next
file if its range does not overlap with the current one.
(2) If a key falls within a file's range in level L0 - Ln-1, we should only
need to binary search in the next level for files that overlap with the current one.

(1) will be able to skip some files depending one the key distribution.
(2) can greatly reduce the range of binary search, especially for bottom
levels, given that one file most likely only overlaps with N files from
the level below (where N is max_bytes_for_level_multiplier). So on level
L, we will only look at ~N files instead of N^L files.

Some inital results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this
improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s =
9.6M/s), it gives us ~13% improvement.

Test Plan: make all check

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17205
2014-04-21 09:10:12 -07:00
sdong fa430bfd04 Minimize accessing multiple objects in Version::Get()
Summary:
One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
(1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects
(2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it
(3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17739
2014-04-17 14:14:00 -07:00
Igor Canadi 588bca2020 RocksDBLite
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.

Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)

Test Plan: compiles :)

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17835
2014-04-15 13:39:26 -07:00
Igor Canadi b947fdc89d Column family support for DB::OpenForReadOnly()
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)

Test Plan: added a unit test in column_family_test

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17565
2014-04-09 09:56:17 -07:00
Igor Canadi e20fa3f8a4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_set.cc
2014-03-19 17:22:20 -07:00
Igor Canadi fcd5c5e828 ComputeCompactionScore in CompactionPicker
Summary:
As it turns out, we need the call to ComputeCompactionScore (previously: Finalize) in CompactionPicker.

The issue caused a deadlock in db_stress: http://ci-builds.fb.com/job/rocksdb_crashtest/290/console

The last two lines before a deadlock were:
2014/03/18-22:43:41.481029 7facafbee700 (Original Log Time 2014/03/18-22:43:41.480989) Compaction nothing to do
2014/03/18-22:43:41.481041 7faccf7fc700 wait for fewer level0 files...

"Compaction nothing to do" and other thread waiting for fewer level0 files. Hm hm.

I moved the pre-sorting to SaveTo, which should fix both the original and the new issue.

Test Plan: make check for now, will run db_stress in jenkins

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17037
2014-03-19 16:52:26 -07:00
Igor Canadi e25819a185 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
2014-03-18 14:00:20 -07:00
Igor Canadi 758fa8c359 Don't Finalize in CompactionPicker
Summary:
Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.

Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console

Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.

Short-term, I removed Finalize() from CompactionPicker.

Note: I believe this is an issue in current 2.7 version running in production.

Test Plan:
make check
Will also run db_stress to see if issue is gone

Reviewers: sdong, ljin, dhruba, haobo

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16983
2014-03-18 13:59:59 -07:00
Igor Canadi 3055a15b29 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
2014-03-18 13:24:27 -07:00
Igor Canadi ae25742af9 Fix race condition in manifest roll
Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked

If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)

I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.

However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?

Test Plan: make check

Reviewers: ljin, dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16929
2014-03-17 21:50:15 -07:00
Igor Canadi f0e1e3ebf1 CF cleanup part 2 2014-03-13 14:34:01 -07:00
Igor Canadi 9634ba42ac Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/db_impl.cc
	db/db_impl.h
	db/tailing_iter.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-10 17:26:09 -07:00
Haobo Xu 66da467983 [RocksDB] LogBuffer Cleanup
Summary: Moved LogBuffer class to an internal header. Removed some unneccesary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.

Test Plan: make check; db_bench compare LOG output

Reviewers: sdong

Reviewed By: sdong

CC: leveldb, igor

Differential Revision: https://reviews.facebook.net/D16707
2014-03-10 11:05:44 -07:00
sdong ecb1ffa2a8 Buffer info logs when picking compactions and write them out after releasing the mutex
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515
2014-03-05 15:36:32 -08:00
Igor Canadi 8ea21a778b [CF] Rething LogAndApply for column families
Summary:
I though I might get away with as little changes to LogAndApply() as possible. It turns out this is not the case.

This diff introduces different behavior of LogAndApply() for three cases:
1. column family add
2. column family drop
3. no-column family manipulation

(1) and (2) don't support group commit yet.

There were a lot of problems with old version od LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if column family add is in manifest, it also has to be in in-memory data structure).

Test Plan: db_stress

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16491
2014-02-28 14:46:48 -08:00
Igor Canadi 6e7cae7711 [CF] More tests
Summary: New unit tests for column families

Test Plan: this is a test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16359
2014-02-26 14:16:23 -08:00
Igor Canadi 5ad7ee03ea [CF] Log deletion in column families
Summary:
* Added unit test that verifies that obsolete files are deleted.
* Advance log number for empty column family when cutting log file.
* MinLogNumber() bug fix! (caught by the new unit test)

Test Plan: unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16311
2014-02-25 16:54:41 -08:00
Igor Canadi 76c048183c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	include/rocksdb/db.h
2014-02-14 16:46:03 -08:00
kailiu 63690625cd Expose the table properties to application
Summary: Provide a public API for users to access the table properties for each SSTable.

Test Plan: Added a unit tests to test the function correctness under differnet conditions.

Reviewers: haobo, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16083
2014-02-13 16:28:21 -08:00
Igor Canadi ccdb93e775 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/memtable_list.h
	db/version_set.cc
	db/version_set.h
2014-02-12 14:01:30 -08:00
Igor Canadi b06840aa7d [CF] Rethinking ColumnFamilyHandle and fix to dropping column families
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)

Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up

Test Plan: added some tests to column_family_test

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16101
2014-02-12 13:47:09 -08:00
Siying Dong 33042669f6 Reduce malloc of iterators in Get() code paths
Summary:
This patch optimized Get() code paths by avoiding malloc of iterators. Iterator creation is moved to mem table rep implementations, where a callback is called when any key is found. This is the same practice as what we do in (SST) table readers.

db_bench result for readrandom following a writeseq, with no compression, single thread and tmpfs, we see throughput improved to 144958 from 139027, about 3%.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D14685
2014-02-11 10:32:51 -08:00
Igor Canadi 0143abdbb0 Merge branch 'master' into columnfamilies
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/db_impl.h
	db/db_iter.cc
	db/db_test.cc
	db/dbformat.h
	db/memtable.cc
	db/memtable_list.cc
	db/memtable_list.h
	db/table_cache.cc
	db/table_cache.h
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/write_batch.cc
	db/write_batch_test.cc
	include/rocksdb/options.h
	util/options.cc
2014-02-06 15:58:20 -08:00
Igor Canadi f276e0e59d [CF] Options -> DBOptions
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15933
2014-02-05 14:56:09 -08:00
Igor Canadi c24d8c4e90 [CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.

To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.

This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 11:55:30 -08:00
Igor Canadi 29bacb2eb6 VersionSet cleanup
Summary:
Removed icmp_ from VersionSet (since it's per-column-family, not per-DB-instance)
Unfriended VersionSet and ColumnFamilyData (yay!)
Removed VersionSet::NumberLevels()
Cleaned up DBImpl

Test Plan: make check

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15819
2014-02-03 13:10:47 -08:00
Igor Canadi f7489123e2 Move compaction picker and internal key comparator to ColumnFamilyData
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15801
2014-01-31 16:06:55 -08:00
Igor Canadi 3615f534d1 Enable flushing memtables from arbitrary column families
Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families!

Test Plan: Added a new unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15789
2014-01-31 14:42:52 -08:00
Igor Canadi 6973bb1722 MakeRoomForWrite() support for column families
Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one.

Test Plan: make check does not complain

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15597
2014-01-30 16:12:08 -08:00
Igor Canadi ac92420fc5 Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2014-01-30 10:09:23 -08:00
Igor Canadi fa99d53e55 Change ColumnFamilyData from struct to class
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15579
2014-01-29 15:18:36 -08:00
Igor Canadi 20b231d712 Merge branch 'master' into columnfamilies 2014-01-29 11:47:43 -08:00
Igor Canadi f24a3ee52d Read from and write to different column families
Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber()

Test Plan: Added a unit test in column_family_test

Reviewers: dhruba, haobo, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15537
2014-01-29 11:38:16 -08:00
Igor Canadi e57f0cc1a1 add include <atomic> to version_set.h 2014-01-29 08:17:43 -08:00
kailiu a5e220f5ef Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_test.cc
	db/memtable_list.cc
	db/memtable_list.h
	table/block_based_table_reader.cc
	table/table_test.cc
	util/cache.cc
	util/coding.cc
2014-01-28 10:35:55 -08:00
Igor Canadi 4bf25357ae [column families] Removing VersionSet::current()
Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15483
2014-01-27 17:04:46 -08:00
Igor Canadi 511b03a5b5 LogAndApply to take ColumnFamilyData
Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting.

Test Plan: make check

Reviewers: dhruba, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15465
2014-01-27 13:57:58 -08:00
Igor Canadi eb055609e4 [column families] Move memtable and immutable memtable list to column family data
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15459
2014-01-27 13:37:14 -08:00
Igor Canadi ae16606f98 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set.h
2014-01-27 11:11:51 -08:00
Igor Canadi 832158e7f7 Fsync directory after we create a new file
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751
2014-01-27 11:02:21 -08:00
Igor Canadi cf783c674a Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.h
2014-01-27 10:00:24 -08:00
Igor Canadi 6c2ca1d3e6 Move NeedsCompaction() from VersionSet to Version
Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet.

Test Plan: make check

Reviewers: kailiu, haobo, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15333
2014-01-27 09:59:00 -08:00
Igor Canadi 1423e7c9de Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
	db/version_set_reduce_num_levels.cc
	util/ldb_cmd.cc
2014-01-24 15:03:54 -08:00
Igor Canadi 677fee27c6 Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.

Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.

Test Plan: reduce_levels_test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15303
2014-01-24 14:57:04 -08:00
Igor Canadi 7c5e583a27 ColumnFamilySet
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.

Let me know if you have any comments. I will commit tomorrow.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15357
2014-01-23 14:03:38 -08:00
Igor Canadi 23f6791c9e Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/db_test.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
	db/version_set_reduce_num_levels.cc
2014-01-21 17:01:52 -08:00
Kai Liu d4f65f1683 Merge branch 'master' into performance
This patch merges master's changes on build_tools/format-diff.sh.
Conflicts:
	db/version_edit.cc
2014-01-16 14:31:18 -08:00
Igor Canadi 6d6fb70960 Remove compaction pointers
Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15225
2014-01-16 14:06:53 -08:00
Igor Canadi c699c84af4 CompactionPicker
Summary:
This is a big one. This diff moves all the code related to picking compactions from VersionSet to new class CompactionPicker. Column families' compactions will be completely separate processes, so we need to have multiple CompactionPickers.

To make this easier to review, most of the code change is just copy/paste. There is also a small change not to use VersionSet::current_, but rather to take `Version* version` as a parameter. Most of the other code is exactly the same.

In future diffs, I will also make some improvements to CompactionPickers. I think the most important part will be encapsulating it better. Currently Version, VersionSet, Compaction and CompactionPicker are all friend classes, which makes it harder to change the implementation.

This diff depends on D15171, D15183, D15189 and D15201

Test Plan: `make check`

Reviewers: kailiu, sdong, dhruba, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15207
2014-01-16 13:03:52 -08:00
kailiu 1304d8c8ce Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_impl.h
	db/db_test.cc
	db/memtable.cc
	db/memtable.h
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/hash_skiplist_rep.cc
	util/options.cc
2014-01-15 23:12:31 -08:00
Igor Canadi 787f11bb3b Move more functions from VersionSet to Version
Summary:
This moves functions:
* VersionSet::Finalize() -> Version::UpdateCompactionStats()
* VersionSet::UpdateFilesBySize() -> Version::UpdateFilesBySize()

The diff depends on D15189, D15183 and D15171

Test Plan: make check

Reviewers: kailiu, sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15201
2014-01-15 16:23:36 -08:00
Igor Canadi 615d1ea2f4 Moving Compaction class to separate header file
Summary:
I'm sure we'll all agree that version_set.cc needs simplifying. This diff moves Compaction class to a separate file.

The diff depends on D15171 and D15183

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15189
2014-01-15 16:22:34 -08:00
Igor Canadi 2f4eda7890 Move functions from VersionSet to Version
Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.

The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files

The diff continues on (and depends on) D15171

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong, emayanke

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15183
2014-01-15 16:18:04 -08:00
Igor Canadi 65a8a52b54 Decrease reliance on VersionSet::NumberLevels()
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()

This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.

I have also slightly changed how VersionSet keeps track of manifest size.

This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.

In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_

Test Plan: make check

Reviewers: haobo, dhruba, kailiu, sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D15171
2014-01-15 16:15:43 -08:00
Igor Canadi d9cd7a063f Fix CompactRange to apply filter to every key
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.

This patch fixed the unit test.

Test Plan: Added a failing unit test and a fix, so it's not failing anymore.

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D14421
2014-01-14 16:19:09 -08:00
Siying Dong fbbf0d1456 Pre-calculate whether to slow down for too many level 0 files
Summary: Currently in DBImpl::MakeRoomForWrite(), we do  "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.

Test Plan:
make all check
Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files.

Reviewers: haobo, kailiu, igor

Reviewed By: kailiu

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D15141
2014-01-14 11:23:02 -08:00
Igor Canadi a107691711 [column families] Implement refcounting ColumnFamilyData
Summary: We don't want to delete ColumnFamilyData object if somebody has references to it.

Test Plan: `make check` for now, but will need to implement bigger column family test case

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15111
2014-01-13 09:22:44 -08:00
Igor Canadi d076cef347 [column families] Get rid of VersionSet::current_ and keep current Version for each column family
Summary:
The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family.

I have also fixed some smaller things in VersionSet that made it easier to implement Column family support.

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15105
2014-01-13 08:59:15 -08:00
Siying Dong aa0ef6602d [Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator()
Summary:
In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().

We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.

Test Plan: add a test case in db_test

Reviewers: haobo, kailiu

Reviewed By: haobo

CC: dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D15039
2014-01-10 15:57:49 -08:00
Igor Canadi 72918efffe [column families] Implement DB::OpenWithColumnFamilies()
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()

I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.

Test Plan: Added a simple unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15033
2014-01-07 11:05:50 -08:00
Igor Canadi ef6ad1708d [column families] Support to create and drop column families
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733

It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.

Test Plan: Added unit test

Reviewers: dhruba, haobo, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15021
2014-01-03 01:12:16 -08:00
kailiu 476416c27c Some minor refactoring on the code
Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.

Test Plan: ran `make check`

Reviewers: haobo, dhruba, sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15009
2014-01-02 16:32:31 -08:00
Igor Canadi 7535443083 [RocksDB] Support for column families in manifest
Summary:
<This diff is for Column Family branch>

Added fields in manifest file to support adding and deleting column families.

Pretty simple change, each version edit record can be:
1. add column family
2. drop column family
3. add and delete N files from a single column family (compactions and flushes will generate such records)

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14733
2014-01-02 04:18:28 -08:00
kailiu f1cec73a76 Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
2013-12-27 12:23:17 -08:00
Igor Canadi 1fdb3f7dc6 [RocksDB] Optimize locking for Get
Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.

Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:

  overwrite                    56345  ->   63001
  fillseq                      193730 ->  185296
  readrandom                   771301 -> 1219803 (58% improvement!)
  readrandom_smallblockcache   677609 ->  862850
  readrandom_memtable_sst      710440 -> 1109223
  readrandom_fillunique_random 221589 ->  247869
  memtablefillrandom           105286 ->   92643
  memtablereadrandom           763033 -> 1288862

Test Plan:
make asan_check
I am also running db_stress

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14679
2013-12-20 09:57:58 -08:00
Siying Dong aaf9c6203c [RocksDB][Performance Branch]Iterator Cleanup method only tries to find obsolete files if it has the last reference to a version
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.

Test Plan: make all check

Reviewers: haobo, igor, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14595
2013-12-11 13:59:43 -08:00
Siying Dong a8029fdc75 Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415

Conflicts:
	db/db_impl.cc
	db/memtable.cc
2013-12-11 11:37:28 -08:00
Siying Dong ef2211a9ca [RocksDB Performance Branch] Introduce MergeContext to Lazily Initialize merge operand list
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases

Test Plan: make all check

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14415
2013-12-06 10:28:59 -08:00
Igor Canadi 9bc4a26f56 Small changes in Deleting obsolete files
Summary:
@haobo's suggestions from https://reviews.facebook.net/D13827

Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop.

I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though.

Test Plan: Ran deletefile_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14025
2013-11-12 11:53:26 -08:00
Igor Canadi 1510339e52 Speed up FindObsoleteFiles
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.

It might speed things up a bit, but makes code uglier. I don't really like it.

Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.

Let's discuss offline today.

Test Plan: Ran ./db_stress, verified that files are getting deleted

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D13827
2013-11-08 15:23:46 -08:00
Haobo Xu 8cbe5bb56b [RocksDB] Add OnCompactionStart to CompactionFilter class
Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13683
2013-10-31 13:36:43 -07:00
Siying Dong 9edda37027 Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress
Summary:
This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases.

Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later.

Test Plan:
add test
DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13467
2013-10-17 13:33:39 -07:00
Dhruba Borthakur 9cd221094c Add appropriate LICENSE and Copyright message.
Summary:
Add appropriate LICENSE and Copyright message.

Test Plan:
make check

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-16 17:48:41 -07:00
Siying Dong 073cbfc8f0 Enable background flush thread by default and fix issues related to it
Summary:
Enable background flush thread in this patch and fix unit tests with:
(1) After background flush, schedule a background compaction if condition satisfied;
(2) Fix a bug that if universal compaction is enabled and number of levels are set to be 0, compaction will not be automatically triggered
(3) Fix unit tests to wait for compaction to finish instead of flush, before checking the compaction results.

Test Plan: pass all unit tests

Reviewers: haobo, xjin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13461
2013-10-16 13:32:53 -07:00
Kai Liu 86ef6c3f74 Add statistics to sst file
Summary:
So far we only have key/value pairs as well as bloom filter stored in the
sst file.  It will be great if we are able to store more metadata about
this table itself, for example, the entry size, bloom filter name, etc.

This diff is the first step of this effort. It allows table to keep the
basic statistics mentioned in http://fburl.com/14995441, as well as
allowing writing user-collected stats to stats block.

After this diff, we will figure out the interface of how to allow user to collect their interested statistics.

Test Plan:
1. Added several unit tests.
2. Ran `make check` to ensure it doesn't break other tests.

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13419
2013-10-14 15:56:13 -07:00
Dhruba Borthakur 4463b11cad Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.
Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.

Test Plan: make check

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13311
2013-10-06 00:14:26 -07:00
Dhruba Borthakur a143ef9b38 Change namespace from leveldb to rocksdb
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.

Test Plan: compile rocksdb

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13287
2013-10-04 11:59:26 -07:00
Dhruba Borthakur 4012ca1c7b Added a parameter to limit the maximum space amplification for universal compaction.
Summary:
Added a new field called max_size_amplification_ratio in the
CompactionOptionsUniversal structure. This determines the maximum
percentage overhead of space amplification.

The size amplification is defined to be the ratio between the size of
the oldest file to the sum of the sizes of all other files. If the
size amplification exceeds the specified value, then min_merge_width
and max_merge_width are ignored and a full compaction of all files is done.
A value of 10 means that the size a database that stores 100 bytes
of user data could occupy 110 bytes of physical storage.

Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added.

Reviewers: haobo, emayanke, xjin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12825
2013-09-13 16:27:18 -07:00
Dhruba Borthakur fc0c399d2e Introduced a new flag non_blocking_io in ReadOptions.
Summary:
If ReadOptions.non_blocking_io is set to true, then KeyMayExists
and Iterators will return data that is cached in RAM.
If the Iterator needs to do IO from storage to serve the data,
then the Iterator.status() will return Status::IsRetry().

Test Plan:
Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs
issues from storage. Added DBTest.NonBlockingIteration to verify
nonblocking Iterations.

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Maniphest Tasks: T63

Differential Revision: https://reviews.facebook.net/D12531
2013-08-28 10:49:14 -07:00
Tyler Harter 4504c99030 Internal/user key bug fix.
Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set.

Test Plan: ./filter_block_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12501
2013-08-23 14:49:57 -07:00
Tyler Harter 94cf218720 Revert "Prefix scan: db_bench and bug fixes"
This reverts commit c2bd8f4824.
2013-08-22 18:01:11 -07:00
Tyler Harter c2bd8f4824 Prefix scan: db_bench and bug fixes
Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target.  Still need to add statistics (perhaps in a separate diff).

Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D12273
2013-08-22 16:06:50 -07:00
Simha Venkataramaiah 60bf2b7d4a Add APIs to query SST file metadata and to delete specific SST files
Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures.

Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api.

I have used std::unique_ptr - will switch to boost:: since this is external. thoughts?

Unit test is fragile right now as it expects the compaction at certain levels.

Test Plan: unittest

Reviewers: dhruba, vamsi, emayanke

CC: zshao, leveldb, haobo

Task ID: #

Blame Rev:
2013-08-22 15:27:19 -07:00
Dhruba Borthakur 93d77a27d2 Universal Compaction should keep DeleteMarkers unless it is the earliest file.
Summary:
The pre-existing code was purging a DeleteMarker if thay key did not
exist in deeper levels.  But in the Universal Compaction Style, all
files are in Level0. For compaction runs that did not include the
earliest file, we were erroneously purging the DeleteMarkers.

The fix is to purge DeleteMarkers only if the compaction includes
the earlist file.

Test Plan: DBTest.Randomized triggers this code path.

Differential Revision: https://reviews.facebook.net/D12081
2013-08-09 14:03:57 -07:00
Dhruba Borthakur f5fa26b6a9 Merge branch 'performance' of github.com:facebook/rocksdb into performance
Conflicts:
	db/builder.cc
	db/db_impl.cc
	db/version_set.cc
	include/leveldb/statistics.h
2013-08-07 11:58:06 -07:00
Deon Nicholas c2d7826ced [RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences.
Summary:
Here are the major changes to the Merge Interface. It has been expanded
to handle cases where the MergeOperator is not associative. It does so by stacking
up merge operations while scanning through the key history (i.e.: during Get() or
Compaction), until a valid Put/Delete/end-of-history is encountered; it then
applies all of the merge operations in the correct sequence starting with the
base/sentinel value.

I have also introduced an "AssociativeMerge" function which allows the user to
take advantage of associative merge operations (such as in the case of counters).
The implementation will always attempt to merge the operations/operands themselves
together when they are encountered, and will resort to the "stacking" method if
and only if the "associative-merge" fails.

This implementation is conjectured to allow MergeOperator to handle the general
case, while still providing the user with the ability to take advantage of certain
efficiencies in their own merge-operator / data-structure.

NOTE: This is a preliminary diff. This must still go through a lot of review,
revision, and testing. Feedback welcome!

Test Plan:
  -This is a preliminary diff. I have only just begun testing/debugging it.
  -I will be testing this with the existing MergeOperator use-cases and unit-tests
(counters, string-append, and redis-lists)
  -I will be "desk-checking" and walking through the code with the help gdb.
  -I will find a way of stress-testing the new interface / implementation using
db_bench, db_test, merge_test, and/or db_stress.
  -I will ensure that my tests cover all cases: Get-Memtable,
Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0,
Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found,
end-of-history, end-of-file, etc.
  -A lot of feedback from the reviewers.

Reviewers: haobo, dhruba, zshao, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11499
2013-08-05 20:14:32 -07:00
Dhruba Borthakur 711a30cb30 Merge branch 'master' into performance
Conflicts:
	include/leveldb/options.h
	include/leveldb/statistics.h
	util/options.cc
2013-08-02 10:22:08 -07:00