Summary:
Currently, deadlock cycles are held in std::unordered_map. The problem with it is that it allocates/deallocates memory on every insertion/deletion. This limits throughput since we're doing this expensive operation while holding a global mutex. Fix this by using a vector which caches memory instead.
Running the deadlock stress test, this change increased throughput from 39k txns/s -> 49k txns/s. The effect is more noticeable in MyRocks.
Closes https://github.com/facebook/rocksdb/pull/1545
Differential Revision: D4205662
Pulled By: lth
fbshipit-source-id: ff990e4
Summary:
Currently, in the Direct I/O read mode, the last sector of the file, if not full, is not handled correctly. If the return value of pread is not multiplier of kSectorSize, we still go ahead and continue reading, even if the buffer is not aligned. With the commit, if the return value is not multiplier of kSectorSize, and all but the last sector has been read, we simply return.
Closes https://github.com/facebook/rocksdb/pull/1550
Differential Revision: D4209609
Pulled By: lightmark
fbshipit-source-id: cb0b439
Summary:
This patch clarifies the contract of PositionedAppend with some unit
tests and also implements it for PosixWritableFile. (Tasks: 14524071)
Closes https://github.com/facebook/rocksdb/pull/1514
Differential Revision: D4204907
Pulled By: maysamyabandeh
fbshipit-source-id: 06eabd2
Summary:
It is hard to measure acutal memory usage by std containers. Even
providing a custom allocator will miss count some of the usage. Here we
only do a wild guess on its memory usage.
Closes https://github.com/facebook/rocksdb/pull/1511
Differential Revision: D4179945
Pulled By: yiwu-arbug
fbshipit-source-id: 32ab929
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.
This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
implementation to return a hint of insert position, which can be later
pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419
Differential Revision: D4079367
Pulled By: yiwu-arbug
fbshipit-source-id: 3555326
Summary:
This fixes a correctness issue where ranges with same begin key would overwrite each other.
This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator.
End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix.
Closes https://github.com/facebook/rocksdb/pull/1484
Differential Revision: D4155248
Pulled By: ajkr
fbshipit-source-id: 304b4b9
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.
added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456
Differential Revision: D4111271
Pulled By: ajkr
fbshipit-source-id: 6e388d4
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439
Differential Revision: D4100943
Pulled By: yiwu-arbug
fbshipit-source-id: 79f7218
Summary:
Rocksdb currently has many references to std::map.emplace_back()
which is not implemented in gcc 4.7, but valid in gcc 4.8. Confirmed that
it did not build with gcc 4.7, but builds fine with gcc 4.8
Closes https://github.com/facebook/rocksdb/pull/1272
Differential Revision: D4101385
Pulled By: IslamAbdelRahman
fbshipit-source-id: f6af453
* util/build_verion.cc.in: add this file, so cmake and make can share the
template file for generating util/build_version.cc.
* CMakeLists.txt: also, cmake v2.8.11 does not support file(GENERATE ...),
so we are using configure_file() for creating build_version.cc.
* Makefile: use util/build_verion.cc.in for creating build_version.cc.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Summary: Using real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Turn away from the real time. Use fake time instead.
Test Plan: Run the tests and all existing tests.
Reviewers: yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65145
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.
Test Plan: make all check -j64
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65121
Summary:
reland https://reviews.facebook.net/D62523
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: sdong, yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65211
Summary:
In the current implementation of RateLimiter, the difference
between the configured rate and the actual rate might be more
than 20%, while our test only allows 15% difference. This diff
relaxes the acceptable bias RateLimiterTest::Rate test be 25%
to make the test less flaky.
Test Plan: rate_limiter_test
Reviewers: IslamAbdelRahman, andrewkr, yiwu, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64941
Summary: A convience method to atomically get and reset ticker count. I'm wanting to use it to have a thin wrapper to the statistics object to export ticker counts to ODS for LogDevice (since they don't even use fb303).
Test Plan:
test in LogDevice shadow cluster.
https://fburl.com/461868822
Reviewers: andrewkr, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64869
Summary:
We always run consistency checks when compiling in debug mode
allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode
Test Plan:
make check -j64
make release
Reviewers: lightmark, sdong, yiwu
Reviewed By: yiwu
Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D64701
Summary:
I saw this exception thrown because sometimes we may resize with -ve value
if we have empty max_bytes_for_level_multiplier_additional vector
Test Plan: run the tests
Reviewers: yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64791
Hello and thank you for RocksDB,
I noticed when using log_write_bench that writes were always 88 bytes:
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88
> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
I think this should be:
<< record.assign('X', FLAGS_record_size);
>> record.assign(FLAGS_record_size, 'X');
So fill and not buffer. Otherwise I always see writes of size 88 (the decimal value for chr "X").
string& assign (const char* s, size_t n);
buffer - Copies the first n characters from the array of characters pointed by s.
string& assign (size_t n, char c);
fill - Replaces the current value by n consecutive copies of character c.
perl -le 'print ord "X"'
88
With the change:
> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249
Thanks.
01c27be5fbhttps://reviews.facebook.net/D16239
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: andrewkr, yhchiang, yiwu, sdong
Reviewed By: sdong
Subscribers: hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62523
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
MyRocks build is broken because they are using "-Werror=missing-field-initializers"
We should fix that by explicitly passing these arguments
Test Plan: Build MyRocks
Reviewers: sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64161
Summary: 0.9 can make the test flaky since just found one test fail with 0.88
Test Plan: make all check
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63939
Summary: EnvPosixTestWithParam.TwoPools relies on explicit sleeping, so it sometimes fail. Fix it.
Test Plan: Run tests with high parallelism many times and make sure the test passes.
Reviewers: yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63417
Summary: Add mutable options info into `OptionsTypeInfo` and use it to parse mutable options map. Also support `max_bytes_for_level_multiplier_additional` in option file.
Test Plan: unit test
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63843
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63945
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file
I will implement it for other Envs later after finishing the whole task for AddFile()
Test Plan: unit tests
Reviewers: andrewkr, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62433
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
Summary: Fix two Windows build problems.
Test Plan: Build on Windows and run all Linux tests.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63189
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).
* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.
Test Plan: make all check
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62193
Summary:
To reduce contention for atomics when HistogramStats are shared across
threads, this diff makes them thread-specific so updates are faster. This comes
at the expense of slower reads (much less frequent), which now require merging
all histograms. In this diff,
- Thread-specific HistogramImpl is created upon the thread's first measureTime()
- Thread-specific HistogramImpl are merged and deleted upon thread termination or ThreadLocalPtr destruction, whichever comes first
- getHistogramString() and histogramData() merge all histograms, both thread-specific and previously merged ones
Test Plan:
unit tests, ran db_bench and verified histograms look similar
before:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
...
+ 7.63% db_bench db_bench [.] rocksdb::HistogramStat::Add
after:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
...
+ 0.98% db_bench db_bench [.] rocksdb::HistogramStat::Add
Reviewers: sdong, MarkCallaghan, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62649
Summary:
Fix ClockCache memory leak found by valgrind:
# Add destructor to cleanup cached values.
# Delete key with cache handle immediately after handle is recycled, and erase table entry immediately if duplicated cache entry is inserted.
Test Plan:
make DISABLE_JEMALLOC=1 valgrind_check
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62973
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.
We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification
Test Plan: added new tests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: yiwu, leveldb, march, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58707
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)
This allow us to expose ThreadPool to the user so we can use it as an option later
Test Plan: existing unit tests
Reviewers: andrewkr, yiwu, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62085
Summary:
The global atomics we previously used for tickers had poor cache performance
since they were typically updated from different threads, causing frequent
invalidations. In this diff,
- recordTick() updates a local ticker value specific to the thread in which it was called
- When a thread exits, its local ticker value is added into merged_sum
- getTickerCount() returns the sum of all threads' local ticker values and the merged_sum
- setTickerCount() resets all threads' local ticker values and sets merged_sum to the value provided by the caller.
In a next diff I will make a similar change for histogram stats.
Test Plan:
before:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
$ perf report -g --stdio | grep recordTick
7.59% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick
...
after:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
$ perf report -g --stdio | grep recordTick
1.46% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick
...
Reviewers: kradhakrishnan, MarkCallaghan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: yiwu, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62337
Summary:
We used to allow insert into full block cache as long as `strict_capacity_limit=false`. This diff further restrict insert to full cache if caller don't intent to hold handle to the cache entry after insert.
Hope this diff fix the assertion failure with db_stress: https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=211853102&step_id=2475070014
db_stress: util/lru_cache.cc:278: virtual void rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*): Assertion `lru_.next == &lru_' failed.
The assertion at lru_cache.cc:278 can fail when an entry is inserted into full cache and stay in LRU list.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62325
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.
Depends on D61977.
Test Plan: See unit test.
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D62241
Summary:
This function allows the user to provide a custom function to fold all
threads' local data. It will be used in my next diff for aggregating statistics
stored in thread-local data. Note the test case uses atomics as thread-local
values due to the synchronization requirement (documented in code).
Test Plan: unit test
Reviewers: yhchiang, sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62049
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.
Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.
In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.
Test Plan: unit test.
Reviewers: IslamAbdelRahman, sdong, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D61977
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader
Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )
Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61473
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.
Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.
Reviewers: kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61647
* Create rate limiter using factory function in the test.
* Convert function local statics in option helper to a C array
that does not perform dynamic memory allocation. This is helpful
when you try to memory isolate different DB instances.
Summary: ... so that I can include the header and create LRUCache specific tests for D61977
Test Plan:
make check
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62145
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).
Test Plan: Tested the statistics using write_stress and compaction_job_stats_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61749
Summary:
We were frequently seeing a race between SyncPoint::Process() and
SyncPoint::~SyncPoint() (e.g.,
https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=207289975&step_id=2412725431).
The issue was marked_thread_id_ gets deleted when the main thread is exiting and
simultaneously background threads may access it. We can prevent this race
condition by checking whether sync points are disabled (assuming the test terminates
with them disabled) before attempting to access that member. I do not understand
why accesses to other members (mutex_ and enabled_) are ok but anyways the
test no longer fails tsan.
Test Plan: ran tests
Reviewers: sdong, yhchiang, IslamAbdelRahman, yiwu, wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62133
Summary:
Env holds a pointer of ThreadStatusUpdater, which will be deleted when
Env is deleted. However, in case a rocksdb database is deleted after
Env is deleted. Then this will introduce a free-after-use of this
ThreadStatusUpdater.
This patch fix this by never deleting the ThreadStatusUpdater in Env,
which is in general safe as Env is a singleton in most cases.
Test Plan: thread_list_test
Reviewers: andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59187
Summary: preparation for detecting Cache type. If SimCache, we then may trigger some command like "setSimCapacity()" with setOptions()
Test Plan: make all check
Reviewers: yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61953
Summary:
This diff improves the documentation for GetOptionsFromMap APIs and
fixes a bug in GetOptionsFromMap functions in convenience.h
where new_options will still be changed when the function
call is not successful.
Test Plan: options_test
Reviewers: IslamAbdelRahman, kradhakrishnan, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61731
From the Linux manual:
MAP_ANONYMOUS
The mapping is not backed by any file; its contents
are initialized to zero. The fd and offset arguments are
ignored; however, some implementations require fd to be -1
if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable
applications should ensure this.
FreeBSD is such a case, it wil just return an error.
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).
Test Plan: Add `date_tiered_test`.
Reviewers: dhruba, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61653
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.
Test Plan: make all check -j64
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61239
Summary:
The patch is a continuation of part 5. It glues the abstraction for
file layout and metadata, and flush out the implementation of the API. It
adds unit tests for the implementation.
Test Plan: Run unit tests
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57549
Summary:
The call stack used to look like this during static initialization:
#0 0x00000000008032d1 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:172
#1 0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
#2 0x000000000080310f in rocksdb::ThreadLocalPtr::StaticMeta::Mutex() () at util/thread_local.cc:141
#3 0x0000000000803103 in rocksdb::ThreadLocalPtr::StaticMeta::InitSingletons() () at util/thread_local.cc:139
#4 0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106
It involves outer/inner classes and the call stacks goes
outer->inner->outer->inner, which is too difficult to understand. We can avoid
a level of back-and-forth by skipping StaticMeta::InitSingletons(), which
doesn't initialize anything beyond what ThreadLocalPtr::Instance() already
initializes.
Now the call stack looks like this during static initialization:
#0 0x00000000008032c5 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:170
#1 0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
#2 0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106
Test Plan:
unit tests
verify StaticMeta::mutex_ is still initialized in DefaultEnv() (StaticMeta::mutex_ is the only variable intended to be initialized via StaticMeta::InitSingletons() which I removed)
#0 0x00000000005cee17 in rocksdb::port::Mutex::Mutex(bool) (this=0x7ffff69500b0, adaptive=false) at port/port_posix.cc:52
#1 0x0000000000769cf8 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff6950000) at util/thread_local.cc:168
#2 0x0000000000769a53 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:133
#3 0x0000000000769a09 in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:105
#4 0x0000000000647d98 in rocksdb::Env::Default() () at util/env_posix.cc:845
Reviewers: lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: arahut, IslamAbdelRahman, yiwu, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60813
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60513
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)
Test Plan: added new unit test
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61083
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.
Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.
We have specifically tested and adapted the IO path for NVM and SSD.
Persistent cache tier consists of there parts :
1) File layout
Provides the implementation for handling IO path for reading and writing data
(key/value pair).
2) Meta-data
Provides the implementation for handling the index for persistent read cache.
3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface
This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.
Test Plan: Compile and run check
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57117
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.
Test Plan:
make check -j64
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59655
Summary: In Zlib_Compress and BZip2_Compress the calculation for size was slightly off when using compression_foramt_version 2 (which includes the decompressed size in the output). Also there were unnecessary loops around the deflate/BZ2_bzCompress calls. In Zlib_Compress there was also a possible exit from the function after calling deflateInit2 that didn't call deflateEnd.
Test Plan: Standard tests
Reviewers: sdong, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: sdong, IslamAbdelRahman, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60537
Summary:
I was investigating performance issues in the SstFileWriter and found all of the following:
- The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time. Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function.
- In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key.
- The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed. I added the move constructor which also required a copy constructor and assignment operator.
- The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added. I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value.
- The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function.
- BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key). It is more efficient just to copy the whole new key over.
- Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on.
- FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time.
- Each PutVarint??() function has a buffer and calls std::string::append(). Two or three calls in a row could share a buffer and a single call to std::string::append().
Some of these will be helpful outside of the SstFileWriter. I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility? I tried it on gcc 4.8 and 4.9.
Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added. The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas.
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59607
Summary: this test support CLI option to select HdfsEnv/NativeHdfsEnv now. The latter one is default. add test about when Rename(src, target) should overwrite target
Test Plan: existing test
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60399
Summary:
Use @omegaga's awesome feature to avoid use of callbacks for ensuring
SyncPoints happen in a particular thread.
Depends on D60375.
Test Plan:
$ ./auto_roll_logger_test
Reviewers: omegaga, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, omegaga, leveldb
Differential Revision: https://reviews.facebook.net/D60471
Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point.
Test Plan: Write a unit test to validate.
Reviewers: sdong, IslamAbdelRahman, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60375
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
Summary: LockFile is unnecessary in unit test
Test Plan: env_basic_test.cc
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60285
Summary:
Previously we couldn't run env_basic_test on Env::Default (PosixEnv on
our platforms) since GetChildren*() behavior was inconsistent with our other
Envs. We can normalize the output of GetChildren*() such that these test cases
work on PosixEnv too.
Test Plan: ran env_basic_test
Reviewers: wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59943
Summary:
move cleanup to TearDown and handle directories, so cleanup will happen
even if a test fails in the middle.
Test Plan: ./env_basic_test
Reviewers: wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60243
Summary: ensure no 2nd level children under test_dir_
Test Plan: env_basic_test on 4 envs
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59979
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.
Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59499
Summary: filter_deltes is not a frequently used feature. Remove it.
Test Plan: Run all test suites.
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59427
Summary: We introduced default slow down and stop condition, but didn't reset it in bulk load mode. Fix it.
Test Plan: N/A
Reviewers: igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59757
Summary: Fixed a crash bug that incorrectly parse deprecated options in options_helper
Test Plan:
run db_bench with an old options file with memtable_prefix_bloom_probes
./db_bench --options_file=AN_OLD_OPTIONS_FILE --num=100 --benchmarks=fillseq
Reviewers: sdong, IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59787
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.
Test Plan: See unit test.
Reviewers: dhruba, horuff, sdong
Reviewed By: sdong
Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57813
Summary:
Try to decompress compressed blocks when a special flag is set.
assert and crash in debug builds if we can't decompress the just-compressed input.
Test Plan: Run unit-tests.
Reviewers: dhruba, andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59145
Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them.
Test Plan: Run existing tests
Reviewers: andrewkr, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59409
Summary: delete_scheduler_test and db_sst_test share a same directory name, causing possible fails on both tests when running in parallel. Fixed by changing directory name.
Test Plan: Run the two tests in parallel: `parallel -u ./{} ::: delete_scheduler_test db_sst_test`
Reviewers: sdong, andrewkr
Reviewed By: sdong, andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59529
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio
Test Plan: Run all existing tests
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59199
Summary: AFIK, options builder is not used by anyone. Remove it.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, andrewkr, igor
Reviewed By: igor
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59319
GCC 5.4 will complain (see also options_parser.cc):
/home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc: In function 'rocksdb::CompactionStyle rocksdb::{anonymous}::PickCompactionStyle(size_t, int, int, uint64_t)':
/home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:29:7: error: 'log' is not a member of 'std'
std::log(target_db_size / write_buffer_size) / std::log(kBytesForLevelMultiplier)));
^
/home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:29:7: note: suggested alternative:
In file included from /usr/include/features.h:365:0,
from /usr/include/math.h:26,
from /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:6:
/usr/include/bits/mathcalls.h:109:1: note: 'log'
__MATHCALL_VEC (log,, (_Mdouble_ __x));
Summary:
- Provide env_test as a static library. We will build it for future releases so internal Envs can use env_test by linking against this library.
- Add tests for CustomEnv, which is configurable via ENV_TEST_URI environment variable. It uses the URI-based Env lookup (depends on D58449).
- Refactor env_basic_test cases to use a unique/configurable directory for test files.
Test Plan:
built a test binary against librocksdb_env_test.a. It registered the
default Env with URI prefix "a://".
- verify runs all CustomEnv tests when URI with correct prefix is provided
```
$ ENV_TEST_URI="a://ok" ./tmp --gtest_filter="CustomEnv/*"
...
[ PASSED ] 12 tests.
```
- verify runs no CustomEnv tests when URI with non-matching prefix is provided
```
$ ENV_TEST_URI="b://ok" ./tmp --gtest_filter="CustomEnv/*"
...
[ PASSED ] 0 tests.
```
Reviewers: ldemailly, IslamAbdelRahman, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58485
Summary:
Extracted basic Env-related tests from mock_env_test and memenv_test into a
parameterized test for Envs: env_basic_test.
Depends on D58449. (The dependency is here only so I can keep this series of
diffs in a chain -- there is no dependency on that diff's code.)
Test Plan: ran tests
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58635
Summary:
Direct IO checkin breaks Windows build. Fixing the code to work for
Windows.
Test Plan: Run env_test in Windows 10 and make check in Linux
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59073
Summary: When rate_bytes_per_sec * refill_period_us_ overflows, the actual limited rate is very low. Handle this case so the rate will be large.
Test Plan: Add a unit test for it.
Reviewers: IslamAbdelRahman, andrewkr
Reviewed By: andrewkr
Subscribers: yiwu, lightmark, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58929
Summary:
Made it consistent with posix Env, which uses pread() that returns 0
(success) when an offset is given beyond EOF. The purpose of making these Envs
behave consistently is I am repurposing the in-memory Envs' tests for the basic
Env tests in D58635.
Test Plan: ran mock_env_test and memenv_test
Reviewers: IslamAbdelRahman, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58845
Summary: util/threadpool.cc's function name is the same as a well-known class name. It breaks unity build. Rename it.
Test Plan: Run all existing test.
Reviewers: yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58881
Summary:
O_DIRECT is not available in Mac as a flag for open. The fix is to make
use of fctl after the file is opened
Test Plan: Run the tests on mac and Linux
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58665
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options
Test Plan: UT
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, lgalanis, dhruba
Differential Revision: https://reviews.facebook.net/D58653
Summary:
This patch adds direct IO capability to RocksDB Env.
The direct IO capability is required for persistent cache since NVM is best
accessed as 4K direct IO. SSDs can leverage direct IO for reading.
Direct IO requires the offset and size be sector size aligned, and memory to
be kernel page aligned. Since neither RocksDB/Persistent read cache data
layout is aligned to sector size, the code can accommodate reading unaligned IO size
(or unaligned memory) at the cost of an alloc/copy.
The write code path expects the size and memory to be aligned.
Test Plan: Run RocksDB unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57393
Summary:
Persistent read cache relies on the accuracy of the GetUniqueIdFromFile
to generate a unique key for a given block of data. Currently we don't have an
implementation for Mac.
This patch adds an implementation.
Test Plan: Run unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58413
Summary:
Expose a simple function to convert CompressionType to it's corresponding option string
This is for a diff @yoshinorim is working on for MyRocks
Test Plan: unittest
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D58215
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.
RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode.
Blame Rev:
Test Plan: Run unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55707
Conditionally retrofit thread_posix for use with std::thread
and reuse the same logic. Posix users continue using Posix interfaces.
Enable XPRESS compression in test runs.
Fix master introduced signed/unsigned mismatch.
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]
The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.
Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.
We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))
So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?
Test Plan: provided test.
Reviewers: sdong
Subscribers: andrewkr, leveldb, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D57645
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.
modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
- log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.
- Recovery/writepath is now aware of prepared batches and commit and rollback markers.
Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.
Reviewers: IslamAbdelRahman, sdong
Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D56919
Summary:
On Mac OS X, the chroot directory we typically use ("/tmp") is actually
a symlink for "/private/tmp". Since we dereference symlinks in user-defined
paths, we must also dereference symlinks in chroot_dir_ such that we can perform
string comparisons on those paths.
Test Plan: ran env_test on Mac OS X and devserver
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57957
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.
I have also updated CompactionJobInfo to include the compression algorithm used in compaction
Test Plan:
added new unittest
existing unittests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: lightmark, andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D57669
Summary:
For testing backups, we needed an Env that is fully isolated from other
Envs on the same machine. Our in-memory Envs (MockEnv and InMemoryEnv) were
insufficient because they don't implement most directory operations.
This diff introduces a new Env, "ChrootEnv", that translates paths such that the
chroot directory appears to be the root directory. This way, multiple Envs can
be isolated in the filesystem by using different chroot directories. Since we
use the filesystem, all directory operations are trivially supported.
Test Plan:
I parameterized the existing EnvPosixTest so it runs tests on ChrootEnv
except the ioctl-related cases.
Reviewers: sdong, lightmark, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57543
Summary:
This is needed so that rocksdb users can add more
commands to the included ldb tool by adding more custom
commands.
Test Plan: make -j ldb
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57243
Summary:
For backwards compatibility with older option strings, the parser needs
to treat this argument as optional.
Test Plan:
Updated unit test to cover case where compression_opts is present but
max_dict_bytes is omitted.
Reviewers: MarkCallaghan, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57759
Summary: We changed default options of max_open_files and max_file_opening_threads but didn't revert it in OptimizeForSmallDb().
Test Plan: Add a unit test
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57675
Summary:
Add an option `iterator_readahead_size` to `ReadOptions` to enable
configurable readahead for iterators similar to the corresponding
option for compaction.
Test Plan:
```
make commit_prereq
```
Reviewers: kumar.rangarajan, ott, igor, sdong
Reviewed By: sdong
Subscribers: yiwu, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55419
Summary:
Changing several option defaults:
options.max_open_files changes from 5000 to -1
options.base_background_compactions changes from max_background_compactions to 1
options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords
options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize
Test Plan: Write unit tests to see OldDefaults() works as expected.
Reviewers: IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56427
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
Summary:
Introduced option to dump malloc statistics using new option flag.
Added new command line option to db_bench tool to enable this
funtionality.
Also extended build to support environments with/without jemalloc.
Test Plan:
1) Build rocksdb using `make` command. Launch the following command
`./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
--num=10000000` end verified that jemalloc dump is present in LOG file.
2) Build rocksdb using `DISABLE_JEMALLOC=1 make db_bench -j32` and ran
the same db_bench tool and found the following message in LOG file:
"Please compile with jemalloc to enable malloc dump".
3) Also built rocksdb using `make` command on MacOS to verify behavior
in non-FB environment.
Also to debug build configuration change temporary changed
AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
command line. For both 2) and 3) this flag was not present.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57321
Summary:
The current implementation find the first different byte and try to increment it, if it cannot it return the original key
we can improve this by keep going after the first different byte to find the first non 0xFF byte and increment it
After trying this patch on some logdevice sst files I see decrease in there index block size by 8.5%
Test Plan: existing tests and updated test
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56241
* Musl libc does not provide adaptive mutex. Added feature test for PTHREAD_MUTEX_ADAPTIVE_NP.
* Musl libc does not provide backtrace(3). Added a feature check for backtrace(3).
* Fixed compiler error.
* Musl libc does not implement backtrace(3). Added platform check for libexecinfo.
* Alpine does not appear to support gcc -pg option. By default (gcc has PIE option enabled) it fails with:
gcc: error: -pie and -pg|p|profile are incompatible when linking
When -fno-PIE and -nopie are used it fails with:
/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find gcrt1.o: No such file or directory
Added gcc -pg platform test and output PROFILING_FLAGS accordingly. Replaced pg var in Makefile with PROFILING_FLAGS.
* fix segfault when TEST_IOCTL_FRIENDLY_TMPDIR is undefined and default candidates are not suitable
* use ASSERT_DOUBLE_EQ instead of ASSERT_EQ
* When compiled with ROCKSDB_MALLOC_USABLE_SIZE UniversalCompactionFourPaths and UniversalCompactionSecondPathRatio tests fail due to premature memtable flushes on systems with 16-byte alignment. Arena runs out of block space before GenerateNewFile() completes.
Increased options.write_buffer_size.
Comparable with Snappy on comp ratio.
Implemented using Windows API, does not require external package.
Avaiable since Windows 8 and server 2012.
Use -DXPRESS=1 with CMake to enable.
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.
Test Plan: Try db_bench and see the stats are printed out.
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56769
Summary: If the user specified a small enough value for the rate limiter's bytes per second, the calculation for the number of refill bytes per period could become zero which would effectively cause the server to hang forever.
Test Plan: Existing tests
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56631
Summary: In option settable tests, bytes for pointers are not all skipped, so that they may be the same as the special character and cause false positive.
Test Plan: Run the test. Manually verify the issue is not there any more.
Reviewers: IslamAbdelRahman, andrewkr
Reviewed By: IslamAbdelRahman
Subscribers: kradhakrishnan, yiwu, yhchiang, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56553
Summary: Test DBOptionsAllFieldsSettable sometimes fails under valgrind. Move option settable tests to a separate test file and disable it in valgrind..
Test Plan: Run valgrind test and make sure the test doesn't run.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: kradhakrishnan, yiwu, yhchiang, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56529
Summary:
- Need to use unsigned long long for 64-bit literals on windows
- Need size_t for backup meta-file length since clang doesn't let us assign size_t to int
Test Plan: backupable_db_test and options_test
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56391
Summary: Cache shard bit 4 is sometimes too small and 6 is a more common value picked by users. Make that default. It shouldn't hurt much to change options.max_file_opening_threads default to be 16, which will reduce the worst case DB open time.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, igor, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D55047
Summary:
Adapted a stderr logger from the option tests. Moved it to a separate
header so we can reuse it, e.g., from ldb subcommands for faster debugging. This
is especially useful to make errors/warnings more visible when running
"ldb repair", which involves potential data loss.
Test Plan:
ran options_test and "ldb repair"
$ ./ldb repair --db=./tmp/
[WARN] **** Repaired rocksdb ./tmp/; recovered 1 files; 588bytes. Some data may have been lost. ****
OK
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56151
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
Summary: Change some RocksDB default options to make it more friendly to server workloads.
Test Plan: Run all existing tests
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55941
Summary:
Expose the inverse of ToString(hex=true) on Slice: Slice::DecodeHex
Refactor the other implementation of to/from hex in ldb_cmd.h to use the Slice
version
(Difference between the 2 is whether 0x is expected/produced in front of the hex
string or not)
Eliminated support for invalid odd length hex string - this is now invalid
instead of having 1/2 byte set
Added (inverse of HexToString) test for LDBCommand::StringToHex which also
indirectly tests Slice::ToString(true)
After moving the original implementation from ldb_cmd.h, updated it to much simpler/efficient version
(originally/inspired from https://github.com/facebook/wdt/blob/master/util/EncryptionUtils.cpp#L140-L169 )
Test Plan: run tests
Reviewers: uddipta, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56121
Summary: Something changed and the special charactor seems to be conflict with an exisitng value. Change it to unblock the build.
Test Plan: Run the test and make sure it passes
Reviewers: kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55845
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.
Test Plan:
Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).
DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
Mac: OK.
Linux: with D55287 patched in it's OK.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54801
Summary:
This fixes a similar issue as D54711: "CURRENT" file can mutate between
GetLiveFiles() and copy to the tmp directory, in which case it would reference
the wrong manifest filename. To fix this, I forge the "CURRENT" file such that
it simply contains the filename for the manifest returned by GetLiveFiles().
- Changed CreateCheckpoint() to forge current file
- Added CreateFile() utility function
- Added test case that rolls manifest during checkpoint creation
Test Plan:
$ ./checkpoint_test
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55065
Summary: Refactored db_bench transaction stress tests so that they can be called from unit tests as well.
Test Plan: run new unit test as well as db_bench
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55203
Summary:
Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error.
I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed.
Test Plan: make check -j32, see tests pass.
Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54705
Summary:
Use pure if-then check instead of assert in EraseColumnFamilyInfo
when the specified column family does not found in the cf_info_map_.
So the second deletion will be no op instead of crash.
Test Plan: existing test.
Reviewers: sdong, anthony, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55023
Summary: There are a few options in struct DBOptions that aren't handled by options_helper.cc. Add those missing options so they can be used by GetDBOptionsFromString() and friends.
Test Plan: Updated options_test.cc, reran all tests.
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54603
Summary: similar to D52809 add option to exclude zero counters.
Test Plan:
[yiwu@dev4504.prn1 ~/rocksdb] ./iostats_context_test
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from IOStatsContextTest
[ RUN ] IOStatsContextTest.ToString
[ OK ] IOStatsContextTest.ToString (0 ms)
[----------] 1 test from IOStatsContextTest (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[ PASSED ] 1 test.
Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54591
Summary:
There was a race condition in the test where the rolling thread
acquired the mutex before the flush thread pinned the logger. Rather than add
more complicated synchronization to fix it, I followed Siying's suggestion to
use SyncPoint in the test code.
Comments in the LoadDependency() invocation explain the reason for each of the
sync points.
Test Plan:
Ran test 1000 times for tsan/asan. Will wait for all sandcastle tests
to finish before committing since this is a tricky test.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54615
Summary:
Relax the check condition of prefix_extractor in CheckOptionsCompatibility
by allowing changing value from non-nullptr to nullptr or nullptr to
non-nullptr.
Test Plan:
options_test
options_util_test
Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, gunnarku
Reviewed By: gunnarku
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54477
Summary: Recent change break thread_local_test by introducing exception, which is disabled in LITE build. Fix it by disabling exception handling in LITE build.
Test Plan: Build with both of LITE and non-LITE
Reviewers: anthony, IslamAbdelRahman, yhchiang, kradhakrishnan, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54513
Summary: I have introduced max_allowed_space_ but did not initialize it
Test Plan: make check
Reviewers: sdong, yhchiang, anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D54357
Summary:
Remove the SyncPoint usage in the destructor of PosixEnv as none
of any active tests is using it.
SyncPoint is a test-only utility class, and it's a static varible.
As a result, using SyncPoint in the destructor of PosixEnv will
make default Env depends on SyncPoint. Removing such dependency
could solve the problem crash issue only reproducable in Mac
environment.
Test Plan: OPT=-DTRAVIS V=1 make -j4 check on Mac environment
Reviewers: sdong, anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54333
Summary:
Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded WriteImpl() will fail and return Status::Aborted()
Test Plan: unit testing
Reviewers: yhchiang, anthony, andrewkr, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D53763
Summary:
For GetLogFileSize() and Flush(), they previously did not follow the
synchronization pattern for accessing logger_. This meant ResetLogger() could
cause logger_ destruction while the unsynchronized functions were accessing it,
causing a segfault.
Also made the mutex instance variable mutable so we can preserve
GetLogFileSize()'s const-ness.
Test Plan:
new test case, it's quite ugly because both threads need to access
one of the functions with SyncPoints (PosixLogger::Flush()), and also special
handling is needed to prevent the mutex and sync points from conflicting.
Reviewers: kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54237
Summary:
Previously compilation failed when ROCKSDB_NO_FBCODE=1 because fcntl.h
wasn't included for open().
Related issue: https://github.com/facebook/rocksdb/issues/977
Test Plan:
verified below command works now:
$ make clean && ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_FALLOCATE=1 make -j32 env_test
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54135
Summary:
When a child thread that uses ThreadLocalPtr, ThreadLocalPtr::OnThreadExit
will be called when that child thread is destroyed. However,
OnThreadExit will try to access a static singleton of ThreadLocalPtr,
which will be destroyed when the main thread exit. As a result,
when a child thread that uses ThreadLocalPtr exits AFTER the main thread
exits, illegal memory access will occur.
This diff includes a test that reproduce this legacy bug.
==2095206==ERROR: AddressSanitizer: heap-use-after-free on address
0x608000007fa0 at pc 0x959b79 bp 0x7f5fa7426b60 sp 0x7f5fa7426b58
READ of size 8 at 0x608000007fa0 thread T1
This patch fix this issue by having the thread local mutex never be deleted
(but will leak small piece of memory at the end.) The patch also describe
a better solution (thread_local) in the comment that requires gcc 4.8.1 and
in latest clang as a future work once we agree to move toward gcc 4.8.
Test Plan:
COMPILE_WITH_ASAN=1 make thread_local_test -j32
./thread_local_test --gtest_filter="*MainThreadDiesFirst"
Reviewers: anthony, hermanlee4, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53013
Summary:
Added this new function, which returns filename, size, and modified
timestamp for each file in the provided directory. The default implementation
retrieves the metadata sequentially using existing functions. In the next diff
I'll make HdfsEnv override this function to use libhdfs's bulk get function.
This won't work on windows due to the path separator.
Test Plan:
new unit test
$ ./env_test --gtest_filter=EnvPosixTest.ConsistentChildrenMetadata
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: IslamAbdelRahman, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53781
Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block
Test Plan: unit tests
Reviewers: yhchiang, anthony, andrewkr, sdong
Reviewed By: sdong
Subscribers: march, dhruba
Differential Revision: https://reviews.facebook.net/D53721
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.
The watermarks are calculated based on slowdown thresholds.
Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.
Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D53409
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size
Test Plan: unit tests
Reviewers: rven, anthony, yhchiang, sdong
Reviewed By: sdong
Subscribers: igor, lovro, march, dhruba
Differential Revision: https://reviews.facebook.net/D50469
Summary: Measuring mutex duration will measure time inside DB mutex, which breaks our best practice. Add a stat level in Statistics class. By default, disable to measure the mutex operations.
Test Plan: Add a unit test to make sure it is off by default.
Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, andrewkr, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53367
Summary:
We can avoid the dependency by forward-declaring ColumnFamilyData and
then treating it as a black box. That means callers of ThreadStatusUtil need to
explicitly provide more options, even if they can be derived from the
ColumnFamilyData, since ThreadStatusUtil doesn't include the definition.
This is part of a series of diffs to eliminate circular dependencies between
directories (e.g., db/* files depending on util/* files and vice-versa).
Test Plan:
$ ./db_test --gtest_filter=DBTest.GetThreadStatus
$ make -j32 commit-prereq
Reviewers: sdong, yhchiang, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53361
Summary:
I split the db-specific test points out into a separate file under db/
directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it
from compiling previously; see https://reviews.facebook.net/D36825.
Test Plan:
compilation works now, below command works, will also run "make xfunc".
$ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32
Reviewers: sdong, yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D53343
Summary: Timing mutex operations can impact scalability of the system. Add a new perf context level that can measure time counters except for mutex.
Test Plan: Add a new unit test case to make sure it is not set.
Reviewers: IslamAbdelRahman, rven, kradhakrishnan, yhchiang, anthony
Reviewed By: anthony
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53199
Summary: Add a test to fail if someone adds a DB options.
Test Plan: Run the test, run the test with valgrind. Add an option to DB option in the middle or in the end and make sure it fails.
Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53097
Summary: OptionsParserTest.BlockBasedTableOptionsAllFieldsSettable is failiong under CLANG. Disable the test to unblock the build.
Test Plan: Run it both of CLANG and GCC
Reviewers: kradhakrishnan, rven, andrewkr, anthony, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53157
Summary: Add a test OptionsParserTest.BlockBasedTableOptionsAdded, which will fail if a new option is added to BlockBasedTableOptions but is not settable through GetBlockBasedTableOptionsFromString().
Test Plan: Run the test. Also manually remove and add options and make sure it fails.
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, rven, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52953
Summary:
In crash test, when coming to each kill point, we start a random class using seed as current second. With this approach, for every second, the random number used is the same. However, in each second, there are multiple kill points with different frequency. It makes it hard to reason about chance of kill point to trigger. With this commit, we use thread local random seed to generate the random number, so that it will take different values per second, hoping it makes chances of killing much easier to reason about.
Also significantly reduce the kill odd to make sure time before kiling is similar as before.
Test Plan: Run white box crash test and see the killing happens as expected and the run time time before killing reasonable.
Reviewers: kradhakrishnan, IslamAbdelRahman, rven, yhchiang, andrewkr, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52971
Makefile adjust paths for solaris build
Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available
db_compaction_test.cc Initialise a variable to avoid a compilation error
db_impl.cc Include <alloca.h>
db_test.cc Include <alloca.h>
Environment.java recognise solaris envrionment
options_bulder.cc Make log unambiguous
geodb_impl.cc Make log and floor unambiguous
Summary: Depending on the order of include paths and versions of various headers we may end up in a situation where we'll encounter a build break caused by redefinition of constants. gcc-4.9-glibc-2.20 header update to include/bits/fcntl-linux.h introduced the definitions of FALLOC_FL_* constants. However, linux/falloc.h from kernel-headers also has FALLOC_FL_* constants defined. Therefore during the compilation we'll get "previously defined" errors.
Test Plan:
Both in the environment where the build break manifests (to make sure that the change fixed the problem) and in the environment where everything builds fine (to make sure that there are no regressions):
make clean
make -j 32
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52821
Summary: In plain table reader's non-mmap mode, we only keep the most recent read buffer. However, for binary search, it is likely we come back to a location to read. To avoid one pread in such a case, we keep two read buffers. It should cover most of the cases.
Test Plan:
1. run tests
2. check the optimization works through strace when running
./table_reader_bench -mmap_read=false --num_keys2=1 -num_keys1=5000 -table_factory=plain_table --iterator --through_db
Reviewers: anthony, rven, kradhakrishnan, igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51171
Summary: TSAN fails on DynamicBloomTest.concurrent_with_perf. This change fixes it. Not sure why though.
Test Plan: Run the test with TSAN and make sure no warning shown.
Reviewers: yhchiang, IslamAbdelRahman, anthony, ngbronson, rven
Reviewed By: rven
Subscribers: rven, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52383
Summary: Adding "__attribute__((__unused__))" after padding fields will pass CLANG build but will fail gcc 4.8.1. Fix it by not generating it under GCC 4.8.1.
Test Plan: Build under four combinations of USE_CLANG=0,1 and ROCKSDB_FBCODE_BUILD_WITH_481=0.1.
Reviewers: yhchiang, rven, ngbronson, anthony, IslamAbdelRahman
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52371
Summary: Fix some CLANG errors introduced in 7d87f02799
Test Plan: Build with both of CLANG and gcc
Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52329
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations. Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention. Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.
Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off). This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex. If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided. This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield). Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.
Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.
This diff was motivated and inspired by Yahoo's cLSM work. It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.
My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1
thread I get ~440Kops/sec. Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.
Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
Differential Revision: https://reviews.facebook.net/D50589
Summary: We now have a mechanism to further slowdown writes. Double default options.delayed_write_rate to try to keep the default behavior closer to it used to be.
Test Plan: Run all tests.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52281
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
Summary: Now ZSTD hard code level 1. Change it to use the compression level setting.
Test Plan: Run it with hacked codes of sst_dump and show ZSTD compression sizes with different levels.
Reviewers: rven, anthony, yhchiang, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52041
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted
ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted
Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.
Benchmark results (using https://phabricator.fb.com/P20083553)
```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G /home/tec/local/normal.4K.Snappy/db10077
// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G /home/tec/local/zero.8K.LZ4/db10077
// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
// --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
// --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"
// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 1.73s 576.97m
// BM_StringPiece 103.74% 1.67s 598.55m
// ============================================================================
// Match rate : 1000000 / 1000000
// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 611.99ms 1.63
// BM_StringPiece 203.76% 300.35ms 3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```
Test Plan: Unit tests
Reviewers: sdong, igor, anthony, yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, lovro, adsharma
Differential Revision: https://reviews.facebook.net/D48999
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
Summary:
By default, RocksDB initializes the singletons of ThreadLocalPtr first, then initializes PosixEnv
via static initializer. Destructor terminates objects in reverse order, so terminating PosixEnv
(calling pthread_mutex_lock), then ThreadLocal (calling pthread_mutex_destroy).
However, in certain case, application might initialize PosixEnv first, then ThreadLocalPtr.
This will cause core dump at the end of the program (eg. https://github.com/facebook/mysql-5.6/issues/122)
This patch fix this issue by ensuring the destruction order by moving the global static singletons
to function static singletons. Since function static singletons are initialized when the function is first
called, this property allows us invoke to enforce the construction of the static PosixEnv and the
singletons of ThreadLocalPtr by calling the function where the ThreadLocalPtr singletons belongs
right before we initialize the static PosixEnv.
Test Plan: Verified in the MyRocks.
Reviewers: yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony, sdong, MarkCallaghan
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51789
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.
Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51117
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.
Test Plan: Add a unit test and run it in valgrind too.
Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51459
Summary:
This patch fix a race condition in persisting options which will cause a crash when:
* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.
Test Plan: Add a test in column_family_test that can reproduce the crash
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51717
Summary: Change to call the new compression function.
Test Plan: build and run db_bench with the compression to make sure it compresses.
Reviewers: anthony, rven, kradhakrishnan, IslamAbdelRahman, igor, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51603
Summary: DBTest.DynamicCompactionOptions ocasionally fails during valgrind run. We sent a sleeping task to block compaction thread pool but we don't wait it to run.
Test Plan: Run the test multiple times in an environment which can cause failure.
Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51687
Summary:
This patch fix a race condition in persisting options which will cause a crash when:
* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.
Test Plan: Add a test in column_family_test that can reproduce the crash
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51609
Summary: This patch moves all posix thread logic to a separate library.
The motivation is to allow another environments to easily reuse posix
threads. HDFS wraps already posix threads; this split would simplify
this code.
Test Plan: No new functionality is added to posix Env or the threading
library, thus the current tests should suffice.
Summary: options.row_cache should already been initialized as null by default. Still try to set it following current convention, because one valgrind failure reports a failure related to it.
Test Plan: Run all unit tests
Reviewers: yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51303
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
* conversion from 'size_t' to 'type', by add static_cast
Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
Summary:
The commit of option helper refactor broken the build:
(1) a git merge problem
(2) some uncaught compiler warning
Fix it.
Test Plan: Make sure "make all" passes
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50943
Summary:
This patch adds OptionsUtil::LoadOptionsFromFile() and
OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
to construct DBOptions and ColumnFamilyOptions from a RocksDB
options file. Note that most pointer-typed options such as
merge_operator will not be constructed.
With this API, developers no longer need to remember all the
options in order to reopen an existing rocksdb instance like
the following:
DBOptions db_options;
std::vector<std::string> cf_names;
std::vector<ColumnFamilyOptions> cf_opts;
// Load primitive-typed options from an existing DB
OptionsUtil::LoadLatestOptionsFromDB(
dbname, &db_options, &cf_names, &cf_opts);
// Initialize necessary pointer-typed options
cf_opts[0].merge_operator.reset(new MyMergeOperator());
...
// Construct the vector of ColumnFamilyDescriptor
std::vector<ColumnFamilyDescriptor> cf_descs;
for (size_t i = 0; i < cf_opts.size(); ++i) {
cf_descs.emplace_back(cf_names[i], cf_opts[i]);
}
// Open the DB
DB* db = nullptr;
std::vector<ColumnFamilyHandle*> cf_handles;
auto s = DB::Open(db_options, dbname, cf_descs,
&handles, &db);
Test Plan:
Augment existing tests in column_family_test
options_test
db_test
Reviewers: igor, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49095
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.
In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.
// If true, then DB::Open / CreateColumnFamily / DropColumnFamily
// / SetOptions will fail if options file is not detected or properly
// persisted.
//
// DEFAULT: false
bool fail_if_missing_options_file;
Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.
Test Plan:
Add options_file_test.
options_test
column_family_test
Reviewers: igor, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48285
in 64-bit.
Currently, a signed off_t type is being used for the following
interfaces for both offset and the length in bytes:
* `Allocate`
* `RangeSync`
On Linux `off_t` is automatically either 32 or 64-bit depending on
the platform. On Windows it is always a 32-bit signed long which
limits file access and in particular space pre-allocation
to effectively 2 Gb.
Proposal is to replace off_t with uint64_t as a portable type
always access files with 64-bit interfaces.
May need to modify posix code but lack resources to test it.
Summary:
Scoped anonymous enums seem to be better supported than static
constexpr at the moment, so this diff replaces the latter with the former.
Also, this diff removes an incorrect inclusion of pthread.h. MSVC build
was broken starting with D50439.
Test Plan:
1. build
2. observe proper skiplist behavior by absence of pathological slowdown
3. push diff to tmp_try_windows branch to tickle AppVeyor
4. wait for contbuild before committing to master
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50517
Summary:
Using a TLS random instance for skiplist makes it smaller
(useful for hash_skiplist_rep) and prepares skiplist for concurrent
adds. This diff also modifies the branching factor math to avoid an
unnecessary division.
This diff has the effect of changing the sequence of skip list node
height choices made by tests, so it has the potential to cause unit
test failures for tests that implicitly rely on the exact structure
of the skip list. Tests that try to exactly trigger a compaction are
likely suspects for this problem (these tests have always been brittle to
changes in the skiplist details). I've minimizes this risk by reseeding
the main thread's Random at the beginning of each test, increasing the
universal compaction size_ratio limit from 101% to 105% for some tests,
and verifying that the tests pass many times.
Test Plan: for i in `seq 0 9`; do make check; done
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50439
Enable C4307 'operator' : integral constant overflow
Longs and ints on Windows are 32-bit hence the overflow
Enable C4309 'conversion' : truncation of constant value
Enable C4512 'class' : assignment operator could not be generated
Enable C4701 Potentially uninitialized local variable 'name' used
Summary:
MyRocks testing found an issue that while iterating over keys
that are outside the prefix, sometimes wrong results were seen for keys
outside the prefix. We now tighten the range of keys seen with a new
read option called prefix_seen_at_start. This remembers the starting
prefix and then compares it on a Next for equality of prefix. If they
are from a different prefix, it sets valid to false.
Test Plan: PrefixTest.PrefixValid
Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony
Reviewed By: anthony
Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D50211
Summary: Fix build for clang
Test Plan:
USE_CLANG=1 make all -j64
make clean
make check -j64
Reviewers: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50217
Summary:
This patch introduces OptionsSanityCheckLevel internally to enable
sanity check rocksdb options.
Utilities API will be added in the follow-up diffs.
Test Plan: Added more tests in options_test
Reviewers: igor, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49515
C4101 'identifier' : unreferenced local variable
C4189 'identifier' : local variable is initialized but not referenced
C4100 'identifier' : unreferenced formal parameter
C4296 'operator' : expression is always false
Summary: new_cf_opt.table_factory->Name() is char*, ASSERT_EQ doesn't work with char* directly. Construct a string using it.
Test Plan: Run the test that failed.
Reviewers: igor, kradhakrishnan, rven, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49767
Summary:
CreateLoggerFromOptions have some parameters like db_log_dir and env, these parameters are redundant since they already exist in DBOptions
this patch remove the redundant parameters and expose CreateLoggerFromOptions to users
Test Plan: make check
Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D49713
Summary: Run "make format" for some recent commits.
Test Plan: Build and run tests
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49707
Summary: include/posix/io_posix.h is not a public API. Although include/posix/ is not a public header directory, it is confusing to put non-public headers to under include/. Move it to util/ to be clearer.
Test Plan: Run all tests
Reviewers: rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49611
Summary: It looks like WritableFileWriter::Append() was returning OK() even when there is an error
Test Plan: make check
Reviewers: sdong, yhchiang, anthony, rven, kradhakrishnan, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D49569
Summary: include/posix/io_posix.h should not depend on ROCKSDB_FALLOCATE_PRESENT. Remove it.
Test Plan: Build it with both of ROCKSDB_FALLOCATE_PRESENT defined and not defined.
Reviewers: rven, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49563
introduce a new DBOption random_access_max_buffer_size to limit
the size of the random access buffer used for unbuffered access.
Implement read ahead buffering when enabled.
To that effect propagate compaction_readahead_size and the new option
to the env options to make it available for the implementation.
Add Hint() override so SetupForCompaction() call would call Hint()
readahead can now be setup from both Hint() and EnableReadAhead()
Add new option random_access_max_buffer_size support
db_bench, options_helper to make it string parsable
and the unit test.
Summary: Mac build breaks as include/posix/io_posix.h doesn't include errno. Move the exact function declaration to io_posix.cc
Test Plan: Run all test. Will run on Mac
Reviewers: rven, anthony, yhchiang, IslamAbdelRahman, igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49551
Summary: crash_test still has a very low chance to hit some crash point. Have another mode for covering them more likely.
Test Plan: Run crash_test and see db_stress is called with expected prameters.
Reviewers: kradhakrishnan, igor, anthony, rven, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49473
Summary: IO Posix depends on too many .h files. Move most of them to .cc files.
Test Plan: make all
Reviewers: anthony, rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49479
Summary: This patch splits the posix storage backend into Env and
the actual *File implementations. The motivation is to allow other Envs
to use posix as a library. This enables a storage backend different from
posix to split its secondary storage between a normal file system
partition managed by posix, and it own media.
Test Plan: No new functionality is added to posix Env or the library,
thus the current tests should suffice.
Summary: We don't yet have a CI build for iOS, so our iOS compile gets broken sometimes. Most of the errors are from assumption that size_t is 64-bit, while it's actually 32-bit on some (all?) iOS platforms. This diff fixes the compile.
Test Plan:
TARGET_OS=IOS make static_lib
Observe there are no warnings
Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49029
Add an environment method to reuse an existing file. Provide a generic
implementation that does a simple rename + open (writeable), and also a
posix variant that is more careful about error handling (if we fail to
open, do not rename, etc.).
Signed-off-by: Sage Weil <sage@redhat.com>
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.
Test Plan: ran the GetPropertiesOfTablesInRange
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48651
Summary:
Add kill points in:
1. after creating a file
2. before writing a manifest record
3. before syncing manifest
4. before creating a new current file
5. after creating a new current file
Test Plan: Run all current tests.
Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48855
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48915
Summary:
Give a name for every kill point, and allow users to disable some kill points based on prefixes. The kill points can be passed by db_stress through a command line paramter. This provides a way for users to boost the chance of triggering low frequency kill points
This allow follow up changes in crash test scripts to improve crash test coverage.
Test Plan:
Manually run db_stress with variable values of --kill_random_test and --kill_prefix_blacklist. Like this:
--kill_random_test=2 --kill_prefix_blacklist=Posix,WritableFileWriter::Append,WritableFileWriter::WriteBuffered,WritableFileWriter::Sync
Reviewers: igor, kradhakrishnan, rven, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48735
Summary: As part of cleaning up dependencies for tech debt week, we are moving ldb and sst_dump tools from util to tools, since they are tools.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48747
Summary:
Two fixes:
1. Wait compaction after generating each L0 file so that we are sure there are one L0 file left.
2. https://reviews.facebook.net/D48423 increased from 500 keys to 700 keys but in verification phase we are still querying the first 500 keys. It is a bug to fix.
Test Plan: Run the test in the same environment that fails by chance of one in tens of times. It doesn't fail after 1000 times.
Reviewers: yhchiang, IslamAbdelRahman, igor, rven, kradhakrishnan
Reviewed By: rven, kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48759
Summary: manual_compaction_test.cc incorrectly in util. Moved to db.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48687
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
In the current implementation, perf_context.db_mutex_lock_nanos and
perf_context.db_condition_wait_nanos also include the mutex-wait time
other than DB Mutex.
This patch fix this issue by incrementing the counters only when it detects
a DB mutex.
Test Plan: perf_context_test
Reviewers: anthony, IslamAbdelRahman, sdong, igor
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48555
Summary:
Long time ago we add InternalDumpCommand to ldb_tool https://reviews.facebook.net/D11517
This command is using TEST_NewInternalIterator although it's not a test. This patch move TEST_NewInternalIterator outside of db_impl_debug.cc
Test Plan:
make check
make static_lib
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48561
Summary:
As part of tech debt week, we are cleaning up dependencies.
This diff moves db_test_util.[h,cc] from util to db directory.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48543
Summary: Fixed an incorrect replace of const value in util/options_helper.cc
Test Plan: options_test
Reviewers: igor, sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48513
Summary:
Introduce TableOptions section and support BlockBasedTable in RocksDB
options file. A TableOptions section has the following format:
[TableOptions/<FactoryClassName> "<ColumnFamily Name>"]
which includes information about its TableFactory class and belonging
column family. Below is an example TableOptions section of a
BlockBasedTableOptions that belongs to the default column family:
[TableOptions/BlockBasedTable "default"]
format_version=0
whole_key_filtering=true
block_size_deviation=10
block_size=4096
block_restart_interval=16
filter_policy=nullptr
no_block_cache=false
checksum=kCRC32c
cache_index_and_filter_blocks=false
index_type=kBinarySearch
hash_index_allow_collision=true
flush_block_policy_factory=FlushBlockBySizePolicyFactory
Currently, Cache-type options (i.e., block_cache and block_cache_compressed)
are not supported.
Test Plan: options_test
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48435
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.
Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48411
Summary: Add 2 new counters BLOCK_CACHE_BYTES_WRITE, BLOCK_CACHE_BYTES_READ to keep track of how many bytes were written to the cache and how many bytes that we read from cache
Test Plan: make check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48195
Summary:
hit and miss bloom filter stats for memtable and SST
stats added to perf_context struct
key matches and prefix matches combined into one stat
Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47859
Summary:
If a platform doesn't have ROCKSDB_FALLOCATE_PRESENT, then compiler complains:
util/env_posix.cc:354:8: error: private field 'allow_fallocate_' is not used [-Werror,-Wunused-private-field]
This was caught by travis.
Test Plan: compiles with ROCKSDB_FALLOCATE_PRESENT.
Reviewers: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48327
Summary:
Added boolean variable to guard fallocate() calls.
Set to false to prevent space leaks when tests fail.
Test Plan:
Compliles
Set to false and ran log device tests
Reviewers: sdong, lovro, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48027
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.
The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)
Do we also need a compaction filter to get called on merge results?
Test Plan: make && make check
Reviewers: lovro, tnovak, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: noetzli, kolmike, leveldb, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D47847
Summary:
Handle SST files with both ".sst" and ".ldb" suffix.
This enables user to migrate from leveldb to rocksdb.
Test Plan:
Added unit test with DB operating on SSTs with names schema.
See db/dc_test.cc:SSTsWithLdbSuffixHandling for details
Reviewers: yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48003
Summary:
RocksDBOptionsParser now supports CompressionType and the following
pointer-typed options in RocksDBOptionParser
for sanity check:
prefix_extractor
table_factory
comparator
compaction_filter
compaction_filter_factory
merge_operator
memtable_factory
In the RocksDB Options file, only high level information about pointer-typed
options are serialized, and those information is only used for verification
/ sanity check purpose.
Test Plan: added more tests in options_test
Reviewers: igor, IslamAbdelRahman, sdong, anthony
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47925
This commit adds two new targets to the Makefile: rocksdb.cc and rocksdb.h
These files, when combined with the c.h header, are a self-contained RocksDB
source distribution called an amalgamation. (The name comes from SQLite's, which
is similar in concept.)
The main benefit of an amalgamation is that it's very easy to drop into a
new project. It also compiles faster compared to compiling individual source
files and potentially gives the compiler more opportunity to make optimizations
since it can see all functions at once.
rocksdb.cc and rocksdb.h are generated by a new script, amalgamate.py.
A detailed description of how amalgamate.py works is in a comment at the top of
the file.
There are also some small changes to existing files to enable the amalgamation:
* Use quotes for includes in unity build
* Fix an old header inclusion in util/xfunc.cc
* Move some includes outside ifdef in util/env_hdfs.cc
* Separate out tool sources in Makefile so they won't be included in unity.cc
* Unity build now produces a static library
Closes#733
Summary: Add a missing check for deprecated options in options_helper.cc
Test Plan: options_test
Reviewers: sdong, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47793
Summary:
Previously, we treat deprecated options as normal options in
RocksDBOptionsParser. However, these deprecated options should
not be verified and serialized.
Test Plan: options_test
Reviewers: igor, sdong, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47775
Summary:
Fixed the following compile warning in options_test.cc under clang
util/options_test.cc:94:12: error: 'Skip' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
Status Skip(uint64_t n) {
^
./include/rocksdb/env.h:368:18: note: overridden virtual function is here
virtual Status Skip(uint64_t n) = 0;
^
Test Plan: options_test
Reviewers: igor, sdong, anthony, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47763
Summary:
This patch defines the format of RocksDB options file, which
follows the INI file format, and implements functions for its
serialization and deserialization. An example RocksDB options
file can be found in examples/rocksdb_option_file_example.ini.
A typical RocksDB options file has three sections, which are
Version, DBOptions, and more than one CFOptions. The RocksDB
options file in general follows the basic INI file format
with the following extensions / modifications:
* Escaped characters
We escaped the following characters:
- \n -- line feed - new line
- \r -- carriage return
- \\ -- backslash \
- \: -- colon symbol :
- \# -- hash tag #
* Comments
We support # style comments. Comments can appear at the ending
part of a line.
* Statements
A statement is of the form option_name = value.
Each statement contains a '=', where extra white-spaces
are supported. However, we don't support multi-lined statement.
Furthermore, each line can only contain at most one statement.
* Section
Sections are of the form [SecitonTitle "SectionArgument"],
where section argument is optional.
* List
We use colon-separated string to represent a list.
For instance, n1:n2:n3:n4 is a list containing four values.
Below is an example of a RocksDB options file:
[Version]
rocksdb_version=4.0.0
options_file_version=1.0
[DBOptions]
max_open_files=12345
max_background_flushes=301
[CFOptions "default"]
[CFOptions "the second column family"]
[CFOptions "the third column family"]
Test Plan: Added many tests in options_test.cc
Reviewers: igor, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46059
Summary: DeleteSchedulerTests is running the same test with different rates, After the first iteraton sync points become useless because ClearTrace was not being called
Test Plan: Run the test
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D47709
Summary:
Fixed the compile error in util/arena.h caused by not
including TLB related header.
Test Plan: make db_stress
Reviewers: igor, sdong, anthony, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47697
Summary:
Fixed the following compile warning when hugetlb is not supported.
./util/arena.h:102:10: error: private field 'hugetlb_size_' is not used [-Werror,-Wunused-private-field]
size_t hugetlb_size_ = 0;
^
1 error generated.
Test Plan: make db_stress
Reviewers: igor, sdong, anthony, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47691
Summary: AllocateFromHugePage() can return nullptr, and then we need to try to allocate the block with AllocateNewBlock()
Test Plan: arena_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47607
The previous memory allocation procedures tried to allocate memory
via `new` or `mmap` and inserted the pointer to the memory into an
std::vector afterwards. In case `new` or `mmap` threw or returned
a nullptr, no memory was leaking. If `new` or `mmap` worked ok, the
following `vector::push_back` could still fail and throw an exception.
In this case, the memory just allocated was leaked.
The fix is to reserve space in the target memory pointer block
beforehand. If this throws, then no memory is allocated nor leaked.
If the reserve works but the actual allocation fails, still no
memory is leaked, only the target vector will have space for at
least one more element than actually required (but this may be
reused for the next allocation)
Summary: As title
Test Plan: make check
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46983
Summary:
Fix hex2String performance issues by removing sscanf dependency.
Also fixed some edge case handling (odd length, bad input).
Test Plan: Created a test file which called old and new implementation, and validated results are the same. I'll paste results in the phabricator diff.
Reviewers: igor, rven, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: thatsafunnyname, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D46785
Summary:
PlainTableReader now only allows mmap-mode. Add the support to non-mmap mode for more flexibility.
Refactor the codes to move all logic of reading data to PlainTableKeyDecoder, and consolidate the calls to Read() call and ReadVarint32() call. Implement the calls for both of mmap and non-mmap case seperately. For non-mmap mode, make copy of keys in several places when we need to move the buffer after reading the keys.
Test Plan: Add the mode of non-mmap case in plain_table_db_test. Run it in valgrind mode too.
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D47187
Summary: RandomAccessFileReader unnecessarily inherited RandomAccessFile, which can introduce unnecessarily extra costs. Remove it.
Test Plan: Run all existing tests
Reviewers: yhchiang, anthony, igor, kradhakrishnan, rven, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D47409
Summary:
Add options.compaction_pri, which specifies the policy about which file to compact first.
kCompactionPriByLargestSeq will compact oldest files first.
Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable.
Test Plan: Will write unit tests
Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45951
Summary:
This patch fixes#7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
Summary: Fix two constants in WaitFor() that multiply a value with 000 instead of 1000.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, aekmekji, sdong, MarkCallaghan, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47091
Summary: Change the log level of DB start-up log from Warn to Header.
Test Plan: db_bench and observe the LOG header
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47067
Summary: If we skip a test, we shouldn't mark `make check` as failure. This fixes travis CI test.
Test Plan: Travis CI
Reviewers: noetzli, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47031
Summary:
RocksDB options can be dumped to the log file, and
up to this point the max_subcompactions option was not included
in this dump. This fixes that.
Test Plan: makek all && make check
Reviewers: MarkCallaghan, igor, noetzli, anthony, yhchiang, sdong
Reviewed By: yhchiang, sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46971
Summary:
This patch finally fixes the ColumnFamilyTest.ReadDroppedColumnFamily test. The test has been failing very sporadically and it was hard to repro. However, I managed to write a new tests that reproes the failure deterministically.
Here's what happens:
1. We start the flush for the column family
2. We check if the column family was dropped here: a3fc49bfdd/db/flush_job.cc (L149)
3. This check goes through, ends up in InstallMemtableFlushResults() and it goes into LogAndApply()
4. At about this time, we start dropping the column family. Dropping the column family process gets to LogAndApply() at about the same time as LogAndApply() from flush process
5. Drop column family goes through LogAndApply() first, marking the column family as dropped.
6. Flush process gets woken up and gets a chance to write to the MANIFEST. However, this is where it gets stuck: a3fc49bfdd/db/version_set.cc (L1975)
7. We see that the column family was dropped, so there is no need to write to the MANIFEST. We return OK.
8. Flush gets OK back from LogAndApply() and it deletes the memtable, thinking that the data is now safely persisted to sst file.
The fix is pretty simple. Instead of OK, we return ShutdownInProgress. This is not really true, but we have been using this status code to also mean "this operation was canceled because the column family has been dropped".
The fix is only one LOC. All other code is related to tests. I added a new test that reproes the failure. I also moved SleepingBackgroundTask to util/testutil.h (because I needed it in column_family_test for my new test). There's plenty of other places where we reimplement SleepingBackgroundTask, but I'll address that in a separate commit.
Test Plan:
1. new test
2. make check
3. Make sure the ColumnFamilyTest.ReadDroppedColumnFamily doesn't fail on Travis: https://travis-ci.org/facebook/rocksdb/jobs/79952386
Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46773
Summary:
The default behavior for atomic operations is sequentially consistent ordering
which is not needed for simple counters (see:
http://en.cppreference.com/w/cpp/atomic/memory_order). Change the memory order
to std::memory_order_relaxed for better performance.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, aekmekji, sdong, MarkCallaghan, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46953
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.
Test Plan: Add unit test DBTest.HardLimit
Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45999
Summary:
Commit c67d206898 did not fix all test conditions
which use Arena::MemoryAllocatedBytes() (see Travis failure
https://travis-ci.org/facebook/rocksdb/jobs/79957700). The assumption of that
commit was that aligned allocations do not call Arena::AllocateNewBlock(), so
malloc_usable_block_size() would not be used for Arena::MemoryAllocatedBytes().
However, there is a code path where Arena::AllocateAligned() calls
AllocateFallback() which in turn calls Arena::AllocateNewBlock(), so
Arena::MemoryAllocatedBytes() may return a greater value than expected even for
aligned requests.
Test Plan: make arena_test && ./arena_test
Reviewers: rven, anthony, yhchiang, aekmekji, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46869
Summary:
ArenaTest.MemoryAllocatedBytes on Travis failed:
https://travis-ci.org/facebook/rocksdb/jobs/79887849 . This is probably due to
malloc_usable_size() returning a value greater than the requested size. From
the man page:
The value returned by malloc_usable_size() may be greater than the requested
size of the allocation because of alignment and minimum size constraints.
Although the excess bytes can be overwritten by the application without ill
effects, this is not good programming practice: the number of excess bytes
in an allocation depends on the underlying implementation.
Test Plan: make arena_test && ./arena_test
Reviewers: rven, anthony, yhchiang, aekmekji, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46743
Summary:
Refactoring NewTableReader to accept TableReaderOptions
This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071
Test Plan: run existing tests
Reviewers: igor, yhchiang, anthony, rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46179
Summary. A change https://reviews.facebook.net/differential/diff/224721/
Has attempted to move common functionality out of platform dependent
code to a new facility called file_reader_writer.
This includes:
- perf counters
- Buffering
- RateLimiting
However, the change did not attempt to refactor Windows code.
To mitigate, we introduce new quering interfaces such as UseOSBuffer(),
GetRequiredBufferAlignment() and ReaderWriterForward()
for pure forwarding where required.
Introduce WritableFile got a new method Truncate(). This is to communicate
to the file as to how much data it has on close.
- When space is pre-allocated on Linux it is filled with zeros implicitly,
no such thing exist on Windows so we must truncate file on close.
- When operating in unbuffered mode the last page is filled with zeros but we still want to truncate.
Previously, Close() would take care of it but now buffer management is shifted to the wrappers and the file has
no idea about the file true size.
This means that Close() on the wrapper level must always include
Truncate() as well as wrapper __dtor should call Close() and
against double Close().
Move buffered/unbuffered write logic to the wrapper.
Utilize Aligned buffer class.
Adjust tests and implement Truncate() where necessary.
Come up with reasonable defaults for new virtual interfaces.
Forward calls for RandomAccessReadAhead class to avoid double
buffering and locking (double locking in unbuffered mode on WIndows).
Summary:
CompressionTypeSupported was returning LZ4_Supported() for
kZSTDNotFinalCompression. This patch changes it to ZSTD_Supported().
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46521
Summary:
Clang expects %llu for uint64_t, while gcc expects %lu. Replaced the format
specifier with a format macro. This should fix the build on gcc and Clang.
Test Plan: Build on gcc and clang.
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46431
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46233
Summary:
Added tests for two LDBCommands namely i) ManifestDumpCommand and ii) ListColumnFamiliesCommand.
+ Minor fix in the sscanf formatter (along relace C cast with C++ cast) + replacing localtime with localtime_r which is thread safe.
Test Plan: make all && ./tools/ldb_test.py
Reviewers: anthony, igor, IslamAbdelRahman, kradhakrishnan, lgalanis, rven, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45819
Summary: BackupRateLimiter removed and uses replaced with the existing GenericRateLimiter
Test Plan:
make all check
make clean
USE_CLANG=1 make all
make clean
OPT=-DROCKSDB_LITE make release
Reviewers: leveldb, igor
Reviewed By: igor
Subscribers: igor, dhruba
Differential Revision: https://reviews.facebook.net/D46095
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).
Test Plan: make clean check all
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45993
Summary:
Fixed the following compile warning in linux32 environment.
==> linux32: util/sst_dump_tool.cc: In member function ‘int
rocksdb::SstFileReader::ShowAllCompressionSizes(size_t)’:
==> linux32: util/sst_dump_tool.cc:167:50: warning: format ‘%lu’ expects
argument of type ‘long unsigned int’, but argument 3 has type
‘size_t {aka unsigned int}’ [-Wformat=]
==> linux32: fprintf(stdout, "Block Size: %lu\n", block_size);
Test Plan: make sst_dump
Reviewers: anthony, IslamAbdelRahman, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45885
Summary: arena_test is failing with glibc-2.17. Make it more robust
Test Plan: Run arena_test using both of glibc-2.17 and 2.2 and make sure both passes.
Reviewers: yhchiang, rven, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45879
Summary: Provide a way to specify a detailed static error message for a Status without incurring a memcpy. Let me know what people think of this approach.
Test Plan: added simple test
Reviewers: igor, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44259
Summary:
Now that the approach to parallelizing L0-L1 level-based
compactions by breaking the compaction job into subcompactions is
being extended to apply to universal compactions as well, the unit
tests need to account for this and run the universal compaction
tests with subcompactions both enabled and disabled.
Test Plan: make all && make check
Reviewers: sdong, igor, noetzli, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45657
Summary: malloc_usable_size() gets a better estimation of memory usage. It is already used to calculate block cache memory usage. Use it in arena too.
Test Plan: Run all unit tests
Reviewers: anthony, kradhakrishnan, rven, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43317
Summary: Add ZSTD compression type. The same way as adding LZ4.
Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.
Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45747
Summary:
This patch adds the helper functions and variables to allow a backend
implementing WritableFile to support direct IO when persisting a
memtable.
Test Plan:
Since there is no upstream implementation of WritableFile supporting
direct IO, the new behavior is disabled.
Tests should be provided by the backend implementing WritableFile.
Summary:
This patch adds GetStringFromColumnFamilyOptions(), the inverse function
of the existing GetColumnFamilyOptionsFromString(), and improves
the implementation of GetColumnFamilyOptionsFromString().
Test Plan: Add a test in options_test.cc
Reviewers: igor, sdong, anthony, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: noetzli, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45009
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.
We add ReadaheadRandomAccessFile layer only when file is read during compactions.
D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.
Test Plan: make check
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45123
Summary:
Currently, mmap returns IOError when user tries to read data past the end of the file. This diff changes the behavior. Now, we return just the bytes that we can, and report the size we returned via a Slice result. This is consistent with non-mmap behavior and also pread() system call.
This diff is taken out of D45123.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45645
Summary:
See task #7983654. The example was triggering an assert in compaction job
because the compaction was not marked as manual. With this patch,
CompactionPicker::FormCompaction() marks compactions as manual. This patch
also fixes a couple of typos, adds optimistic_transaction_example to
.gitignore and librocksdb as a dependency for examples. Adding librocksdb as
a dependency makes sure that the examples are built with the latest changes
in librocksdb.
Test Plan: make clean && cd examples && make all && ./compact_files_example
Reviewers: rven, sdong, anthony, igor, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45117
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.
I also added a --subcompactions option to db_bench
Test Plan: make all make check
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45069
Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.
Test Plan: Add the option.
Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang
Reviewed By: igor
Subscribers: igor, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43311
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately
Test Plan: added a unittest
Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: jeanxu2012, dhruba
Differential Revision: https://reviews.facebook.net/D44955
Summary:
This patch implements DBOptions deserialization and improve
the current implementation of DBOptions serialization by
using a static structure that stores the offset of each
DBOptions member variables to perform serialization and
deserialization instead of using tons of if-then-branch
to determine the mapping between string and variables.
Test Plan: Added test in options_test.cc
Reviewers: igor, anthony, sdong, IslamAbdelRahman
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44097
Summary:
HashCuckooRep::ApproximateMemoryUsage() previously return
std::numeric_limits<size_t>::max() when it cannot accept more
entries. This patch makes it return a more reasonable estimation.
This change is necessary in order to make GetIntProperty("rocksdb.cur-size-all-mem-tables")
handles HashCuckooRep properly in diff https://reviews.facebook.net/D44229.
Test Plan: db_test
Reviewers: sdong, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44241
Summary:
In prepration for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.
Test Plan: ./db_test ./db_compaction_test ./compaction_job_test
Reviewers: sdong, anthony, igor, noetzli, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43239
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.
Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentially producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.
Test Plan: make clean all check
Reviewers: rven, sdong, yhchiang, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43353
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
Summary:
Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:
2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}
Add two more counters in iostats_context.
Also add a parameter of db_bench.
Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44115
Summary: RandomRWFile is not used anywhere in out code base, this patch remove RandomRWFile
Test Plan:
make check -j64
USE_CLANG=1 make all -j64
OPT=-DROCKSDB_LITE make release -j64
Reviewers: sdong, yhchiang, anthony, kradhakrishnan, rven, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44091
Summary:
Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it.
MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.
Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable.
Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.
Test Plan: Unit tests, db_bench parallel testing.
Reviewers: igor, rven, sdong, yhchiang, yoshinorim
Reviewed By: sdong
Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40869
Summary:
This patch fixes the following clang-build error in util/thread_local.cc by using a cleaner macro blocker:
12:26:31 util/thread_local.cc:157:19: error: declaration shadows a static data member of 'rocksdb::ThreadLocalPtr::StaticMeta' [-Werror,-Wshadow]
12:26:31 ThreadData* tls_ =
12:26:31 ^
12:26:31 util/thread_local.cc:19:66: note: previous declaration is here
12:26:31 __thread ThreadLocalPtr::ThreadData* ThreadLocalPtr::StaticMeta::tls_ = nullptr;
12:26:31 ^
Test Plan: db_test
Reviewers: sdong, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44043
Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)
Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43755
Summary:
This patch provides a simplier solution to the memory leak
issue identified in patch https://reviews.facebook.net/D43677,
where a static function local variable can be used instead of
using a global static unique_ptr.
Test Plan: run db_stress on mac
Reviewers: igor, sdong, anthony, IslamAbdelRahman, maykov
Reviewed By: maykov
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43995
Summary:
wal_recovery_mode setting was not written to LOG. This diff
adds the log message
Test Plan: manually checked
Reviewers: kradhakrishnan, sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43953
Commit 257ee89 added a static destruction helper to avoid notional
"leaks" of TLS on main thread exit. This helper fails to compile on
OS X (and presumably Windows, though I haven't checked), which lacks
the __thread storage class StaticMeta::tls_ member.
This patch fixes the builds. Do note that the static cleanup mechanism
may be somewhat brittle and atexit(3) may be a more suitable approach
to releasing the main thread's TLS if it's highly desirable for this
memory to not be reported "reachable" by Valgrind at exit.
Summary:
MyRocks valgrind run was showing memory leaks. The fixes are mostly self-explaining.
There is only a single usage of ThreadLocalPtr. Potentially, we may think about replacing this use with thread_local, but it will be a bigger change. Another option to consider is using thread_local instead of __thread in ThreadLocalPtr implementation. This way, tls_ can be stored using std::unique_ptr and no destructor would be required.
Test Plan:
- make check
- MyRocks valgrind run doesn't report leaks
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43677
Summary: Update DeleteScheduler tests so that they verify the used penalties for waiting instead of measuring the time spent which is not reliable
Test Plan:
make -j64 delete_scheduler_test && ./delete_scheduler_test
COMPILE_WITH_TSAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test
COMPILE_WITH_ASAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test
make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
COMPILE_WITH_TSAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
COMPILE_WITH_ASAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43635
Summary:
Add two unit tests for SyncWAL(). One makes sure SyncWAL() doesn't block writes in the other thread. Another one makes sure SyncWAL() doesn't wait ongoing writes to finish before being executed.
Create a new test file db_wal_test and move two WAL related tests from db_test to here.
Test Plan: Run the new tests
Reviewers: IslamAbdelRahman, rven, kradhakrishnan, kolmike, tnovak, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43605
Summary: Measure read latency histogram and put in statistics. Compaction inputs are excluded from it when possible (unfortunately usually no possible as we usually take table reader from table cache.
Test Plan:
Run db_bench and it shows the stats, like:
rocksdb.sst.read.micros statistics Percentiles :=> 50 : 1.238522 95 : 2.529740 99 : 3.912180
Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, MarkCallaghan, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43275
Summary: Fixing TSAN false positive and relaxing the conditions when we are running under TSAN
Test Plan: COMPILE_WITH_TSAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43593
Summary:
While working on https://reviews.facebook.net/D43179 , I found
duplicate code in the tests. This patch removes it.
Test Plan: make clean all check
Reviewers: igor, sdong, rven, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43263
Summary:
Subj. We really need this feature.
Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.
Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba
Differential Revision: https://reviews.facebook.net/D40905
Summary:
Updated DBTest DBCompactionTest and CompactionJobStatsTest
to run compaction-related tests once with subcompactions enabled and
once disabled using the TEST_P test type in the Google Test suite.
Test Plan: ./db_test ./db_compaction-test ./compaction_job_stats_test
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43443
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.
I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile
Test Plan:
added delete_scheduler_test
existing unit tests
Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43221
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.
This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.
Test Plan: Add DBCompactionTest.SkipStatsUpdateTest
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42843
Summary: So I took a look and I used a pointer to TableBuilder. Changed it to a unique_ptr. I think this should work, but I cannot run valgrind correctly on my local machine to test it.
Test Plan: Run valgrind, but it's not working locally. It says I'm executing an unrecognized instruction.
Reviewers: yhchiang
Subscribers: dhruba, sdong
Differential Revision: https://reviews.facebook.net/D43485
Summary: In "sst_dump --show_compression_sizes", a reference of CompressionOptions is kept in TableBuilderOptions, which is destroyed later, causing a memory issue.
Test Plan: Run valgrind against SSTDumpToolTest.CompressedSizes and make sure it is fixed
Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43497
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.
This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.
The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are
# Correct aggregation of compaction job statistics across multiple threads
# Proper opening/closing of output files (make sure each thread's is unique)
# Keys that span multiple L1 files
# Skewed distributions of keys within L0 files
Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test
Reviewers: igor, noetzli, anthony, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42699
Summary: Now ldb dump_manifest refuses to work if there are 20 levels. Extend the limit to 64.
Test Plan: Run the tool with 20 number of levels
Reviewers: kradhakrishnan, anthony, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42879
Summary:
sst_dump_tool contains two instances of `fprintf`s where the `format` argument is not
a string literal. This prevents the code from compiling with some compilers/compiler
options because of the potential security risks associated with printing non-literals.
Test Plan: make all
Reviewers: rven, igor, yhchiang, sdong, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43305
Summary:
Added a new feature to sst_dump_tool.cc to allow a user to see the sizes of the different compression algorithms on an .sst file.
Usage:
./sst_dump --file=<filename> --show_compression_sizes
./sst_dump --file=<filename> --show_compression_sizes --set_block_size=<block_size>
Note: If you do not set a block size, it will default to 16kb
Test Plan: manual test and the write a unit test
Reviewers: IslamAbdelRahman, anthony, yhchiang, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D42963
Summary:
For task #7771355, we would like to log the number of corrupt keys
during a compaction. This patch implements and tests the count
as part of CompactionJobStats.
Test Plan: make && make check
Reviewers: rven, igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42921
Summary: Fix for universal compaction with trivial move, when the ouput level is 0. The tests where failing. Fixed by allowing normal compaction when output level is 0.
Test Plan: modified test cases run successfully.
Reviewers: sdong, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: anthony, kradhakrishnan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42933
* std::chrono does not provide enough granularity for microsecs and periodically emits
duplicates
* the bug is manifested in log rotation logic where we get duplicate
log file names and loose previous log content
* msvc does not imlement COW on std::strings adjusted the test to use
refs in the loops as auto does not retain ref info
* adjust auto_log rotation test with Windows specific command to remove
a folder. The test previously worked because we have unix utils installed
in house but this may not be the case for everyone.
Summary: https://reviews.facebook.net/D42321 has left PosixMmapFile in some weird state. This diff removes pending_sync_ that was now unused, fixes indentation and prevents Fsync() from calling both fsync() and fdatasync().
Test Plan: `make -j check`
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D42885
Summary: Directly using TMPDIR can cause problems when running tests using parallel option. Fix them.
Test Plan: Run all tests in parallel
Reviewers: kradhakrishnan, yhchiang, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42807
Summary:
From other ones' investigation:
"sync_file_range() behavior highly depends on kernel version and filesystem.
xfs does neighbor page flushing outside of the specified ranges. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing page #3 to #4, but also flushing many more dirty pages (i.e. up to page#16)... Ranges of the sync_file_range() should be far enough from write() offset (at least 1MB)."
Test Plan: make all check
Reviewers: igor, rven, kradhakrishnan, yhchiang, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: yoshinorim, MarkCallaghan, sumeet, domas, dhruba, leveldb, ljin
Differential Revision: https://reviews.facebook.net/D15807
Summary: Add new CheckFileExists method. Considered changing the FileExists api but didn't want to break anyone's builds.
Test Plan: unit tests
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42003
Summary:
Skipping these tests in ROCKSDB_LITE since they are not supported
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test
Test Plan:
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test
Reviewers: igor, sdong, yhchiang, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D42573
Summary: Make mock_env_test runnable in ROCKSDB_LITE
Test Plan: mock_env_test
Reviewers: igor, sdong, yhchiang, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D42585
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
Summary: CYGWIN avoided fread_unlocked in a wrong way. Fix it to the standard way.
Test Plan: Run tests
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42549
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.
We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.
Test Plan: make check (there might be some unit tests that depend on this behavior)
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41931
Summary:
ROCKSDB_WARNING is only defined if either ROCKSDB_PLATFORM_POSIX or OS_WIN is defined. This works well for building rocksdb with its own build scripts. But this won't work when an outside project(like mongodb) doesn't define ROCKSDB_PLATFORM_POSIX.
This fix defines ROCKSDB_WARNING for all platforms. No idea if its defined correctly on non-posix,non-windows platforms but this is no worse that the current situation where this macro is missing on unexpected platforms.
This fix should hopefully fix anyone whose build broke now that we've switched from using #warning to Pragma (to support windows). Unfortunately, while mongo-rocks compiles, it ignores the Pragma and doesn't print a warning. I have not been able to figure out a way to implement this portably on all platforms.
Of course, an alternate solution would be to just get rid of ROCKSDB_WARNING and live with include file redirects indefinitely. Thoughts?
Test Plan: build rocks, build mongorocks
Reviewers: igor, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42477
Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.
Test Plan: make check
Reviewers: noetzli, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42405
Summary:
MergeUntil was not reporting a success when merging an operand with
a Value/Deletion despite the comments in MergeHelper and CompactionJob
indicating otherwise. This lead to operands being written to the compaction
output unnecessarily:
M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M2 M3 M4 M5 (before the diff)
M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M4 M5 (after the diff)
In addition, the code handling Values/Deletion was basically identical.
This patch unifies the code. Finally, this patch also adds testing for
merge_helper.
Test Plan: make && make check
Reviewers: sdong, rven, yhchiang, tnovak, igor
Reviewed By: igor
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42351
- Remove make file defines from public headers and use _WIN32 because it is compiler defined
- use __GNUC__ and __clang__ to guard non-portable attributes
- add #include "port/port.h" to some new .cc files.
- minor changes in CMakeLists to reflect recent changes
Summary: Moved convenience.h out of utilities to remove a dependency on utilities in db.
Test Plan: unit tests. Also compiled a link to the old location to verify the _Pragma works.
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42201
Summary:
Trvial move in universal compaction was failing when trying to move files from levels other than 0.
This was because the DeleteFile while trivially moving, was only deleting files of level 0 which caused duplication of same file in different levels.
This is fixed by passing the right level as argument in the call of DeleteFile while doing trivial move.
Test Plan: ./db_test ran successfully with the new test cases.
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42135
Summary: Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc
Test Plan:
db_test
db_log_iter_test
Reviewers: sdong, IslamAbdelRahman, igor, anthony
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42045
Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.
Test Plan: none
Reviewers: sdong, yhchiang, anthony, dhruba
Reviewed By: dhruba
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42063
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.
Test Plan: make check
Reviewers: anthony, kradhakrishnan, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41937
Summary:
Move global static functions in db_test_util to DBTestBase.
This is to prevent unused function warning when decoupling
db_test.cc into multiple files.
Test Plan: db_test
Reviewers: igor, sdong, anthony, IslamAbdelRahman, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D42009
Summary:
Move reusable part of db_test.cc to util/db_test_util.h.
This makes it more possible to partition db_test.cc into
multiple smaller test files.
Also, fixed many old lint errors in db_test.
Test Plan: db_test
Reviewers: igor, anthony, IslamAbdelRahman, sdong, kradhakrishnan
Reviewed By: sdong, kradhakrishnan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41973
Summary:
Added new statistics in CompactionJobStats to keep track of
deletion entries and the expiration of those entries. Updated these
fields in compaction_job.cc as compaction took place and wrote a new
test in compaction_job_stats_test.cc to verify accuracy.
Test Plan:
Wrote new test DeletionStatsTest in
compaction_job_stats_test.cc to verify
Reviewers: sdong, igor, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41355
Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more.
Test Plan: Build it.
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41961
Summary:
Public API depends on port/port.h which is wrong. Fix it.
Also with gcc 4.8.1 build was broken as MAX_INT32 was not recognized. Fix it by using ::max in linux.
Test Plan: Build it and try to build an external project on top of it.
Reviewers: anthony, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41745
Summary: Print whether fast CRC32 is supported in DB info LOG
Test Plan: Run db_bench and see it prints out correctly.
Reviewers: yhchiang, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41733
Summary:
new_table_iterator_nanos is not cleaned in PerfContext::Reset() while new_table_block_iter_nanos is cleaned twice. Fix it.
Also fix a comment.
Test Plan: Build and db_bench with --perf_context to see the value shown.
Reviewers: kradhakrishnan, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41721
Summary: Add a perf context counter to help users figure out time spent on reading indexes and bloom filter blocks.
Test Plan: Will write a unit test
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41433
Summary:
While profiling compaction in our service I noticed a lot of CPU (~15% of compaction) being spent in MergingIterator and key comparison. Looking at the code I found MergingIterator was (understandably) using std::priority_queue for the multiway merge.
Keys in our dataset include sequence numbers that increase with time. Adjacent keys in an L0 file are very likely to be adjacent in the full database. Consequently, compaction will often pick a chunk of rows from the same L0 file before switching to another one. It would be great to avoid the O(log K) operation per row while compacting.
This diff replaces std::priority_queue with a custom binary heap implementation. It has a "replace top" operation that is cheap when the new top is the same as the old one (i.e. the priority of the top entry is decreased but it still stays on top).
Test Plan:
make check
To test the effect on performance, I generated databases with data patterns that mimic what I describe in the summary (rows have a mostly increasing sequence number). I see a 10-15% CPU decrease for compaction (and a matching throughput improvement on tmpfs). The exact improvement depends on the number of L0 files and the amount of locality. Performance on randomly distributed keys seems on par with the old code.
Reviewers: kailiu, sdong, igor
Reviewed By: igor
Subscribers: yoshinorim, dhruba, tnovak
Differential Revision: https://reviews.facebook.net/D29133
Summary:
Introduced a new category in the enum InfoLogLevel in env.h.
Modifed Log() in env.cc to use the Header()
when the InfoLogLevel == HEADER_LEVEL.
Updated tests in auto_roll_logger_test to ensure
the header is handled properly in these cases.
Test Plan: Augment existing tests in auto_roll_logger_test
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41067
Summary:
Add a new field: BackupableDBOptions.max_background_copies.
CreateNewBackup() and RestoreDBFromBackup() will use this number of threads to perform copies.
If there is a backup rate limit, then max_background_copies must be 1.
Update backupable_db_test.cc to test multi-threaded backup and restore.
Update backupable_db_test.cc to test backups when the backup environment is not the same as the database environment.
Test Plan:
Run ./backupable_db_test
Run valgrind ./backupable_db_test
Run with TSAN and ASAN
Reviewers: yhchiang, rven, anthony, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, anthony, sdong, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40725
invalid suffix on literal
no return statement in function returning non-void CuckooStep::operator=
extra qualification ‘rocksdb::spatial::Variant::
dereferencing type-punned pointer will break strict-aliasing rules
Summary: Make RocksDb build and run on Windows to be functionally
complete and performant. All existing test cases run with no
regressions. Performance numbers are in the pull-request.
Test plan: make all of the existing unit tests pass, obtain perf numbers.
Co-authored-by: Praveen Rao praveensinghrao@outlook.com
Co-authored-by: Sherlock Huang baihan.huang@gmail.com
Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.
Supports snapshots and merge operations.
Test Plan: Ran `make valgrind_check commit-prereq`
Reviewers: igor, philipp, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39849
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.
1. RecoverAfterRestart (kTolerateCorruptedTailRecords)
This mocks the current recovery mode.
2. RecoverAfterCleanShutdown (kAbsoluteConsistency)
This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.
3. RecoverPointInTime (kPointInTimeRecovery)
This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.
4. RecoverAfterDisaster (kSkipAnyCorruptRecord)
This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.
Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.
Reviewers: leveldb, sdong, igor
Subscribers: yoshinorim, dhruba
Differential Revision: https://reviews.facebook.net/D38487
Summary: MyRocks need a mechanism to track read outliers. We need to expose this
stat.
Test Plan: None
Reviewers: sdong
CC: leveldb
Task ID: #7152512
Blame Rev:
Summary: See title
Test Plan: Run valgrind ./cache_test
Reviewers: igor
Reviewed By: igor
Subscribers: anthony, dhruba
Differential Revision: https://reviews.facebook.net/D40419
Summary: Make autovector_test runnable in ROCKSDB_LITE
Test Plan: autovector_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40245
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.
This patch fails DB::Open() if we don't support the compression that is specified in the options.
Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39687
Summary:
Add the funcion Cache.GetPinnedUsage() to return the memory size of entries
that are in use by the system (that is, all the entries not in the LRU list).
Test Plan:
Run ./cache_test and examine PinnedUsageTest.
Reviewers: tnovak, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40305
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.
Test Plan: Check out the LOG
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39729
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated
Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40209
Summary:
Before this patch, any function call to ThreadStatusUtil might automatically initialize and register the thread status data. However, if it is the user-thread making this call, the allocated thread-status-data will never be released as such threads are not managed by rocksdb.
In this patch, I remove the automatic-initialization part. Thread-status data is only initialized and uninitialized in Env during the thread creation and destruction.
Test Plan:
db_test
thread_list_test
listener_test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40017
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.
Test Plan: Add a test case
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40119
Summary: Removed two unused macros in iostats_context
Test Plan: make all check
Reviewers: sdong, rven, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40005
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.
The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work
hard_rate_limit is deprecated.
options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.
Test Plan: Add new unit tests in db_test
Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor
Reviewed By: igor
Subscribers: ikabiljo, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D36351
Summary:
Add Env::GetThreadID(), which returns the ID of the current thread.
In addition, make GetThreadList() and InfoLog use same unique ID for the same thread.
Test Plan:
db_test
listener_test
Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39735
Summary:
Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2
The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI
Test Plan: make check
Reviewers: yoshinorim, ott, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39693
Summary:
The type of smallest_output_key_prefix and largest_output_key_prefix
have been changed to std::string in https://reviews.facebook.net/D39537.
As a result, we shouldn't do smallest_output_key_prefix[0] = 0 in the
initialization.
Test Plan: compile db_test with tsan enabled and repeat DBTest.CompactionDeletionTrigger test to verify the tsan issue has been gone.
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39645
Summary:
Allow EventListener::OnCompactionCompleted to return CompactionJobStats,
which contains useful information about a compaction.
Example CompactionJobStats returned by OnCompactionCompleted():
smallest_output_key_prefix 05000000
largest_output_key_prefix 06990000
elapsed_time 42419
num_input_records 300
num_input_files 3
num_input_files_at_output_level 2
num_output_records 200
num_output_files 1
actual_bytes_input 167200
actual_bytes_output 110688
total_input_raw_key_bytes 5400
total_input_raw_value_bytes 300000
num_records_replaced 100
is_manual_compaction 1
Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats.
Reviewers: rven, igor, anthony, sdong
Reviewed By: sdong
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38463
Summary:
We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls:
- perf_context.db_mutex_lock_nanos
- perf_context.db_condition_wait_nanos
- iostats_context.open_time
- iostats_context.allocate_time
- iostats_context.write_time
- iostats_context.range_sync_time
- iostats_context.logger_time
In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long.
Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took.
Reviewers: rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: march, dhruba, tnovak
Differential Revision: https://reviews.facebook.net/D39177
Summary: It used to be no good (known to me) non-intrusive way to wrap WritableFile - you can't call protected virtual methods of the wrapped pointer to WritableFile. This diff adds a convenience class WritableFileWrapper that makes wrapping WritableFile both possible and easy.
Test Plan: `make clean; make -j release`, `make clean; OPT=-DROCKSDB_LITE make release`, `make clean; USE_CLANG=1 make -j all`.
Reviewers: sdong, yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, tnovak, march
Differential Revision: https://reviews.facebook.net/D39147
Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported.
Test Plan: Added a new test, but still need to look into stress testing.
Reviewers: yhchiang, igor, rven, sdong
Reviewed By: sdong
Subscribers: adamretter, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D33435
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.
After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.
This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).
However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.
Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here.
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37443
Summary:
Allow EventLogger to directly log from a JSONWriter. This allows
the JSONWriter to be shared by EventLogger and potentially EventListener,
which is an important step to integrate EventLogger and EventListener.
This patch also rewrites EventLoggerHelpers::LogTableFileCreation(),
which uses the new API to generate identical log.
Test Plan:
Run db_bench in debug mode and make sure the log is correct and no
assertions fail.
Reviewers: sdong, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38709
Summary: Having stats in our LOG more often will help a lot with perf debugging.
Test Plan: none
Reviewers: sdong, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38781
Summary: Rename JSONWritter to JSONWriter
Test Plan: make
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38733
Summary:
sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html
Some users don't want us to call sync_file_range on WALs. Some other do.
Thus, I'm adding a separate option wal_bytes_per_sync to control calling
sync_file_range on WAL files. bytes_per_sync will apply only to table
files now.
Test Plan: no more sync_file_range for WAL as evidenced by strace
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38253
Summary:
Fixed the following compile errors due to some gcc does not have std::map::emplace
util/thread_status_impl.cc: In static member function ‘static std::map<std::basic_string<char>, long unsigned int> rocksdb::ThreadStatus::InterpretOperationProperties(rocksdb::ThreadStatus::OperationType, const uint64_t*)’:
util/thread_status_impl.cc:88:20: error: ‘class std::map<std::basic_string<char>, long unsigned int>’ has no member named ‘emplace’
util/thread_status_impl.cc:90:20: error: ‘class std::map<std::basic_string<char>, long unsigned int>’ has no member named ‘emplace’
util/thread_status_impl.cc:94:20: error: ‘class std::map<std::basic_string<char>, long unsigned int>’ has no member named ‘emplace’
util/thread_status_impl.cc:96:20: error: ‘class std::map<std::basic_string<char>, long unsigned int>’ has no member named ‘emplace’
util/thread_status_impl.cc:98:20: error: ‘class std::map<std::basic_string<char>, long unsigned int>’ has no member named ‘emplace’
util/thread_status_impl.cc:101:20: error: ‘class std::map<std::basic_string<char>, long unsigned int>’ has no member named ‘emplace’
make: *** [util/thread_status_impl.o] Error 1
Test Plan: make db_bench
Reviewers: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38643
Summary: Call Flush() function instead
Test Plan: make all check
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D38583
Summary:
Allow GetThreadList to report Flush properties, which includes:
* job id
* number of bytes that has been written since flush started.
* total size of input mem-tables
Test Plan:
./db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=100 --value_size=1000
Sample output from db_bench which tracks same flush job
ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties
140213879898240 High Pri default Flush 5789 us FlushJob::WriteLevel0Table BytesMemtables 4112835 | BytesWritten 577104 | JobID 8 |
ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties
140213879898240 High Pri default Flush 30.634 ms FlushJob::WriteLevel0Table BytesMemtables 4112835 | BytesWritten 1734865 | JobID 8 |
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38505
Summary: Make ThreadStatus::InterpretOperationProperties take const uint64_t*
Test Plan:
make
make OPT=-DROCKSDB_LITE shared_lib
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38445