Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
MyRocks build is broken because they are using "-Werror=missing-field-initializers"
We should fix that by explicitly passing these arguments
Test Plan: Build MyRocks
Reviewers: sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64161
Summary: 0.9 can make the test flaky since just found one test fail with 0.88
Test Plan: make all check
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63939
Summary: EnvPosixTestWithParam.TwoPools relies on explicit sleeping, so it sometimes fail. Fix it.
Test Plan: Run tests with high parallelism many times and make sure the test passes.
Reviewers: yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63417
Summary: Add mutable options info into `OptionsTypeInfo` and use it to parse mutable options map. Also support `max_bytes_for_level_multiplier_additional` in option file.
Test Plan: unit test
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63843
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63945
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file
I will implement it for other Envs later after finishing the whole task for AddFile()
Test Plan: unit tests
Reviewers: andrewkr, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62433
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
Summary: Fix two Windows build problems.
Test Plan: Build on Windows and run all Linux tests.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63189
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).
* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.
Test Plan: make all check
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62193
Summary:
To reduce contention for atomics when HistogramStats are shared across
threads, this diff makes them thread-specific so updates are faster. This comes
at the expense of slower reads (much less frequent), which now require merging
all histograms. In this diff,
- Thread-specific HistogramImpl is created upon the thread's first measureTime()
- Thread-specific HistogramImpl are merged and deleted upon thread termination or ThreadLocalPtr destruction, whichever comes first
- getHistogramString() and histogramData() merge all histograms, both thread-specific and previously merged ones
Test Plan:
unit tests, ran db_bench and verified histograms look similar
before:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
...
+ 7.63% db_bench db_bench [.] rocksdb::HistogramStat::Add
after:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
...
+ 0.98% db_bench db_bench [.] rocksdb::HistogramStat::Add
Reviewers: sdong, MarkCallaghan, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62649
Summary:
Fix ClockCache memory leak found by valgrind:
# Add destructor to cleanup cached values.
# Delete key with cache handle immediately after handle is recycled, and erase table entry immediately if duplicated cache entry is inserted.
Test Plan:
make DISABLE_JEMALLOC=1 valgrind_check
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62973
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.
We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification
Test Plan: added new tests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: yiwu, leveldb, march, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58707
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)
This allow us to expose ThreadPool to the user so we can use it as an option later
Test Plan: existing unit tests
Reviewers: andrewkr, yiwu, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62085
Summary:
The global atomics we previously used for tickers had poor cache performance
since they were typically updated from different threads, causing frequent
invalidations. In this diff,
- recordTick() updates a local ticker value specific to the thread in which it was called
- When a thread exits, its local ticker value is added into merged_sum
- getTickerCount() returns the sum of all threads' local ticker values and the merged_sum
- setTickerCount() resets all threads' local ticker values and sets merged_sum to the value provided by the caller.
In a next diff I will make a similar change for histogram stats.
Test Plan:
before:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
$ perf report -g --stdio | grep recordTick
7.59% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick
...
after:
$ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
$ perf report -g --stdio | grep recordTick
1.46% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick
...
Reviewers: kradhakrishnan, MarkCallaghan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: yiwu, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62337
Summary:
We used to allow insert into full block cache as long as `strict_capacity_limit=false`. This diff further restrict insert to full cache if caller don't intent to hold handle to the cache entry after insert.
Hope this diff fix the assertion failure with db_stress: https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=211853102&step_id=2475070014
db_stress: util/lru_cache.cc:278: virtual void rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*): Assertion `lru_.next == &lru_' failed.
The assertion at lru_cache.cc:278 can fail when an entry is inserted into full cache and stay in LRU list.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62325
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.
Depends on D61977.
Test Plan: See unit test.
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D62241
Summary:
This function allows the user to provide a custom function to fold all
threads' local data. It will be used in my next diff for aggregating statistics
stored in thread-local data. Note the test case uses atomics as thread-local
values due to the synchronization requirement (documented in code).
Test Plan: unit test
Reviewers: yhchiang, sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62049
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.
Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.
In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.
Test Plan: unit test.
Reviewers: IslamAbdelRahman, sdong, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D61977
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader
Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )
Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61473
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.
Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.
Reviewers: kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61647
* Create rate limiter using factory function in the test.
* Convert function local statics in option helper to a C array
that does not perform dynamic memory allocation. This is helpful
when you try to memory isolate different DB instances.
Summary: ... so that I can include the header and create LRUCache specific tests for D61977
Test Plan:
make check
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62145
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).
Test Plan: Tested the statistics using write_stress and compaction_job_stats_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61749
Summary:
We were frequently seeing a race between SyncPoint::Process() and
SyncPoint::~SyncPoint() (e.g.,
https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=207289975&step_id=2412725431).
The issue was marked_thread_id_ gets deleted when the main thread is exiting and
simultaneously background threads may access it. We can prevent this race
condition by checking whether sync points are disabled (assuming the test terminates
with them disabled) before attempting to access that member. I do not understand
why accesses to other members (mutex_ and enabled_) are ok but anyways the
test no longer fails tsan.
Test Plan: ran tests
Reviewers: sdong, yhchiang, IslamAbdelRahman, yiwu, wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62133
Summary:
Env holds a pointer of ThreadStatusUpdater, which will be deleted when
Env is deleted. However, in case a rocksdb database is deleted after
Env is deleted. Then this will introduce a free-after-use of this
ThreadStatusUpdater.
This patch fix this by never deleting the ThreadStatusUpdater in Env,
which is in general safe as Env is a singleton in most cases.
Test Plan: thread_list_test
Reviewers: andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59187
Summary: preparation for detecting Cache type. If SimCache, we then may trigger some command like "setSimCapacity()" with setOptions()
Test Plan: make all check
Reviewers: yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61953
Summary:
This diff improves the documentation for GetOptionsFromMap APIs and
fixes a bug in GetOptionsFromMap functions in convenience.h
where new_options will still be changed when the function
call is not successful.
Test Plan: options_test
Reviewers: IslamAbdelRahman, kradhakrishnan, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61731
From the Linux manual:
MAP_ANONYMOUS
The mapping is not backed by any file; its contents
are initialized to zero. The fd and offset arguments are
ignored; however, some implementations require fd to be -1
if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable
applications should ensure this.
FreeBSD is such a case, it wil just return an error.
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).
Test Plan: Add `date_tiered_test`.
Reviewers: dhruba, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61653
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.
Test Plan: make all check -j64
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61239
Summary:
The patch is a continuation of part 5. It glues the abstraction for
file layout and metadata, and flush out the implementation of the API. It
adds unit tests for the implementation.
Test Plan: Run unit tests
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57549
Summary:
The call stack used to look like this during static initialization:
#0 0x00000000008032d1 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:172
#1 0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
#2 0x000000000080310f in rocksdb::ThreadLocalPtr::StaticMeta::Mutex() () at util/thread_local.cc:141
#3 0x0000000000803103 in rocksdb::ThreadLocalPtr::StaticMeta::InitSingletons() () at util/thread_local.cc:139
#4 0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106
It involves outer/inner classes and the call stacks goes
outer->inner->outer->inner, which is too difficult to understand. We can avoid
a level of back-and-forth by skipping StaticMeta::InitSingletons(), which
doesn't initialize anything beyond what ThreadLocalPtr::Instance() already
initializes.
Now the call stack looks like this during static initialization:
#0 0x00000000008032c5 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:170
#1 0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
#2 0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106
Test Plan:
unit tests
verify StaticMeta::mutex_ is still initialized in DefaultEnv() (StaticMeta::mutex_ is the only variable intended to be initialized via StaticMeta::InitSingletons() which I removed)
#0 0x00000000005cee17 in rocksdb::port::Mutex::Mutex(bool) (this=0x7ffff69500b0, adaptive=false) at port/port_posix.cc:52
#1 0x0000000000769cf8 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff6950000) at util/thread_local.cc:168
#2 0x0000000000769a53 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:133
#3 0x0000000000769a09 in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:105
#4 0x0000000000647d98 in rocksdb::Env::Default() () at util/env_posix.cc:845
Reviewers: lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: arahut, IslamAbdelRahman, yiwu, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60813
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60513
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)
Test Plan: added new unit test
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61083
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.
Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.
We have specifically tested and adapted the IO path for NVM and SSD.
Persistent cache tier consists of there parts :
1) File layout
Provides the implementation for handling IO path for reading and writing data
(key/value pair).
2) Meta-data
Provides the implementation for handling the index for persistent read cache.
3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface
This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.
Test Plan: Compile and run check
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57117
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.
Test Plan:
make check -j64
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59655
Summary: In Zlib_Compress and BZip2_Compress the calculation for size was slightly off when using compression_foramt_version 2 (which includes the decompressed size in the output). Also there were unnecessary loops around the deflate/BZ2_bzCompress calls. In Zlib_Compress there was also a possible exit from the function after calling deflateInit2 that didn't call deflateEnd.
Test Plan: Standard tests
Reviewers: sdong, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: sdong, IslamAbdelRahman, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60537
Summary:
I was investigating performance issues in the SstFileWriter and found all of the following:
- The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time. Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function.
- In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key.
- The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed. I added the move constructor which also required a copy constructor and assignment operator.
- The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added. I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value.
- The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function.
- BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key). It is more efficient just to copy the whole new key over.
- Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on.
- FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time.
- Each PutVarint??() function has a buffer and calls std::string::append(). Two or three calls in a row could share a buffer and a single call to std::string::append().
Some of these will be helpful outside of the SstFileWriter. I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility? I tried it on gcc 4.8 and 4.9.
Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added. The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas.
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59607
Summary: this test support CLI option to select HdfsEnv/NativeHdfsEnv now. The latter one is default. add test about when Rename(src, target) should overwrite target
Test Plan: existing test
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60399
Summary:
Use @omegaga's awesome feature to avoid use of callbacks for ensuring
SyncPoints happen in a particular thread.
Depends on D60375.
Test Plan:
$ ./auto_roll_logger_test
Reviewers: omegaga, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, omegaga, leveldb
Differential Revision: https://reviews.facebook.net/D60471
Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point.
Test Plan: Write a unit test to validate.
Reviewers: sdong, IslamAbdelRahman, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60375
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
Summary: LockFile is unnecessary in unit test
Test Plan: env_basic_test.cc
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60285
Summary:
Previously we couldn't run env_basic_test on Env::Default (PosixEnv on
our platforms) since GetChildren*() behavior was inconsistent with our other
Envs. We can normalize the output of GetChildren*() such that these test cases
work on PosixEnv too.
Test Plan: ran env_basic_test
Reviewers: wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59943
Summary:
move cleanup to TearDown and handle directories, so cleanup will happen
even if a test fails in the middle.
Test Plan: ./env_basic_test
Reviewers: wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60243
Summary: ensure no 2nd level children under test_dir_
Test Plan: env_basic_test on 4 envs
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59979
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.
Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59499
Summary: filter_deltes is not a frequently used feature. Remove it.
Test Plan: Run all test suites.
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59427
Summary: We introduced default slow down and stop condition, but didn't reset it in bulk load mode. Fix it.
Test Plan: N/A
Reviewers: igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59757
Summary: Fixed a crash bug that incorrectly parse deprecated options in options_helper
Test Plan:
run db_bench with an old options file with memtable_prefix_bloom_probes
./db_bench --options_file=AN_OLD_OPTIONS_FILE --num=100 --benchmarks=fillseq
Reviewers: sdong, IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59787
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.
Test Plan: See unit test.
Reviewers: dhruba, horuff, sdong
Reviewed By: sdong
Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57813
Summary:
Try to decompress compressed blocks when a special flag is set.
assert and crash in debug builds if we can't decompress the just-compressed input.
Test Plan: Run unit-tests.
Reviewers: dhruba, andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59145
Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them.
Test Plan: Run existing tests
Reviewers: andrewkr, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59409
Summary: delete_scheduler_test and db_sst_test share a same directory name, causing possible fails on both tests when running in parallel. Fixed by changing directory name.
Test Plan: Run the two tests in parallel: `parallel -u ./{} ::: delete_scheduler_test db_sst_test`
Reviewers: sdong, andrewkr
Reviewed By: sdong, andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59529
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio
Test Plan: Run all existing tests
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59199
Summary: AFIK, options builder is not used by anyone. Remove it.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, andrewkr, igor
Reviewed By: igor
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59319
GCC 5.4 will complain (see also options_parser.cc):
/home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc: In function 'rocksdb::CompactionStyle rocksdb::{anonymous}::PickCompactionStyle(size_t, int, int, uint64_t)':
/home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:29:7: error: 'log' is not a member of 'std'
std::log(target_db_size / write_buffer_size) / std::log(kBytesForLevelMultiplier)));
^
/home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:29:7: note: suggested alternative:
In file included from /usr/include/features.h:365:0,
from /usr/include/math.h:26,
from /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:6:
/usr/include/bits/mathcalls.h:109:1: note: 'log'
__MATHCALL_VEC (log,, (_Mdouble_ __x));
Summary:
- Provide env_test as a static library. We will build it for future releases so internal Envs can use env_test by linking against this library.
- Add tests for CustomEnv, which is configurable via ENV_TEST_URI environment variable. It uses the URI-based Env lookup (depends on D58449).
- Refactor env_basic_test cases to use a unique/configurable directory for test files.
Test Plan:
built a test binary against librocksdb_env_test.a. It registered the
default Env with URI prefix "a://".
- verify runs all CustomEnv tests when URI with correct prefix is provided
```
$ ENV_TEST_URI="a://ok" ./tmp --gtest_filter="CustomEnv/*"
...
[ PASSED ] 12 tests.
```
- verify runs no CustomEnv tests when URI with non-matching prefix is provided
```
$ ENV_TEST_URI="b://ok" ./tmp --gtest_filter="CustomEnv/*"
...
[ PASSED ] 0 tests.
```
Reviewers: ldemailly, IslamAbdelRahman, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58485
Summary:
Extracted basic Env-related tests from mock_env_test and memenv_test into a
parameterized test for Envs: env_basic_test.
Depends on D58449. (The dependency is here only so I can keep this series of
diffs in a chain -- there is no dependency on that diff's code.)
Test Plan: ran tests
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58635
Summary:
Direct IO checkin breaks Windows build. Fixing the code to work for
Windows.
Test Plan: Run env_test in Windows 10 and make check in Linux
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59073
Summary: When rate_bytes_per_sec * refill_period_us_ overflows, the actual limited rate is very low. Handle this case so the rate will be large.
Test Plan: Add a unit test for it.
Reviewers: IslamAbdelRahman, andrewkr
Reviewed By: andrewkr
Subscribers: yiwu, lightmark, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58929
Summary:
Made it consistent with posix Env, which uses pread() that returns 0
(success) when an offset is given beyond EOF. The purpose of making these Envs
behave consistently is I am repurposing the in-memory Envs' tests for the basic
Env tests in D58635.
Test Plan: ran mock_env_test and memenv_test
Reviewers: IslamAbdelRahman, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58845
Summary: util/threadpool.cc's function name is the same as a well-known class name. It breaks unity build. Rename it.
Test Plan: Run all existing test.
Reviewers: yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58881
Summary:
O_DIRECT is not available in Mac as a flag for open. The fix is to make
use of fctl after the file is opened
Test Plan: Run the tests on mac and Linux
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58665
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options
Test Plan: UT
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, lgalanis, dhruba
Differential Revision: https://reviews.facebook.net/D58653
Summary:
This patch adds direct IO capability to RocksDB Env.
The direct IO capability is required for persistent cache since NVM is best
accessed as 4K direct IO. SSDs can leverage direct IO for reading.
Direct IO requires the offset and size be sector size aligned, and memory to
be kernel page aligned. Since neither RocksDB/Persistent read cache data
layout is aligned to sector size, the code can accommodate reading unaligned IO size
(or unaligned memory) at the cost of an alloc/copy.
The write code path expects the size and memory to be aligned.
Test Plan: Run RocksDB unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57393
Summary:
Persistent read cache relies on the accuracy of the GetUniqueIdFromFile
to generate a unique key for a given block of data. Currently we don't have an
implementation for Mac.
This patch adds an implementation.
Test Plan: Run unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58413
Summary:
Expose a simple function to convert CompressionType to it's corresponding option string
This is for a diff @yoshinorim is working on for MyRocks
Test Plan: unittest
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D58215
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.
RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode.
Blame Rev:
Test Plan: Run unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55707
Conditionally retrofit thread_posix for use with std::thread
and reuse the same logic. Posix users continue using Posix interfaces.
Enable XPRESS compression in test runs.
Fix master introduced signed/unsigned mismatch.
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]
The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.
Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.
We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))
So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?
Test Plan: provided test.
Reviewers: sdong
Subscribers: andrewkr, leveldb, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D57645
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.
modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
- log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.
- Recovery/writepath is now aware of prepared batches and commit and rollback markers.
Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.
Reviewers: IslamAbdelRahman, sdong
Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D56919