Summary:
When seek target is a merge key (`kTypeMerge`), `DBIter::FindNextUserEntry()`
advances the underlying iterator _past_ the current key (`saved_key_`); see
`MergeValuesNewToOld()`. However, `FindPrevUserKey()` assumes that `iter_`
points to an entry with the same user key as `saved_key_`. As a result,
`it->Seek(key) && it->Prev()` can cause the iterator to be positioned at the
_next_, instead of the previous, entry (new test, written by @lovro, reproduces
the bug).
This diff changes `FindPrevUserKey()` to also skip keys that are _greater_ than
`saved_key_`.
Test Plan: db_test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba, lovro
Differential Revision: https://reviews.facebook.net/D40791
Summary: Make column_family_test runnable in ROCKSDB_LITE.
Test Plan: column_family_test
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40251
Summary: Based on @anthony's feedback, we want to fail early if our static linking fails.
Test Plan: none
Reviewers: anthony
Reviewed By: anthony
Subscribers: dhruba, anthony, leveldb
Differential Revision: https://reviews.facebook.net/D40839
Summary:
Currently, when we insert something into block cache, we say that the block cache capacity decreased by the size of the block. However, size of the block might be less than the actual memory used by this object. For example, 4.5KB block will actually use 8KB of memory. So even if we configure block cache to 10GB, our actually memory usage of block cache will be 20GB!
This problem showed up a lot in testing and just recently also showed up in MongoRocks production where we were using 30GB more memory than expected.
This diff will fix the problem. Instead of counting the block size, we will count memory used by the block. That way, a block cache configured to be 10GB will actually use only 10GB of memory.
I'm using non-portable function and I couldn't find info on portability on Google. However, it seems to work on Linux, which will cover majority of our use-cases.
Test Plan:
1. fill up mongo instance with 80GB of data
2. restart mongo with block cache size configured to 10GB
3. do a table scan in mongo
4. memory usage before the diff: 12GB. memory usage after the diff: 10.5GB
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40635
Summary: It's not really nice to call user's API with garbage data in new_value. This diff makes sure that new_value is empty before calling the merge operator.
Test Plan: Added assert to Merge operator in merge_test
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40773
Summary: If we create a new temp directory for each build, scons will recompile everything because we have different parameters. Instead, let's set up a constant path to our static lib. That way we won't have to recompile.
Test Plan: Run fb_compile_mongo.sh twice -- second time it didn't recompile everything
Reviewers: MarkCallaghan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40707
Summary:
Fixes task 7156865 where a compaction causes a hang in flush
memtable if CancelAllBackgroundWork was called prior to it.
Stack trace is in : https://phabricator.fb.com/P19848829
We end up waiting for a flush which will never happen because there are no background threads.
Test Plan: PreShutdownFlush
Reviewers: sdong, igor
Reviewed By: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40617
Summary: #7124486: RocksDB's Iterator.SeekToLast should seek to the last key before iterate_upper_bound if presents
Test Plan: ./db_iter_test run successfully with the new testcase
Reviewers: rven, yhchiang, igor, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40425
Summary:
Added a script that will compile MongoRocks with the same flags as RocksDB binary. On FB infra, we can now do:
cd ~/rocksdb; make static_lib
cd ~/mongo; ~/rocksdb/build_tools/fb_compile_mongo.sh
No need to upgrade the g++ on the devbox (like Aaron and I did) or maintain a separate script to compile (like Mark did)
fb_compile_mongo.sh gets the settings from fbcode_config.sh, so it also makes it easier to upgrade the environment one day.
Test Plan: Compiled mongod with new script. Also, ldd output looks good: https://phabricator.fb.com/P19891602
Reviewers: AaronFeldman, MarkCallaghan, anthony
Reviewed By: anthony
Subscribers: anthony, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40659
Summary: Make stringappend_test runnable in ROCKSDB_LITE
Test Plan: stringappend_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40593
Summary:
BYTES_READ only count the number of logical bytes read from
the DB::Get() function. It neither includes all logical bytes read
nor indicates IO read bytes.
This patch improves the comment for BYTES_READ.
Test Plan: Only change comment.
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40599
Summary: Make table_properties_collector_test runnable in ROCKSDB_LITE
Test Plan: table_properties_collector_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40581
Summary:
Remove -Wl,--no-as-needed flag when making shared_lib in OSX and IOS as
those environment doe not have compile option --no-as-needed
ld: unknown option: --no-as-needed
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Test Plan: make shared_lib
Reviewers: meyering, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40353
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.
Test Plan: make check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40527
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.
Supports snapshots and merge operations.
Test Plan: Ran `make valgrind_check commit-prereq`
Reviewers: igor, philipp, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39849
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.
1. RecoverAfterRestart (kTolerateCorruptedTailRecords)
This mocks the current recovery mode.
2. RecoverAfterCleanShutdown (kAbsoluteConsistency)
This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.
3. RecoverPointInTime (kPointInTimeRecovery)
This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.
4. RecoverAfterDisaster (kSkipAnyCorruptRecord)
This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.
Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.
Reviewers: leveldb, sdong, igor
Subscribers: yoshinorim, dhruba
Differential Revision: https://reviews.facebook.net/D38487
Summary: Fixing bad merge
Test Plan: make -j64 check (this is not enough to verify the fix)
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40521
Summary: MyRocks need a mechanism to track read outliers. We need to expose this
stat.
Test Plan: None
Reviewers: sdong
CC: leveldb
Task ID: #7152512
Blame Rev:
Summary: Fix broken gflags link
Test Plan: Follow the link
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40503
Summary: Fixed a valgrind issue in checkpoint_test
Test Plan: valgrind on checkpoint_test
Reviewers: igor, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40455
Summary: Recent checkin added ldb_test.py to the make check target but the test fails. Remove it again for now and make task.
Test Plan: No more ldb_tests.py running
Reviewers: igor, anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40449
Summary: Hack up rocksdb_dump and rocksdb_undump utilities to get this task rolling/promote discussion.
Test Plan: Dump/undump databases recursively to see if nothing is lost.
Reviewers: sdong, yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37269
Summary:
When there are multiple column families, the flush in
GetLiveFiles is not atomic, so that there are entries in the wal files
which are needed to get a consisten RocksDB. We now add the log files to
the checkpoint.
Test Plan:
CheckpointCF - This test forces more data to be written to
the other column families after the flush of the first column family but
before the second.
Reviewers: igor, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40323
Summary: See title
Test Plan: Run valgrind ./cache_test
Reviewers: igor
Reviewed By: igor
Subscribers: anthony, dhruba
Differential Revision: https://reviews.facebook.net/D40419
Summary: CompressLevelCompaction() depends on Zlib. We should skip it when zlib is not present.
Test Plan: `make check` without zlib
Reviewers: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40401
Summary: Make autovector_test runnable in ROCKSDB_LITE
Test Plan: autovector_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40245
Summary:
Block geodb_test in ROCKSDB_LITE as geodb is not supported
in ROCKSDB_LITE
Test Plan: geodb_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40335
Summary:
Remove compactor_test, which depends on a directory not exist
in our code base.
make compactor_test
GEN util/build_version.cc
GEN util/build_version.cc
make: *** No rule to make target `utilities/compaction/compactor_test.o', needed by `compactor_test'. Stop.
Test Plan: verify the output message of make compactor_test
Reviewers: rven, anthony, kradhakrishnan, igor, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40341
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.
This patch fails DB::Open() if we don't support the compression that is specified in the options.
Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39687
Summary:
Add the funcion Cache.GetPinnedUsage() to return the memory size of entries
that are in use by the system (that is, all the entries not in the LRU list).
Test Plan:
Run ./cache_test and examine PinnedUsageTest.
Reviewers: tnovak, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40305
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level
Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set
Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way
Test Plan:
make check
adding new tests
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40059
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.
Test Plan: Check out the LOG
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39729
Summary:
Universal compaction can involves in multiple levels. However,
the current implementation of bytes_readn and bytes_readnp1
(and some other stats with postfix `n` and `np1`) assumes compaction
can only have two levels.
This patch fixes this bug and redefines bytes_readn and bytes_readnp1:
* bytes_readnp1: the number of bytes read in the compaction output level.
* bytes_readn: the total number of bytes read minus bytes_readnp1
Test Plan: Add a test in compaction_job_stats_test
Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40239
Summary:
So far, we benchmarked RocksDB by writing as fast as possible. With this change, we're able to limit our write throughput, which should help us better understand how RocksDB performes under varying write workloads.
Specifically, I'm currently interested in the shape of the graph that has write throughput on one axis and write rate on another. This should help us with designing our stall system, as we have started to do with D36351.
Test Plan:
$ ./db_bench --benchmarks=fillrandom --benchmark_write_rate_limit=1000000
fillrandom : 118.523 micros/op 8437 ops/sec; 0.9 MB/s
$ ./db_bench --benchmarks=fillrandom --benchmark_write_rate_limit=2000000
fillrandom : 59.136 micros/op 16910 ops/sec; 1.9 MB/s
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39759
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated
Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40209
Summary:
We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of write thread. But anyway, it's still called in the mutex, so it's not that much cheaper.
This diff removes the "optimization" and cleans up the code a bit.
Test Plan: make check
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40113
Summary:
Before this patch, any function call to ThreadStatusUtil might automatically initialize and register the thread status data. However, if it is the user-thread making this call, the allocated thread-status-data will never be released as such threads are not managed by rocksdb.
In this patch, I remove the automatic-initialization part. Thread-status data is only initialized and uninitialized in Env during the thread creation and destruction.
Test Plan:
db_test
thread_list_test
listener_test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40017
Summary: Block c_test in ROCKSDB_LITE as it's not supported in ROCKSDB_LITE.
Test Plan: c_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40257
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.
Test Plan: Add a test case
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40119
Summary:
This is part of an effort to better understand and optimize RocksDB stalls under high load. I added a feature to db_bench to periodically write QPS to CSV files. That way we can nicely see how our QPS changes in time (especially when DB is stalled) and can do a better job of evaluating our stall system (i.e. we want the QPS to be as constant as possible, as opposed to having bunch of stalls)
Cool part of CSV files is that we can easily graph them -- there are a bunch of tools available.
Test Plan:
Ran ./db_bench --report_interval_seconds=10 --benchmarks=fillrandom --num=10000000
and observed this in report.csv:
secs_elapsed,interval_qps
10,2725860
20,1980480
30,1863456
40,1454359
50,1460389
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40047