Summary: MaybeDumpStats was causing a lock problem
Test Plan: make check; db_stress
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D10935
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By default, it's set to a large value, 3600 (1 hour).
Test Plan: make check;
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10761
Summary: There was an artificial limit on the write buffer size.
Test Plan: make check
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10911
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If the client is OK with losing the most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.
Test Plan: make check; db_stress
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, emayanke
Differential Revision: https://reviews.facebook.net/D10869
Summary:
Make the stop watch a simple implementation instead of a subclass of a virtual class.
Allocate stop watches on the stack instead of the heap.
Code is more terse now.
Test Plan: make all check, db_bench with --statistics=1
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10809
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOperator, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10773
Summary:
Currently, compaction filter is run on internal key older than the oldest snapshot, which is incorrect.
Compaction filter should really be run on the most recent internal key when there is no external snapshot.
Test Plan: make check; db_stress
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D10641
Summary:
WAL files are moved to the archive directory and cleared only at DB::Open.
This can lead to a lot of space consumption in a database. Added logic to periodically clear the archive directory too.
Test Plan: make all check + add unit test
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10617
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.
Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h
Please go over local/my_test.cc carefully, as it is a concrete use case.
Please also review the implementation files to see if the straw man implementation makes sense.
Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.
Future work:
- Integration with compaction
- A raw Get implementation
I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.
Test Plan: run all local tests
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb, zshao, sheki, emayanke, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D9651
Summary:
- don't see a point exposing table.h to the public.
- fixed make clean to remove also *.d files.
Test Plan: make check; db_stress
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10479
Summary:
- removed the compaction_filter_value from the callback interface. Restrict compaction filter to purging values.
- modify some comments to reflect current status.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10335
Summary: A better error message. A local change. Did not look at other places where this could be done.
Test Plan: compile
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10251
Summary:
FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior.
Didn't profile anything, but several things could be improved:
1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree).
You also don't know how many dynamic allocations occur just for building up this tree.
switched to std::vector, also added logic to pre-calculate total size and do just one allocation
2. Don't see why env_->GetChildren() needs to be mutex protected, moved to PurgeObsoleteFiles where
mutex could be unlocked.
3. switched std::set to std::unordered_set, the conversion from vector is also inside PurgeObsoleteFiles
I have a feeling this should pretty much fix it.
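A rough sketch of the pattern behind the three points above, not the actual db_impl.cc code. `FileTracker`, `CollectLiveFiles` and `FindObsolete` are hypothetical names, and the `candidates` argument stands in for the directory listing.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

// Hypothetical miniature: copy the live file numbers under the lock with a
// single allocation, then do the slower set building and filtering after
// the lock has been released.
struct FileTracker {
  std::mutex mu;
  std::vector<std::vector<uint64_t>> files_per_level;  // guarded by mu

  // Point 1: pre-calculate the total size, reserve once, then append.
  std::vector<uint64_t> CollectLiveFiles() {
    std::lock_guard<std::mutex> lock(mu);
    std::size_t total = 0;
    for (const auto& level : files_per_level) total += level.size();
    std::vector<uint64_t> live;
    live.reserve(total);
    for (const auto& level : files_per_level) {
      live.insert(live.end(), level.begin(), level.end());
    }
    return live;
  }

  // Points 2 and 3: the hash set is built and probed without holding mu.
  std::vector<uint64_t> FindObsolete(const std::vector<uint64_t>& candidates) {
    const std::vector<uint64_t> live = CollectLiveFiles();
    const std::unordered_set<uint64_t> live_set(live.begin(), live.end());
    std::vector<uint64_t> obsolete;
    for (uint64_t f : candidates) {
      if (live_set.count(f) == 0) obsolete.push_back(f);
    }
    return obsolete;
  }
};
```

The lock is held only for the copy, so its hold time grows with the number of live files but not with the cost of hashing or directory scanning.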
Test Plan: make check; db_stress
Reviewers: dhruba, heyongqiang, MarkCallaghan
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10197
Summary: using unique_ptr to have automatic delete for probableWALfiles in db_impl.cc
Test Plan: make
Reviewers: sheki, dhruba
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10083
Summary:
The segfault was happening because the program was unable to open a new
sst file (as part of the compaction) because the process ran out of
file descriptors.
The fix is to check the return status of the file creation before taking
any other action.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fabf03f9700 (LWP 29904)]
leveldb::DBImpl::OpenCompactionOutputFile (this=this@entry=0x7fabf9011400, compact=compact@entry=0x7fabf741a2b0) at db/db_impl.cc:1399
1399 db/db_impl.cc: No such file or directory.
(gdb) where
Test Plan: make check
Reviewers: MarkCallaghan, sheki
Reviewed By: MarkCallaghan
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10101
Summary:
Transaction Log Iterator did not move to the next file in the series if there was a write batch at the end of the current file.
The solution: if the last sequence number of the current file is < RequestedSeqNo, assume the first sequence number of the next file satisfies the request.
Also major refactoring around the code. Moved opening the log reader to a separate function, got rid of goto.
Test Plan: added a unit test for it.
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb, emayanke
Differential Revision: https://reviews.facebook.net/D10029
Summary:
During recovery, last_updated_manifest number was not set if there were no records in the Write-ahead log.
Now check for the recovered manifest also and set last_updated_manifest file to the max value.
Test Plan: unit test
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9891
Summary:
If a class owns an object:
- If the object can be null => use a unique_ptr. no delete
- If the object can not be null => don't even need new, let alone delete
- for runtime sized array => use vector, no delete.
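The three rules can be sketched in one small class. `Owner`, `Logger` and `Options` here are made-up types for illustration, not the classes touched by this diff.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct Logger { std::string name; };
struct Options { int max_open_files = 1000; };

// Hypothetical owner that never calls delete explicitly.
class Owner {
 public:
  explicit Owner(std::size_t buffer_size) : buffer_(buffer_size, 0) {}

  void EnableLogging(const std::string& name) {
    logger_.reset(new Logger{name});  // any previous logger is freed here
  }

  bool HasLogger() const { return logger_ != nullptr; }
  std::size_t buffer_size() const { return buffer_.size(); }
  const Options& options() const { return options_; }

 private:
  std::unique_ptr<Logger> logger_;  // may be null: unique_ptr
  Options options_;                 // never null: plain member, no new
  std::vector<char> buffer_;        // runtime-sized array: vector
};
```

The compiler-generated destructor releases all three members, so the class needs no hand-written cleanup.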
Test Plan: make check
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb, zshao, sheki, emayanke, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D9783
Summary:
RocksDB does a binary search to look at the files which might contain the requested sequence number at the call GetUpdatesSince.
There was a bug in the binary search: when the file pointed to by the middle index of bsearch was empty/corrupt, it needs to resize the vector and update the indexes.
This now fixes that.
Test Plan: existing unit tests pass.
Reviewers: heyongqiang, dhruba
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9777
Summary:
If the vector returned by GetUpdatesSince is empty, it is still returned to the
user. This causes it to throw an std::range error.
The probable file list is checked and it returns an IOError status instead of OK now.
Test Plan: added a unit test.
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9771
Summary:
Use non-mmapped files for the write-ahead log.
The earlier use of mmapped files made the log iterator read ahead and miss records.
Now the reader and writer will point to the same physical location.
There is no perf regression :
./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
with this diff:
fillseq : 10.756 micros/op 185281 ops/sec; 20.5 MB/s
without this diff:
fillseq : 11.085 micros/op 179676 ops/sec; 19.9 MB/s
Test Plan: unit test included
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9741
Summary: rocksdb uses a single global lock to protect in memory metadata. We should minimize the mutex protected code section to increase the effective parallelism of the program. See https://our.intern.facebook.com/intern/tasks/?t=2218928
Test Plan:
make check
db_bench
Reviewers: dhruba, heyongqiang
CC: zshao, leveldb
Differential Revision: https://reviews.facebook.net/D9705
Summary:
The events that trigger compaction:
* opening the database
* Get -> only if seek compaction is not disabled and other checks are true
* MakeRoomForWrite -> when memtable is full
* BackgroundCall ->
If the background thread is about to do a compaction run, it schedules
a new background task to trigger a possible compaction. This will cause
additional background threads to find and process other compactions that
can run concurrently.
Test Plan: ran db_bench with overwrite and readonly alternatively.
Reviewers: sheki, MarkCallaghan
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9579
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.
The default setting remains the same (and is backward compatible):
1. use bufferedio
2. do not use mmaps for reads
3. use mmap for writes
4. use readaheads for reads needed for compaction
I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.
Test Plan: make check
Reviewers: sheki, heyongqiang, MarkCallaghan
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9429
Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base
Test Plan: make
Reviewers: chip, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9531
Summary:
Add --benchmarks=levelstats option to report per-level stats (#files, #bytes)
Change readwhilewriting test to report response time for writes but exclude
them from the stats merged by all threads.
Prevent "NaN" in stats output by preventing division by 0.
Remove "o" file I committed by mistake.
Test Plan:
make check
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D9513
Summary:
Rocksdb can create 0 sized log files when it is opened and closed without any operations.
The GetUpdatesSince fails currently if there is a log file of size zero.
This diff fixes this. If a log file has size 0, it is removed from the probable_file_list.
Test Plan: unit test
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9507
Summary:
If there is an error while writing an edit to the manifest file, the manifest
file is closed and reopened to check if the edit made it in. However, if the
re-opening of the manifest is unsuccessful and options.paranoid_checks is set
to true, then the db refuses to accept new puts, effectively putting the db
in readonly mode.
In a future diff, I would like to change the default value of paranoid_check
to true.
Test Plan: make check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9201
Summary:
Store the last flushed sequence number in db_impl. Check against it in
the transaction log iterator. Do not attempt to read ahead if we do not know
whether the data is flushed completely.
Does not work if flush is disabled. Any ideas on fixing that?
* Minor change: iter->Next is called automatically the first time.
Test Plan:
existing test pass.
More ideas on testing this?
Planning to run some stress test.
Reviewers: dhruba, heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9087
Summary:
The compaction process zeros out sequence numbers if the output is
part of the bottommost level.
The Slice is supposed to refer to an immutable data buffer. The
merger that implements the priority queue while reading kvs as
the input of a compaction run relies on this fact. The bug was that we
were updating the sequence number of a record in-place and that was
causing succeeding invocations of the merger to return kvs in
arbitrary order of sequence numbers.
The fix is to copy the key to a local memory buffer before setting
its seqno to 0.
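A minimal sketch of the copy-then-modify idea. It assumes the LevelDB internal-key layout of user key bytes followed by an 8-byte little-endian trailer holding (sequence << 8) | type; the helper names are made up for this example and are not the functions changed by the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed layout: user_key bytes, then 8 little-endian bytes holding
// (sequence << 8) | type. All helpers require ikey.size() >= 8.
inline std::string MakeInternalKey(const std::string& user_key, uint64_t seq,
                                   uint8_t type) {
  std::string k = user_key;
  const uint64_t packed = (seq << 8) | type;
  for (int i = 0; i < 8; ++i) {
    k.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
  }
  return k;
}

inline uint64_t DecodeTrailer(const std::string& ikey) {
  const std::size_t base = ikey.size() - 8;
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i) {
    v = (v << 8) | static_cast<unsigned char>(ikey[base + i]);
  }
  return v;
}

// Returns a private copy whose sequence number is 0. The input buffer,
// which the merging iterator may still be comparing against, is untouched.
inline std::string CopyWithZeroSeq(const std::string& ikey) {
  std::string copy = ikey;
  const std::size_t base = copy.size() - 8;
  // Byte 0 of the trailer is the type; bytes 1..7 hold the sequence number.
  for (std::size_t i = 1; i < 8; ++i) copy[base + i] = '\0';
  return copy;
}
```

Writing the zeroed key into a separate buffer keeps the ordering seen by the priority queue stable for the rest of the compaction run.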
Test Plan:
Set Options.purge_redundant_kvs_while_flush = false and then run
db_stress --ops_per_thread=1000 --max_key=320
Reviewers: emayanke, sheki
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9147
Summary:
This adds the rate_delay_limit_milliseconds option to make the delay
configurable in MakeRoomForWrite when the max compaction score is too high.
This delay is called the Ln slowdown. This change also counts the Ln slowdown
per level to make it possible to see where the stalls occur.
From IO-bound performance testing, the Level N stalls occur:
* with compression -> at the largest uncompressed level. This makes sense
because compaction for compressed levels is much
slower. When Lx is uncompressed and Lx+1 is compressed
then files pile up at Lx because the (Lx,Lx+1)->Lx+1
compaction process is the first to be slowed by
compression.
* without compression -> at level 1
Task ID: #1832108
Test Plan:
run with real data, added test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D9045
Summary:
Rocks accumulates recent writes and deletes in the in-memory memtable.
When the memtable is full, it writes the contents of the memtable to
a file in L0.
This patch removes redundant records at the time of the flush. If there
are multiple versions of the same key in the memtable, then only the
most recent one is dumped into the output file. The purging of
redundant records occurs only if the most recent snapshot is earlier
than the earliest record in the memtable.
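A simplified sketch of the purge rule, not the real flush path. `Entry` and `FlushEntries` are hypothetical, the input is assumed to be sorted by key ascending and then by sequence number descending, and treating "no snapshot at all" as safe to purge is an assumption of this example.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

struct Entry {
  std::string key;
  uint64_t seq;
  std::string value;
};

// `sorted` is ordered by key ascending, then seq descending (newest first).
// Older versions of a key are dropped only when no snapshot can see them,
// i.e. the newest snapshot is older than every record in the memtable.
inline std::vector<Entry> FlushEntries(const std::vector<Entry>& sorted,
                                       bool has_snapshot,
                                       uint64_t newest_snapshot_seq) {
  uint64_t earliest = std::numeric_limits<uint64_t>::max();
  for (const Entry& e : sorted) earliest = std::min(earliest, e.seq);

  // Assumption: with no snapshot at all, purging is also safe.
  const bool purge = !has_snapshot || newest_snapshot_seq < earliest;

  std::vector<Entry> out;
  for (const Entry& e : sorted) {
    if (purge && !out.empty() && out.back().key == e.key) continue;
    out.push_back(e);
  }
  return out;
}
```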
Should we switch on this feature by default or should we keep this feature
turned off in the default settings?
Test Plan: Added test case to db_test.cc
Reviewers: sheki, vamsi, emayanke, heyongqiang
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8991
Summary:
There was an artificial limit of 50K files per database. This is
insufficient if the database is 1 TB in size and each file is 2 MB.
Test Plan: make check
Reviewers: sheki, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8919
Summary: just record time consumed in compaction
Test Plan: compile
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8781
Summary:
* Counters for bytes read and written.
As a part of this diff, I also want to:
* Measure compaction times. @dhruba, can you point out which function
  I should time to get compaction times? I was looking at CompactRange.
Test Plan: db_test
Reviewers: dhruba, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8763
Summary:
* Introduce a histogram in statistics.h
* stop watch to measure time.
* introduce two timers as a poc.
Replaced NULL with nullptr to fight some lint errors
Should be useful for google.
Test Plan:
ran db_bench and check stats.
make all check
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8637
Summary:
The sequence numbers in each record eat up plenty of space on storage.
The optimization zeroes out sequence numbers on kvs in the Lmax
layer that are earlier than the earliest snapshot.
Test Plan: Unit test attached.
Differential Revision: https://reviews.facebook.net/D8619
Summary:
flush_on_destroy has a default value of false, and the memtable is flushed
in the DBImpl destructor only when it is set to true. Because we want the memtable to be flushed every time
the destructor is called (i.e. the db is closed), and the cases where we work with the memtable only are very few,
it is a good idea to give this a default value of true. Thus a put from ldb
will have its data flushed to disk in the destructor, and the next Get will be able to
read it when the db is opened with OpenForReadOnly. The reason ldb could read the latest value when
the db was opened in the normal Open mode is that a Get from normal Open first reads
the memtable and directly finds the latest value written there, whereas a Get from OpenForReadOnly
doesn't have access to the memtable (which is correct because all its Put/Modify operations are disabled).
Test Plan: make all; ldb put and get and scans
Reviewers: dhruba, heyongqiang, sheki
Reviewed By: heyongqiang
CC: kosievdmerwe, zshao, dilipj, kailiu
Differential Revision: https://reviews.facebook.net/D8631
Summary:
* Add a SplitByTTLLogger to enable this feature. In this diff I implemented a generalized AutoSplitLoggerBase class to simplify the
development of such classes.
* Refactor the existing AutoSplitLogger and fix several bugs.
Test Plan:
* Added unit tests for the different types of "auto-splittable" loggers individually.
* Tested the composite logger, which allows the log files to be split by both TTL and log size.
Reviewers: heyongqiang, dhruba
Reviewed By: heyongqiang
CC: zshao, leveldb
Differential Revision: https://reviews.facebook.net/D8037
Summary:
Previously, if you opened a db with num_levels set lower than
the database, you received the unhelpful message "Corruption:
VersionEdit: new-file entry." Now you get a more verbose message
describing the issue.
Also, fix handling of compression_levels (both the run-over-the-end
issue and the memory management of it).
Lastly, unique_ptr'ify a couple of minor calls.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8151
Summary:
We continually rebuilt build_version.c because we put the
current date into it, but that's what __DATE__ already is. This makes
builds faster.
This also fixes an issue with 'make clean FOO' not working properly.
Also tweak the build rules to be more consistent, always have warnings,
and add a 'make release' rule to handle flags for release builds.
Test Plan: make, make clean
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D8139
Summary:
On some filesystems, pre-allocation can be a considerable
amount of space. xfs in our production environment pre-allocates by
1GB, for instance. By using fallocate to inform the kernel of our
expected file sizes, we eliminate this wastage (that isn't recovered
until the file is closed which, in the case of LOG files, can be a
considerable amount of time).
Test Plan:
created an xfs loopback filesystem, mounted with
allocsize=4M, and ran db_stress. LOG file without this change was 4M,
and with it it was 128k then grew to normal size.
Reviewers: dhruba
Reviewed By: dhruba
CC: adsharma, leveldb
Differential Revision: https://reviews.facebook.net/D7953
Summary:
Replace manual memory management with std::unique_ptr in a
number of places; not exhaustive, but this fixes a few leaks with file
handles as well as clarifies semantics of the ownership of file handles
with log classes.
Test Plan: db_stress, make check
Reviewers: dhruba
Reviewed By: dhruba
CC: zshao, leveldb, heyongqiang
Differential Revision: https://reviews.facebook.net/D8043
Summary:
Found issues with `db_test` and `db_stress` when running valgrind.
`DBImpl` had an issue where, if a compaction failed, the uninitialized file size of an output file was used. This manifested as the final call to output to the log in `DoCompactionWork()` branching on uninitialized memory (all the way down in printf's innards).
Test Plan:
Ran `valgrind --track-origins=yes ./db_test` and `valgrind ./db_stress` to see if issues disappeared.
Ran `make check` to see if there were no regressions.
Reviewers: vamsi, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8001
Summary:
Check in LogAndApply if the file size is more than the limit set in
Options.
Things to consider : will this be expensive?
Test Plan: make all check. Inputs on a new unit test?
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7701
Summary:
Specific changes:
1) Turn on -Werror so all warnings are errors
2) Fix some warnings the above now complains about
3) Add proper dependency support so changing a .h file forces a .c file
to rebuild
4) Automatically use fbcode gcc on any internal machine rather than
whatever system compiler is laying around
5) Fix jemalloc to once again be used in the builds (seemed like it
wasn't being?)
6) Fix issue where 'git' would fail in build_detect_version because of
LD_LIBRARY_PATH being set in the third-party build system
Test Plan:
make, make check, make clean, touch a header file, make sure
rebuild is expected
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D7887
Summary: Found some issues running Valgrind on `db_test` (there are still some outstanding ones) and fixed them.
Test Plan:
make check
ran `valgrind ./db_test` and saw that errors no longer occur
Reviewers: dhruba, vamsi, emayanke, sheki
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7803
Summary:
Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes an error.
Fixed CreateDirIfMissing() and added Env.DirExists()
Test Plan:
make check to test for regressions
Ran the following to test if the error message is not about lock files not existing
./db_bench --db=dir/testdb
After creating a file "testdb", ran the following to see if it failed with sane error message:
./db_bench --db=testdb
Reviewers: dhruba, emayanke, vamsi, sheki
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7707
Summary:
Leveldb has an api OpenForReadOnly() that opens the database
in readonly mode. This call had an option to not process the
transaction log. This patch removes this option and always
processes all transactions that had been committed. It has
been done in such a way that it does not create/write to
any new files in the process. The invariant of "no-writes"
to the leveldb data directory is still true.
This enhancement allows multiple threads to open the same database
in readonly mode and access all transactions that were committed right
up to the OpenForReadOnly call.
I changed the public API to match the new semantics because
there are no users who are currently using this api.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7479
Summary:
1. The OpenForReadOnly() call should not lock the db. This is useful
so that multiple processes can open the same database concurrently
for reading.
2. GetUpdatesSince should not error out if the archive directory
does not exist.
3. A new constructor for WriteBatch that takes a serialized
string as a parameter.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7449
Summary:
Added kMetaDatabase for meta-databases in db/filename.h along with supporting
functions.
Fixed switch in DBImpl so that it also handles kMetaDatabase.
Fixed DestroyDB() so that it can handle destroying meta-databases.
Test Plan: make check
Reviewers: sheki, emayanke, vamsi, dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D7245
Summary:
C tests would fail sometimes as DestroyDB would return a Failure Status
message when deleting an archival directory which was not created
(WAL_ttl_seconds = 0).
Fix: Ignore the Status returned on Deleting Archival Directory.
Test Plan: * make check
Reviewers: dhruba, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7395
Summary:
* Fixed implementation bug in Binary_Search introduced in https://reviews.facebook.net/D7119
* Binary search is also overflow safe.
* Delete archive log files and archive dir during DestroyDB
Test Plan: make check
Reviewers: dhruba
CC: kosievdmerwe, emayanke
Differential Revision: https://reviews.facebook.net/D7263
Summary:
Implement an interface to retrieve the most current transaction
id from the database.
Test Plan: Added unit test.
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7269
Summary:
filename.h has functions to do similar things.
Moving code away from db_impl.cc
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D7251
Summary:
How it works:
* GetUpdatesSince takes a SequenceNumber.
* The log file whose first SequenceNumber is nearest to, and less than, the requested SequenceNumber is found.
* Seek in the log file until the requested SequenceNumber is found.
* Return an iterator which contains the logic to return records one by one.
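The file lookup step can be sketched as a binary search over the first sequence number of each log file. `FindStartingLogFile` is a hypothetical helper, not the function in db_impl.cc, and it treats a file whose first sequence number equals the request as a match, which is an assumption of this example.

```cpp
#include <cstdint>
#include <vector>

// `first_seqs` holds the first sequence number of each log file, sorted
// ascending. Returns the index of the last file whose first sequence number
// is <= target, or -1 if every file starts after target.
inline int FindStartingLogFile(const std::vector<uint64_t>& first_seqs,
                               uint64_t target) {
  int lo = 0;
  int hi = static_cast<int>(first_seqs.size()) - 1;
  int answer = -1;
  while (lo <= hi) {
    const int mid = lo + (hi - lo) / 2;  // cannot overflow, unlike lo + hi
    if (first_seqs[mid] <= target) {
      answer = mid;  // candidate; look for a later file that still qualifies
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return answer;
}
```

The caller would then scan forward inside the chosen file until it reaches the requested sequence number.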
Test Plan:
* Test case included to check the good code path.
* Will update with more test-cases.
* Feedback required on test-cases.
Reviewers: dhruba, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7119
Summary:
A compaction is picked based on its score. It is useful to
print the compaction score in the LOG because it aids in
debugging. If one looks at the logs, one can find out why
a compaction was preferred over another.
Test Plan: make clean check
Differential Revision: https://reviews.facebook.net/D7137
Summary:
Create a directory "archive" in the DB directory.
During DeleteObsoleteFiles move the WAL files (*.log) to the Archive directory,
instead of deleting.
Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move.
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6975
Summary:
Scripted and removed all trailing spaces and converted all tabs to
spaces.
Also fixed other lint errors.
All lint errors from this point of time should be taken seriously.
Test Plan: make all check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7059
Summary:
LevelDB should delete almost-new keys when a long-open snapshot exists.
The previous behavior is to keep all versions that were created after the
oldest open snapshot. This can lead to database size bloat for
high-update workloads when there are long-open snapshots and long-open
snapshot will be used for logical backup. By "almost new" I mean that the
key was updated more than once after the oldest snapshot.
If there were two snapshots with seq numbers s1 and s2 (s1 < s2), and if
we find two instances of the same key k1 that lie entirely within s1 and
s2 (i.e. s1 < k1 < s2), then the earlier version
of k1 can be safely deleted because that version is not visible in any snapshot.
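One way to picture the rule, as a sketch rather than the compaction code itself: map every version of a key to the earliest snapshot that can see it, and keep only the newest version per snapshot. `EarliestVisibleSnapshot` and `VersionsToKeep` are names invented for this example.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// `snapshots` is sorted ascending. Returns the smallest snapshot sequence
// number >= seq, or the maximum uint64_t value when only current readers
// (no snapshot) can see this version.
inline uint64_t EarliestVisibleSnapshot(uint64_t seq,
                                        const std::vector<uint64_t>& snapshots) {
  const auto it = std::lower_bound(snapshots.begin(), snapshots.end(), seq);
  return it == snapshots.end() ? std::numeric_limits<uint64_t>::max() : *it;
}

// `seqs_newest_first` holds the sequence numbers of all versions of one key
// in descending order. A version is dropped when a newer version of the same
// key maps to the same earliest snapshot, because no snapshot can see it.
inline std::vector<uint64_t> VersionsToKeep(
    const std::vector<uint64_t>& seqs_newest_first,
    const std::vector<uint64_t>& snapshots) {
  std::vector<uint64_t> keep;
  bool have_prev = false;
  uint64_t prev_bucket = 0;
  for (uint64_t seq : seqs_newest_first) {
    const uint64_t bucket = EarliestVisibleSnapshot(seq, snapshots);
    if (have_prev && bucket == prev_bucket) continue;  // hidden by newer one
    keep.push_back(seq);
    prev_bucket = bucket;
    have_prev = true;
  }
  return keep;
}
```

With snapshots at 10 and 20 and versions at 25, 18, 15, 12 and 5, the versions at 15 and 12 both lie between the two snapshots and are shadowed by 18, so only 25, 18 and 5 survive.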
Test Plan:
unit test attached
make clean check
Differential Revision: https://reviews.facebook.net/D6999
Summary:
Print out status at the end of a compaction run. This helps in
debugging.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D7035
Summary:
This option is needed for fast bulk uploads. The goal is to load
all the data into files in L0 without any interference from
background compactions.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6849
Summary:
dbstress has an option to reopen the database. Make it such that the
previous handle is not closed before we reopen, this simulates a
situation similar to a process crash.
Added new api to DBImpl to remove the lock file.
Test Plan: run db_stress
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6777
Summary:
There are applications that operate on multiple leveldb instances.
These applications will like to pass in an opaque type for each
leveldb instance and this type should be passed back to the application
with every invocation of the CompactionFilter api.
Test Plan: Enhanced unit test for opaque parameter to CompactionFilter.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, sheki, emayanke
Differential Revision: https://reviews.facebook.net/D6711
Summary: Record BloomFilter hits and drop off reasons during compaction.
Test Plan: Unit tests work.
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6591
Summary:
Disable size compaction in ldb reduce_levels; this avoids any compactions other than the manual compaction.
Added --compression=none|snappy|zlib|bzip2 and --file_size= per-file size to the ldb reduce_levels command.
Test Plan: run ldb
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: sheki, emayanke
Differential Revision: https://reviews.facebook.net/D6597
Summary:
When a new version is created, we sort all the files at every
level based on their size. This is necessary because we want
to compact the largest file first. The sorting takes quite a
bit of CPU.
Moved the sorting code to be outside the mutex. Also, the
earlier code was sorting files at all levels but we do not
need to sort the highest-number level because those files
are never the cause of any compaction. To reduce sorting
costs, we sort only the first few files in each level
because it is likely that those are the only files in that
level that will be picked for compaction.
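A sketch of "sort only the first few files" using std::partial_sort. `FileMeta` and `SortLargestFirst` are hypothetical names, and how many files to order is left to the caller.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FileMeta {
  uint64_t number;
  uint64_t size;
};

// Moves the `top_n` largest files to the front in descending size order.
// The order of the remaining files is unspecified, which is what saves CPU
// compared with a full sort.
inline void SortLargestFirst(std::vector<FileMeta>* files, std::size_t top_n) {
  const std::size_t n = std::min(top_n, files->size());
  std::partial_sort(
      files->begin(), files->begin() + static_cast<std::ptrdiff_t>(n),
      files->end(),
      [](const FileMeta& a, const FileMeta& b) { return a.size > b.size; });
}
```

For a level with N files and a small fixed top_n, this costs roughly O(N log top_n) comparisons instead of O(N log N).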
At steady state, I have seen that this patch increases
throughput from 1500 writes/sec to 1700 writes/sec at the
end of a 72 hour run. The cpu saving by not sorting the
last level was not distinctive in this test run because
there were only 100K files in the highest numbered level.
I expect the cpu saving to be significant when the number of
files is much higher.
This is mostly an early preview and not ready for rigorous review.
With this patch, the writes/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs.
Test Plan: make check
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D6411
Summary:
Added a conditional flush in ~DBImpl.
There is still a chance of writes not being persisted if there is a
crash (not a clean shutdown) before the DBImpl instance is destroyed.
Test Plan: modified db_test to meet the new expectations.
Reviewers: dhruba, heyongqiang
Differential Revision: https://reviews.facebook.net/D6519
Summary:
The default compilation process now uses "-Wall" to compile.
Fix all compilation errors generated by gcc.
Test Plan: make all check
Reviewers: heyongqiang, emayanke, sheki
Reviewed By: heyongqiang
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6525
Summary:
There are certain use-cases where the application intends to
delete older keys after a certain time period has elapsed.
One option for those applications is to periodically scan the
entire database and delete appropriate keys.
A better way is to allow the application to hook into the
compaction process. This patch allows the application to set
a method callback for every key that is being compacted. If
this method returns true, then the key is not preserved in
the output of the compaction.
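A rough sketch of the callback contract described above (return true to drop the key). The DropKeyFn type, CompactWithFilter and MakeExpiryFilter are hypothetical names for illustration only and are not the API proposed in this patch. The example also assumes each value stores its expiry time in seconds as decimal text.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical callback type: return true to drop the key from the output.
using DropKeyFn =
    std::function<bool(const std::string& key, const std::string& value)>;

// Toy "compaction": copies every entry for which the callback returns false.
std::map<std::string, std::string> CompactWithFilter(
    const std::map<std::string, std::string>& input,
    const DropKeyFn& should_drop) {
  std::map<std::string, std::string> output;
  for (const auto& kv : input) {
    if (should_drop && should_drop(kv.first, kv.second)) continue;
    output.insert(kv);
  }
  return output;
}

// Assumption: the value is an expiry timestamp in seconds, as decimal text.
DropKeyFn MakeExpiryFilter(uint64_t now_sec) {
  return [now_sec](const std::string&, const std::string& value) {
    return std::stoull(value) <= now_sec;
  };
}
```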
Test Plan:
This is mostly to preview the proposed new public api.
Since it is a public api, please do due diligence on reviewing it.
I will be writing test cases for this api in my next version of
this patch.
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
CC: sheki, adsharma
Differential Revision: https://reviews.facebook.net/D6285
Summary:
as subject
Test Plan:
run db_bench and db_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6111
Summary:
This makes the stall timers in MakeRoomForWrite more accurate by timing
the sleeps. From looking at the logs the real sleep times are usually
about 2000 usecs each when SleepForMicros(1000) is called. The modified LOG messages are:
2012/10/29-12:06:33.271984 2b3cc872f700 delaying write 13 usecs for level0_slowdown_writes_trigger
2012/10/29-12:06:34.688939 2b3cc872f700 delaying write 1728 usecs for rate limits with max score 3.83
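Timing a sleep instead of trusting the requested duration can be sketched as follows. This uses std::chrono and std::this_thread rather than the Env interface, and TimedSleepMicros is a hypothetical helper name.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Sleeps for roughly `requested_micros` and returns the elapsed time that
// was actually measured, which may be noticeably longer than requested.
uint64_t TimedSleepMicros(int64_t requested_micros) {
  const auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(std::chrono::microseconds(requested_micros));
  const auto end = std::chrono::steady_clock::now();
  return static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::microseconds>(end - start)
          .count());
}
```

A stall counter would then accumulate the returned value rather than the requested 1000 usecs.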
Test Plan:
run db_bench, look at DB/LOG
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6297
Summary:
The leveldb API is enhanced to support different compression algorithms at
different levels.
This adds the option min_level_to_compress to db_bench that specifies
the minimum level for which compression should be done when
compression is enabled. This can be used to disable compression for levels
0 and 1 which are likely to suffer from stalls because of the CPU load
for memtable flushes and (L0,L1) compaction. Level 0 is special as it
gets frequent memtable flushes. Level 1 is special as it frequently
gets all:all file compactions between it and level 0. But all other levels
could be the same. For any level N where N > 1, the rate of sequential
IO for that level should be the same. The last level is the
exception because it might not be full and because files from it are
not read to compact with the next larger level.
The same amount of time will be spent doing compaction at any
level N excluding N=0, 1 or the last level. By this standard all
of those levels should use the same compression. The difference is that
the loss (using more disk space) from a faster compression algorithm
is less significant for N=2 than for N=3. So we might be willing to
trade disk space for faster write rates with no compression
for L0 and L1, snappy for L2, zlib for L3. Using a faster compression
algorithm for the mid levels also lets us reclaim some CPU
without giving up much disk space.
Also note that little is to be gained by compressing levels 0 and 1. For
a 4-level tree they account for 10% of the data. For a 5-level tree they
account for 1% of the data.
With compression enabled:
* memtable flush rate is ~18MB/second
* (L0,L1) compaction rate is ~30MB/second
With compression enabled but min_level_to_compress=2
* memtable flush rate is ~320MB/second
* (L0,L1) compaction rate is ~560MB/second
This practically takes the same code from https://reviews.facebook.net/D6225
but makes the leveldb api more general purpose with a few additional
lines of code.
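As a sketch of how min_level_to_compress could map onto a per-level setting: the Compression enum and CompressionPerLevel function below are hypothetical and only illustrate the rule "no compression below the minimum level, the chosen type at or above it".

```cpp
#include <vector>

// Hypothetical enum for illustration; not the library's own type.
enum class Compression { kNone, kSnappy, kZlib };

// Builds one entry per level: levels below `min_level_to_compress` are left
// uncompressed, all other levels use `type`.
std::vector<Compression> CompressionPerLevel(int num_levels,
                                             int min_level_to_compress,
                                             Compression type) {
  std::vector<Compression> per_level;
  for (int level = 0; level < num_levels; ++level) {
    per_level.push_back(level >= min_level_to_compress ? type
                                                       : Compression::kNone);
  }
  return per_level;
}
```

With min_level_to_compress=2, levels 0 and 1 stay uncompressed, which is the configuration measured above.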
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6261
Summary:
Adds the "MB/sec in" and "MB/sec out" to this line:
Amplification: 1.7 rate, 0.01 GB in, 0.02 GB out, 8.24 MB/sec in, 13.75 MB/sec out
Changes all values to be reported per interval and since test start for this line:
... thread 0: (10000,60000) ops and (19155.6,27307.5) ops/second in (0.522041,2.197198) seconds
Test Plan:
run db_bench
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6291
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.
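The rate-limiting loop can be sketched as below. The score source is passed in as a callback because the name of the new method is not given here; WaitForScoreBelow and its parameters are hypothetical.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Sleeps in steps of `pause` until get_score() reports a value below
// `limit`. Returns how many times the thread slept.
int WaitForScoreBelow(const std::function<double()>& get_score, double limit,
                      std::chrono::microseconds pause) {
  int sleeps = 0;
  while (get_score() >= limit) {
    std::this_thread::sleep_for(pause);
    ++sleeps;
  }
  return sleeps;
}
```

A benchmark thread would call this at the end of each stats interval with the value given to --rate_limit as `limit`.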
Test Plan:
run db_bench
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6243
Summary: Enable LevelDb to create a new log file if current log file is too large.
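A minimal sketch of size-triggered log rolling, assuming a byte limit per file. LogRoller and its members are hypothetical and only track counters; the real change operates on the info LOG file.

```cpp
#include <cstddef>

// Tracks how many bytes went into the current log file and decides when to
// start a new one. File handling itself is left out of this sketch.
class LogRoller {
 public:
  explicit LogRoller(std::size_t max_bytes) : max_bytes_(max_bytes) {}

  // Records a write of `n` bytes. Returns true if a new log file was
  // started before the write because the current one would grow too large.
  bool Append(std::size_t n) {
    bool rolled = false;
    if (bytes_in_current_ > 0 && bytes_in_current_ + n > max_bytes_) {
      ++file_index_;
      bytes_in_current_ = 0;
      rolled = true;
    }
    bytes_in_current_ += n;
    return rolled;
  }

  int file_index() const { return file_index_; }

 private:
  std::size_t max_bytes_;
  std::size_t bytes_in_current_ = 0;
  int file_index_ = 0;
};
```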
Test Plan:
Write a script and manually check the generated info LOG.
Task ID: 1803577
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: zshao
Differential Revision: https://reviews.facebook.net/D6003
Summary:
I used server uptime to compute per-level IO throughput rates. I
intended to use time spent doing compaction at that level. This fixes that.
Test Plan:
run db_bench, look at results
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6237
Summary:
This adds more statistics to be reported by GetProperty("leveldb.stats").
The new stats include time spent waiting on stalls in MakeRoomForWrite.
This also includes the total amplification rate where that is:
(#bytes of sequential IO during compaction) / (#bytes from Put)
This also includes a lot more data for the per-level compaction report.
* Rn(MB) - MB read from level N during compaction between levels N and N+1
* Rnp1(MB) - MB read from level N+1 during compaction between levels N and N+1
* Wnew(MB) - new data written to the level during compaction
* Amplify - ( Write(MB) + Rnp1(MB) ) / Rn(MB)
* Rn - files read from level N during compaction between levels N and N+1
* Rnp1 - files read from level N+1 during compaction between levels N and N+1
* Wnp1 - files written to level N+1 during compaction between levels N and N+1
* NewW - new files written to level N+1 during compaction
* Count - number of compactions done for this level
This is the new output from DB::GetProperty("leveldb.stats"). The old output stopped at Write(MB).
Compactions
Level  Files  Size(MB)  Time(sec)  Read(MB)  Write(MB)  Rn(MB)  Rnp1(MB)  Wnew(MB)  Amplify  Read(MB/s)  Write(MB/s)   Rn  Rnp1  Wnp1  NewW  Count
--------------------------------------------------------------------------------------------------------------------------------------------------
    0      3         6         33         0        576       0         0       576     -1.0         0.0          1.3    0     0     0     0    290
    1    127       242        351      5316       5314     570      4747       567     17.0        12.1         12.1  287  2399  2685   286     32
    2    161       328         54       822        824     326       496       328      4.0         1.9          1.9  160   251   411   160    161
Amplification: 22.3 rate, 0.56 GB in, 12.55 GB out
Uptime(secs): 439.8
Stalls(secs): 206.938 level0_slowdown, 0.000 level0_numfiles, 24.129 memtable_compaction
Test Plan:
run db_bench
(cherry picked from commit ecdeead38f86cc02e754d0032600742c4f02fec8)
Reviewers: dhruba
Differential Revision: https://reviews.facebook.net/D6153
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.
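The pattern of doing slow deletions outside the lock can be sketched as follows. ObsoleteFileTracker is a hypothetical class; the actual deletion is passed in as a callback so the sketch does not touch the file system.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class ObsoleteFileTracker {
 public:
  void MarkObsolete(std::string name) {
    std::lock_guard<std::mutex> lock(mu_);
    obsolete_.push_back(std::move(name));
  }

  // Takes the pending list while holding the lock, then runs the slow
  // deletions after releasing it. Returns the number of files handed to
  // `delete_file`.
  std::size_t PurgeObsolete(
      const std::function<void(const std::string&)>& delete_file) {
    std::vector<std::string> to_delete;
    {
      std::lock_guard<std::mutex> lock(mu_);
      to_delete.swap(obsolete_);
    }
    for (const auto& name : to_delete) delete_file(name);
    return to_delete.size();
  }

 private:
  std::mutex mu_;
  std::vector<std::string> obsolete_;
};
```

Other threads can keep calling MarkObsolete while the deletions run, since the lock is held only for the swap.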
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6123
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.
published in https://reviews.facebook.net/D5997.
Summary:
This patch allows compaction to occur in multiple background threads
concurrently.
If a manual compaction is issued, the system falls back to a
single-compaction-thread model. This is done to ensure correctness
and simplicity of code. When the manual compaction is finished,
the system resumes its concurrent-compaction mode automatically.
The updates to the manifest are done via a group-commit approach.
Test Plan: run db_bench
Summary:
The method DeleteObsoleteFiles is very costly, especially
when the number of files in a system is large. It makes a list of
all live files and then scans the directory to compute the diff.
By default, this method is executed after every compaction run.
This patch ensures that DeleteObsoleteFiles is never
invoked twice within a configured period.
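One way to enforce such a minimum period is a small time gate. PeriodicGate is a hypothetical helper, and the sketch assumes the caller supplies timestamps from a monotonic clock.

```cpp
#include <cstdint>

// Allows an expensive operation to run at most once per period.
class PeriodicGate {
 public:
  explicit PeriodicGate(uint64_t period_micros)
      : period_micros_(period_micros) {}

  // Returns true if the caller should run the expensive work now, and
  // records `now_micros` as the time of the latest run in that case.
  // Assumes `now_micros` never goes backwards.
  bool ShouldRun(uint64_t now_micros) {
    if (has_run_ && now_micros - last_run_micros_ < period_micros_) {
      return false;
    }
    has_run_ = true;
    last_run_micros_ = now_micros;
    return true;
  }

 private:
  uint64_t period_micros_;
  uint64_t last_run_micros_ = 0;
  bool has_run_ = false;
};
```

The compaction path would consult the gate after each run and skip the directory scan when it returns false.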
Test Plan: run all unit tests
Reviewers: heyongqiang, MarkCallaghan
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6045
Summary:
If ReadCompaction is switched off, then it is better to not even
submit background compaction jobs. I see about 3% increase in
read-throughput on a pure memory database.
Test Plan: run db_bench
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5673
Summary: Print out the compile version in the LOG.
Test Plan: run db_bench and verify LOG
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5529