Commit graph

3077 commits

Author SHA1 Message Date
Venkatesh Radhakrishnan f2ddb8b452 Fix for bug where GeoDB accesses key after next modification of iterator
Summary:
While running cross-functional tests for weak iterators, I
encountered a bug in GeoDB. GeoDB reads a key from the database and
tries to use it after doing a Seek. Fixing it by storing the key locally
so that it is still visible after the Seek.

Test Plan: Run geodb_test

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31599
2015-01-15 11:47:41 -08:00
Thomas Dudziak d10f1de2b4 Ported LevelDB's fault_injection_test
Summary:
This is a port of [[ https://github.com/google/leveldb/blob/master/db/fault_injection_test.cc | LevelDB's fault_injection_test ]] to RocksDB. Unfortunately it fails with:

```
==== Test FaultInjectionTest.FaultTest
db/fault_injection_test.cc:491: Corruption: no meta-nextfile entry in descriptor
#0   ./fault_injection_test() [0x41477a] rocksdb::FaultInjectionTest::PartialCompactTestReopenWithFault(rocksdb::FaultInjectionTest::ResetMethod, int, int) /data/users/tomdzk/rocksdb/db/fault_injection_test.cc:491
#1   ./fault_injection_test() [0x40a38a] rocksdb::_Test_FaultTest::_Run() /data/users/tomdzk/rocksdb/db/fault_injection_test.cc:517
#2   ./fault_injection_test() [0x415bea] rocksdb::_Test_FaultTest::_RunIt() /data/users/tomdzk/rocksdb/db/fault_injection_test.cc:507
#3   ./fault_injection_test() [0x584367] rocksdb::test::RunAllTests() /data/users/tomdzk/rocksdb/util/testharness.cc:70
#4   /usr/local/fbcode/gcc-4.8.1-glibc-2.17/lib/libc.so.6(__libc_start_main+0x10e) [0x7f7a40857efe] ??  ??:0
#5   ./fault_injection_test() [0x408bb8] _start ??:0
```

so I commented out the test invocation in the source code for now (lines 514-520) so it can be merged.

Test Plan: This is a new test.

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31587
2015-01-15 10:28:10 -08:00
Igor Canadi 2bb059007b Change db_stress to work with format_version == 2 2015-01-14 16:25:36 -08:00
Igor Canadi 9ab5adfc59 New BlockBasedTable version -- better compressed block format
Summary:
This diff adds BlockBasedTable format_version = 2. New format version brings better compressed block format for these compressions:
1) Zlib -- encode decompressed size in compressed block header
2) BZip2 -- encode decompressed size in compressed block header
3) LZ4 and LZ4HC -- instead of doing memcpy of size_t encode size as varint32. memcpy is very bad because the DB is not portable accross big/little endian machines or even platforms where size_t might be 8 or 4 bytes.

It does not affect format for snappy.

If you write a new database with format_version = 2, it will not be readable by RocksDB versions before 3.10. DB::Open() will return corruption in that case.

Test Plan:
Added a new test in db_test.
I will also run db_bench and verify VSIZE when block_cache == 1GB

Reviewers: yhchiang, rven, MarkCallaghan, dhruba, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31461
2015-01-14 16:24:24 -08:00
Yueh-Hsuan Chiang 2355931c6f Merge pull request #450 from adamretter/writebatch-with-index
Add WriteBatchWithIndex to the Java API
2015-01-14 13:40:04 -08:00
Adam Retter 3d246c89cc Abstract duplicate code on key and value slice objects into generic methods 2015-01-14 21:16:06 +00:00
Adam Retter 2d0dd8db3b Implement WBWIRocksIterator for WriteBatchWithIndex in the Java API 2015-01-14 21:16:06 +00:00
Adam Retter de678b288e Abstractions for common iterator behaviour 2015-01-14 21:16:06 +00:00
Adam Retter e01acb3a04 Test for WriteBatchWithIndex#newIterator() 2015-01-14 21:16:06 +00:00
Adam Retter 56f24941ab Simplify the Java API by permitting WriteBatchWithIndex to be provided straight to RocksDB#write 2015-01-14 21:16:06 +00:00
Adam Retter 95d5f98487 Test for RocksDB#write(WriteBatchWithIndex) 2015-01-14 21:16:06 +00:00
Adam Retter ef5b34dee0 Implement WriteBatchWithIndex in the Java API 2015-01-14 21:16:05 +00:00
Adam Retter c6e5545612 Abstractions for common write batch behaviour 2015-01-14 21:16:05 +00:00
Adam Retter be905491bf Test for WriteBatchWithIndex#newIteratorWithBase(org.rocksdb.RocksIterator) 2015-01-14 21:16:05 +00:00
Adam Retter 2241e3f4d5 Extract the interface for a RocksIterator 2015-01-14 21:16:05 +00:00
Adam Retter a8cfa7ace8 Extract the interface for a WriteBatch 2015-01-14 21:16:05 +00:00
Adam Retter 45e43b81df Adds support for db->DefaultColumnFamily() to the Java API 2015-01-14 21:16:05 +00:00
Igor Canadi 516a04267e Add LZ4 compression to sanity test
Summary: This will be used to test format changes in https://reviews.facebook.net/D31461

Test Plan: run it

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31515
2015-01-14 09:45:29 -08:00
Igor Canadi 2ccc54301d Merge pull request #460 from neutronsharc/master
Remove duplicated method declarations in C header.
2015-01-13 18:19:26 -08:00
Xiangyong Ouyang 2a7bd0ea45 Remove duplicated method declarations in C header. 2015-01-14 01:20:30 +00:00
sdong bb128bfec3 More accurate message for compaction applied to a different version
Test Plan: Compile. Run it.

Reviewers: yhchiang, dhruba, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D31479
2015-01-13 16:48:18 -08:00
Igor Canadi 96b8240bc5 Support footer versions bigger than 1
Summary:
In this diff I add another parameter to BlockBasedTableOptions that will let users specify block based table's format. This will greatly simplify block based table's format changes in the future.

First format change that this will support is encoding decompressed size in Zlib and BZip2 blocks. This diff is blocking https://reviews.facebook.net/D31311.

Test Plan: Added a unit tests. More tests to come as part of https://reviews.facebook.net/D31311.

Reviewers: dhruba, MarkCallaghan, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31383
2015-01-13 14:33:04 -08:00
Igor Canadi 53f615df6a Fix clang build 2015-01-13 12:27:28 -08:00
Igor Canadi 02b30202cd Merge pull request #455 from Andersbakken/stdlib_fix
Build with clang 3.5 on Linux.
2015-01-13 09:08:32 -08:00
Yueh-Hsuan Chiang 2159484dd6 Remove two unnecessary blank lines in db/db_test.cc 2015-01-13 01:40:11 -08:00
Yueh-Hsuan Chiang e7dd88c57d Merge pull request #441 from fyrz/RocksJava-ColumnFamilyDescriptor-Alignment
[RocksJava] ColumnFamilyDescriptor alignment with listColumnFamilies
2015-01-13 01:09:54 -08:00
Yueh-Hsuan Chiang d2c018fd5b Make ThreadStatusMultipleCompaction more robust. 2015-01-13 01:02:10 -08:00
Yueh-Hsuan Chiang bf9aa4dfcd Improve GetThreadStatus to avoid false alarm in some case. 2015-01-13 00:38:09 -08:00
Yueh-Hsuan Chiang c91cdd59c1 Allow GetThreadList() to indicate a thread is doing Compaction.
Summary: Allow GetThreadList() to indicate a thread is doing Compaction.

Test Plan:
export ROCKSDB_TESTS=ThreadStatus
./db_test

Reviewers: ljin, igor, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba, jonahcohen, rven

Differential Revision: https://reviews.facebook.net/D30105
2015-01-13 00:04:08 -08:00
Igor Canadi 402c1152a7 Fix c_simple_example 2015-01-12 18:17:30 -08:00
Anders Bakken a9ea65d652 Build with clang 3.5 on Linux. 2015-01-12 09:59:36 -08:00
fyrz 23ad5f401a [RocksJava] Incorporated changes for D30525 2015-01-10 20:54:50 +01:00
fyrz 0aab1005f8 [RocksJava] ColumnFamilyDescriptor alignment with listColumnFamilies
Summary:
Previous to this commit ColumnFamilyDescriptor took a String as name for the ColumnFamily name. String is however encoding dependent which is bad because listColumnFamilies returns byte arrays without any encoding information.

All public API call were deprecated and flagged to be removed in 3.10.0

Test Plan:
make rocksdbjava
make test
mvn -f rocksjni.pom package

Reviewers: yhchiang, adamretter, ankgup87

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D30525
2015-01-10 20:54:49 +01:00
Igor Canadi 15d2abbec3 Fix build issues 2015-01-09 13:04:06 -08:00
Igor Canadi abb9b95ffe Move compression functions from port/ to util/
Summary: We keep checksum functions in util/, there is no reason for compression to be in port/

Test Plan: compiles

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31281
2015-01-09 12:57:11 -08:00
sdong 9132e52ea4 DB Stats Dump to print total stall time
Summary:
Add printing of stall time in DB Stats:

Sample outputs:

** DB Stats **
Uptime(secs): 53.2 total, 1.7 interval
Cumulative writes: 625940 writes, 625939 keys, 625940 batches, 1.0 writes per batch, 0.49 GB user ingest, stall micros: 50691070
Cumulative WAL: 625940 writes, 625939 syncs, 1.00 writes per sync, 0.49 GB written
Interval writes: 10859 writes, 10859 keys, 10859 batches, 1.0 writes per batch, 8.7 MB user ingest, stall micros: 1692319
Interval WAL: 10859 writes, 10859 syncs, 1.00 writes per sync, 0.01 MB written

Test Plan:
make all check
verify printing using db_bench

Reviewers: igor, yhchiang, rven, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D31239
2015-01-09 11:44:19 -08:00
Igor Canadi 93b35c299b Merge pull request #453 from fyrz/SimpleCExampleSigSegv
Memory issue fix in c_simple_example
2015-01-09 09:34:08 -08:00
Robert 628a67b007 Reduce memory footprint in backupable db.
* Use emplace when possible.
* Make FileInfo shared among all BackupMeta, instead of storing filenames.
* Make checksum_value in FileInfo constant.
* Reserve space beforehand if container size is known.
* Make FileInfo and BackupMeta non-copyable and non-assignable to prevent future logic errors.
  It is very dangerous to copy BackupMeta without careful handling refcounts of FileInfo.
* Remove a copy of BackupMeta when detected corrupt backup.
2015-01-09 16:58:13 +08:00
fyrz ef39016424 Fixed memory issue in c_simple_example
Valgrind report prior to this fix:
==20829== Memcheck, a memory error detector
==20829== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==20829== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==20829== Command: ./c_simple_example
==20829==
==20829== Invalid read of size 1
==20829==    at 0x4C2F1C8: strcmp (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20829==    by 0x422522: main (in /home/user/rocksgit/transfer/rocksdb-git/examples/c_simple_example)
==20829==  Address 0x5f60df5 is 0 bytes after a block of size 5 alloc'd
==20829==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20829==    by 0x4226D5: CopyString (c.cc:498)
==20829==    by 0x423032: rocksdb_get (c.cc:730)
==20829==    by 0x4224EB: main (in /home/user/rocksgit/transfer/rocksdb-git/examples/c_simple_example)
==20829==
==20829==
==20829== HEAP SUMMARY:
==20829==     in use at exit: 77 bytes in 5 blocks
==20829==   total heap usage: 4,491 allocs, 4,486 frees, 839,216 bytes allocated
==20829==
==20829== LEAK SUMMARY:
==20829==    definitely lost: 5 bytes in 1 blocks
==20829==    indirectly lost: 0 bytes in 0 blocks
==20829==      possibly lost: 0 bytes in 0 blocks
==20829==    still reachable: 72 bytes in 4 blocks
==20829==         suppressed: 0 bytes in 0 blocks
==20829== Rerun with --leak-check=full to see details of leaked memory
==20829==
==20829== For counts of detected and suppressed errors, rerun with: -v
==20829== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
2015-01-09 09:41:36 +01:00
Igor Canadi b89d58dfa3 :%s/build_config/make_config
Summary: I'm tired of double-tab when opening build_tools/<something>. This change will make bu<tab> fully complete my path :)

Test Plan: `vi bu<tab>` gives me `vi build_tools/` yay!

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30639
2015-01-07 17:26:24 -08:00
Ameya Gupte 242b9769c3 Memtablerep Benchmark
Summary:
Create a benchmark for testing memtablereps. This diff is a bit rough, but it should do the trick until other bootcampers can clean it up.

Addressing comments
Removed the mutexes
Changed ReadWriteBenchmark to fix number of reads and count the number of writes we can perform in that time.

Test Plan:
Run it.

Below runs pass
./memtablerep_bench --benchmarks fillrandom,readrandom --memtablerep skiplist

./memtablerep_bench --benchmarks fillseq,readseq --memtablerep skiplist

./memtablerep_bench --benchmarks readwrite,seqreadwrite --memtablerep skiplist --num_operations 200 --num_threads 5

./memtablerep_bench --benchmarks fillrandom,readrandom --memtablerep hashskiplist

./memtablerep_bench --benchmarks fillseq,readseq --memtablerep hashskiplist
 --num_scans 2

./memtablerep_bench --benchmarks fillseq,readseq --memtablerep vector

Reviewers: jpaton, ikabiljo, sdong

Reviewed By: sdong

Subscribers: dhruba, ameyag

Differential Revision: https://reviews.facebook.net/D22683
2015-01-07 15:15:30 -08:00
sdong 73ee4febab Add comments about properties supported by DB::GetProperty() and DB::GetIntProperty()
Summary: Add comments in db.h to help users discover their options.

Test Plan: Compile

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31077
2015-01-07 15:09:35 -08:00
Igor Canadi 2dca48f550 Merge pull request #451 from StanislavGlebik/document_db_improvement
Fixed negative numbers comparison in DocumentDB
2015-01-07 14:11:22 -08:00
stash93 4b57d9a820 Fixed negative numbers comparison in DocumentDB 2015-01-08 01:03:51 +03:00
sdong 9ef59a09a5 VersionSet::AddLiveFiles() to assert current version is included.
Summary: Add an extra assert to make sure current version is included in VersionSet::AddLiveFiles().

Test Plan: make all check

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba, hermanlee4, leveldb

Differential Revision: https://reviews.facebook.net/D30819
2015-01-07 12:03:40 -08:00
sdong 4d16a9a633 VersionBuilder to optimize for applying a later edit deleting files added by previous edits
Summary: During recovery, VersionBuilder::Apply() was called multiple times. If the DB is open for long enough, most of files added earlier will be deleted by later deletes. In current solution, sorting added file happens first and then deletes are applied. In this patch, deletes are applied when possible inside Apply(), which can significantly reduce the sorting time in some cases.

Test Plan:
Add unit tests in version_builder
valgrind_check
Open a manifest of 50MB, with 9K live files. The manifest read time reduced from 1.6 seconds to 0.7 seconds.

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30765
2015-01-07 10:36:58 -08:00
Igor Canadi 7731d51c82 Simplify column family concurrency
Summary:
This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_.

Before:
* When mutating: lock DB mutex and spin lock
* When reading: lock DB mutex OR spin lock

After:
* When mutating: lock DB mutex and be in write thread
* When reading: lock DB mutex or be in write thread

That way, we eliminate the spin lock that protects these hash maps and  simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps).

With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread).

Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free.

This addresses task t5116919.

Test Plan:
make check

Stress test with lots and lots of column family drop and create:

   time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30651
2015-01-06 12:44:21 -08:00
Igor Canadi 07aa4e0e35 Fix compaction summary log for trivial move
Summary: When trivial move commit is done, we log the summary of the input version instead of current. This is inconsistent with other log messages and confusing.

Test Plan: compiles

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30939
2015-01-05 17:32:49 -08:00
Leonidas Galanis 9d5bd411be benchmark.sh won't run through all tests properly if one specifies wal_dir to be different than db directory.
Summary:
A command line like this to run all the tests:
source benchmark.config.sh && nohup ./benchmark.sh 'bulkload,fillseq,overwrite,filluniquerandom,readrandom,readwhilewriting'
where
benchmark.config.sh is:
export DB_DIR=/data/mysql/rocksdata
export WAL_DIR=/txlogs/rockswal
export OUTPUT_DIR=/root/rocks_benchmarking/output

Will fail for the tests that need a new DB .

Also 1) set disable_data_sync=0 and 2) add debug mode to run through all the tests more quickly

Test Plan: run ./benchmark.sh 'debug,bulkload,fillseq,overwrite,filluniquerandom,readrandom,readwhilewriting' and verify that there are no complaints about WAL dir not being empty.

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D30909
2015-01-05 15:36:47 -08:00
Igor Canadi 62ad0a9b19 Deprecating skip_log_error_on_recovery
Summary:
Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery.

The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not only just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have digged up old task t2416297 which confirms my findings.

Test Plan: There was actually no tests that verified correct behavior of skip_log_error_on_recovery.

Reviewers: yhchiang, rven, dhruba, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30603
2015-01-05 13:35:56 -08:00