Commit Graph

131 Commits

Author SHA1 Message Date
Andrew Kryczka ab35505e21 Write properties metablock last in block-based tables (#4158)
Summary:
The properties meta-block should come at the end since we always need to
read it when opening a file, unlike index/filter/other meta-blocks, which
are sometimes read depending on the user's configuration. This ordering
will allow us to (in a future PR) do a small readahead on the end of the file
to read properties and meta-index blocks with one I/O.

The bulk of this PR is a refactoring of the `BlockBasedTableBuilder::Finish`
function. It was previously too large with inconsistent error handling, which
made it difficult to change. So I broke it up into one function per meta-block
write, and tried to make error handling consistent within those functions.
Then reordering the metablocks was trivial -- just reorder the calls to these
helper functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4158

Differential Revision: D8921705

Pulled By: ajkr

fbshipit-source-id: 96c9cc3182eb1adf11af46adab79dbeba7b12fcc
2018-07-20 09:11:59 -07:00
Maysam Yabandeh 29ffbb8a50 Charging block cache more accurately (#4073)
Summary:
Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it.
Closes https://github.com/facebook/rocksdb/pull/4073

Differential Revision: D8686552

Pulled By: maysamyabandeh

fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
2018-06-29 08:57:20 -07:00
Nikhil Benesch 17339dc2f3 Add table property tracking number of range deletions (#4016)
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.

/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016

Differential Revision: D8527575

Pulled By: ajkr

fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
2018-06-26 20:27:35 -07:00
Zhongyi Xie f1592a06c2 run make format for PR 3838 (#3954)
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954

Differential Revision: D8272041

Pulled By: miasantreble

fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
2018-06-05 12:58:02 -07:00
Dmitri Smirnov f4b72d7056 Provide a way to override windows memory allocator with jemalloc for ZSTD
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.

For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.

The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838

Differential Revision: D8229794

Pulled By: miasantreble

fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
2018-06-04 12:12:48 -07:00
Maysam Yabandeh 402b7aa07f Exclude seq from index keys
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894

Differential Revision: D8118775

Pulled By: maysamyabandeh

fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
2018-05-25 18:42:43 -07:00
Zhongyi Xie c3ebc75843 Move prefix_extractor to MutableCFOptions
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601

Differential Revision: D7253114

Pulled By: miasantreble

fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
2018-05-21 14:43:11 -07:00
Andrew Kryczka 7b655214d2 Assert keys/values pinned by range deletion meta-block iterators
Summary:
`RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
Closes https://github.com/facebook/rocksdb/pull/3875

Differential Revision: D8063667

Pulled By: ajkr

fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
2018-05-21 09:57:00 -07:00
Anand Ananthabhotla f9f4d40f93 Align SST file data blocks to avoid spanning multiple pages
Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502

Differential Revision: D7400897

Pulled By: anand1976

fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
2018-03-26 20:26:10 -07:00
Huachao Huang 7a6353bd1c Ignore empty filter block when data block is empty
Summary:
Close https://github.com/facebook/rocksdb/issues/3592
Closes https://github.com/facebook/rocksdb/pull/3614

Differential Revision: D7291706

Pulled By: ajkr

fbshipit-source-id: 9dd8f40bd7716588e1e3fd6be0c2bc2766861f8c
2018-03-21 23:13:05 -07:00
Andrew Kryczka 5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Igor Sugak aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
jsteemann 4e7a182d09 Several small "fixes"
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507

Differential Revision: D7004679

Pulled By: sagar0

fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
2018-02-15 16:57:37 -08:00
Anand Ananthabhotla 199405192d Add a BlockBasedTableOption to turn off index block compression.
Summary:
Add a new bool option index_uncompressed in BlockBasedTableOptions.
Closes https://github.com/facebook/rocksdb/pull/3303

Differential Revision: D6686161

Pulled By: anand1976

fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054
2018-01-10 15:11:59 -08:00
Siying Dong 6b77c07379 NUMBER_BLOCK_COMPRESSED, etc, shouldn't be treated as timer counter
Summary:
NUMBER_BLOCK_DECOMPRESSED and NUMBER_BLOCK_COMPRESSED are not reported unless the stats level contain detailed timers, which is wrong. They are normal counters. Fix it.
Closes https://github.com/facebook/rocksdb/pull/3263

Differential Revision: D6552519

Pulled By: siying

fbshipit-source-id: 40899ccea7b2856bb39752616657c0bfd432f6f9
2017-12-14 10:27:43 -08:00
Prashant D baff91c1ad table: Fix coverity issues
Summary:
table/block.cc:
420  }

CID 1396127 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
7. uninit_member: Non-static class member restart_offset_ is not initialized in this constructor nor in any functions that it calls.
421}

table/block_based_table_builder.cc:

CID 1418259 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
7. uninit_member: Non-static class member compressed_cache_key_prefix_size is not initialized in this constructor nor in any functions that it calls.

table/block_based_table_reader.h:
   	3. uninit_member: Non-static class member index_type is not initialized in this constructor nor in any functions that it calls.

CID 1396147 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
5. uninit_member: Non-static class member hash_index_allow_collision is not initialized in this constructor nor in any functions that it calls.
413        global_seqno(kDisableGlobalSequenceNumber) {}
414

table/cuckoo_table_reader.cc:
 55  if (hash_funs == user_props.end()) {
 56    status_ = Status::Corruption("Number of hash functions not found");
   	5. uninit_member: Non-static class member is_last_level_ is not initialized in this constructor nor in any functions that it calls.
   	7. uninit_member: Non-static class member identity_as_first_hash_ is not initialized in this constructor nor in any functions that it calls.
   	9. uninit_member: Non-static class member use_module_hash_ is not initialized in this constructor nor in any functions that it calls.
   	11. uninit_member: Non-static class member num_hash_func_ is not initialized in this constructor nor in any functions that it calls.
   	13. uninit_member: Non-static class member key_length_ is not initialized in this constructor nor in any functions that it calls.
   	15. uninit_member: Non-static class member user_key_length_ is not initialized in this constructor nor in any functions that it calls.
   	17. uninit_member: Non-static class member value_length_ is not initialized in this constructor nor in any functions that it calls.
   	19. uninit_member: Non-static class member bucket_length_ is not initialized in this constructor nor in any functions that it calls.
   	21. uninit_member: Non-static class member cuckoo_block_size_ is not initialized in this constructor nor in any functions that it calls.
   	23. uninit_member: Non-static class member cuckoo_block_bytes_minus_one_ is not initialized in this constructor nor in any functions that it calls.

CID 1322785 (#2 of 2): Uninitialized scalar field (UNINIT_CTOR)
25. uninit_member: Non-static class member table_size_ is not initialized in this constructor nor in any functions that it calls.
 57    return;

table/plain_table_index.h:
   	2. uninit_member: Non-static class member index_size_ is not initialized in this constructor nor in any functions that it calls.

CID 1322801 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
4. uninit_member: Non-static class member sub_index_size_ is not initialized in this constructor nor in any functions that it calls.
128        huge_page_tlb_size_(huge_page_tlb_size) {}
129
Closes https://github.com/facebook/rocksdb/pull/3113

Differential Revision: D6505719

Pulled By: yiwu-arbug

fbshipit-source-id: 38f44d8f9dfefb4c2e25d83b8df25a5201c75618
2017-12-07 11:57:36 -08:00
Zhongyi Xie a6c6b8b38c Revert "No need for Restart Interval for meta blocks"
Summary:
See [issue 3169](https://github.com/facebook/rocksdb/issues/3169) for more information

This reverts commit 593d3de371.
Closes https://github.com/facebook/rocksdb/pull/3188

Differential Revision: D6379271

Pulled By: miasantreble

fbshipit-source-id: 88f9ed67ba52237ad9b6f7251db83672b62d7537
2017-11-20 16:42:03 -08:00
Maysam Yabandeh 30285ee31c Fix calculating filter partition target size
Summary:
block_size_deviation is in percentage while the partition size is in bytes. The current code fails to take that into account resulting into very large target size for filter partitions.
Closes https://github.com/facebook/rocksdb/pull/3187

Differential Revision: D6376069

Pulled By: maysamyabandeh

fbshipit-source-id: 276546fc68f50e0da32c462abb46f6cf676db9b2
2017-11-20 13:26:57 -08:00
Yi Wu 84a04af9a9 TableProperty::oldest_key_time defaults to 0
Summary:
We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0.

Also revert db_sst_test back to before #2842.
Closes https://github.com/facebook/rocksdb/pull/3079

Differential Revision: D6165702

Pulled By: yiwu-arbug

fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0
2017-10-27 15:00:05 -07:00
Yi Wu 66a2c44ef4 Add DB::Properties::kEstimateOldestKeyTime
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
2017-10-23 15:27:27 -07:00
Zhongyi Xie 593d3de371 No need for Restart Interval for meta blocks
Summary:
In SST files, restart interval helps us search in data blocks. However, some meta blocks will be read sequentially, so there's no need for restart points. Restart interval will introduce extra space in the block (https://github.com/facebook/rocksdb/blob/master/table/block_builder.cc#L80). We will see if we can remove this redundant space. (Maybe set restart interval to infinite.)
Closes https://github.com/facebook/rocksdb/pull/2940

Differential Revision: D5930139

Pulled By: miasantreble

fbshipit-source-id: 92b1b23c15cffa90378343ac846b713623b19c21
2017-09-29 20:26:20 -07:00
Andrew Kryczka 7fbb9eccaf support disabling checksum in block-based table
Summary:
store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length.
Closes https://github.com/facebook/rocksdb/pull/2781

Differential Revision: D5694702

Pulled By: ajkr

fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd
2017-08-23 19:40:47 -07:00
Sagar Vemuri 72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c598, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao 1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Siying Dong 3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Daniel Black 7a0b5de771 Gcc 7 ignored quantifiers
Summary:
The casting seemed to cause a problem.

I think this might increase it to unsigned long.
Closes https://github.com/facebook/rocksdb/pull/2562

Differential Revision: D5406842

Pulled By: siying

fbshipit-source-id: 736adef31448229a58a1a48bdbe77792f36736e8
2017-07-12 09:41:15 -07:00
Maysam Yabandeh f6b9d9355e Fix clang error in PartitionedFilterBlockBuilder
Summary: Closes https://github.com/facebook/rocksdb/pull/2536

Differential Revision: D5371271

Pulled By: maysamyabandeh

fbshipit-source-id: f1355ac658a79c9982a24986f0925c9e24fc39d5
2017-07-05 11:16:04 -07:00
Maysam Yabandeh 45b9bb0331 Cut filter partition based on metadata_block_size
Summary:
Currently metadata_block_size controls only index partition size. With this patch a partition is cut after any of index or filter partitions reaches metadata_block_size.
Closes https://github.com/facebook/rocksdb/pull/2452

Differential Revision: D5275651

Pulled By: maysamyabandeh

fbshipit-source-id: 5057e4424b4c8902043782e6bf8c38f0c4f25160
2017-07-02 10:42:12 -07:00
Sagar Vemuri 1cd45cd1b3 FIFO Compaction with TTL
Summary:
Introducing FIFO compactions with TTL.

FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.

To address that request:
- Added a new TTL option to FIFO compaction options.
- Updated FIFO compaction score to take TTL into consideration.
- Added a new table property, creation_time, to keep track of when the SST file is created.
- Creation_time is set as below:
  - On Flush: Set to the time of flush.
  - On Compaction: Set to the max creation_time of all the files involved in the compaction.
  - On Repair and Recovery: Set to the time of repair/recovery.
  - Old files created prior to this code change will have a creation_time of 0.
- FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
- FIFO compaction will fall back to the prior way of deleting files based on size if:
  - the creation_time of all files involved in compaction is 0.
  - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.

This feature is not supported if max_open_files != -1 or with table formats other than Block-based.

**Test Plan:**
Added tests.

**Benchmark results:**
Base: FIFO with max size: 100MB ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100

readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
```

With TTL (a low one for testing) ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20

readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
```
Example Log lines:
```
2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
...
2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
```

SST Files remaining in the dbbench dir, after db_bench execution completed:
```
svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
-rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
-rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
-rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
```
Closes https://github.com/facebook/rocksdb/pull/2480

Differential Revision: D5305116

Pulled By: sagar0

fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
2017-06-27 17:11:48 -07:00
Siying Dong 5c97a7c066 Unit Tests for sync, range sync and file close failures
Summary: Closes https://github.com/facebook/rocksdb/pull/2454

Differential Revision: D5255320

Pulled By: siying

fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913
2017-06-26 13:27:58 -07:00
Maysam Yabandeh 6f4154d693 record index partition properties
Summary:
When Partitioning index/filter is enabled the user might need to check the index block size as well as the top-level index size via sst_dump. This patch records i) number of partitions, ii) top-level index size and make it accessible through sst_dump. The number of partitions for filters is the same as that of indexes. The top-level index for filters has a similar size to top-level index for indexes, so it is not repeated.
Closes https://github.com/facebook/rocksdb/pull/2437

Differential Revision: D5224225

Pulled By: maysamyabandeh

fbshipit-source-id: 5324598c75793523aef1bb7ee225a5475e95a9cb
2017-06-13 11:21:32 -07:00
Siying Dong d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Islam AbdelRahman e19163688b Add macros to include file name and line number during Logging
Summary:
current logging
```
2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25
2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2.
2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK
2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK
2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0.
2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6
2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2017/03/14-14:20:31.
Closes https://github.com/facebook/rocksdb/pull/1990

Differential Revision: D4708695

Pulled By: IslamAbdelRahman

fbshipit-source-id: cb8968f
2017-03-15 19:39:12 -07:00
Maysam Yabandeh 54b434110e Builders for partition filter
Summary:
This is the second split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes only the builder part. The testing will be included in the third split, where the reader is also included.
Closes https://github.com/facebook/rocksdb/pull/1952

Differential Revision: D4660272

Pulled By: maysamyabandeh

fbshipit-source-id: 36b3cf0
2017-03-07 13:54:12 -08:00
Maysam Yabandeh a2f7a514d1 Refactoring
Summary:
This is the first split of https://github.com/facebook/rocksdb/pull/1891 and will be needed for the upcoming partitioned filter patch.
Closes https://github.com/facebook/rocksdb/pull/1949

Differential Revision: D4652152

Pulled By: maysamyabandeh

fbshipit-source-id: 9801778
2017-03-03 18:24:12 -08:00
Islam AbdelRahman f89b3893c0 Remove skip_table_builder_flush and default it to true
Summary:
This option is needed to be enabled for Direct IO
and I cannot think of a reason where we need to disable it

remove it and default it to true
Closes https://github.com/facebook/rocksdb/pull/1944

Differential Revision: D4641088

Pulled By: IslamAbdelRahman

fbshipit-source-id: d7085b9
2017-03-02 16:54:10 -08:00
Maysam Yabandeh 69d5262c81 Two-level Indexes
Summary:
Partition Index blocks and use a Partition-index as a 2nd level index.

The two-level index can be used by setting
BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
configuring BlockBasedTableOptions::index_per_partition

t15539501
Closes https://github.com/facebook/rocksdb/pull/1814

Differential Revision: D4473535

Pulled By: maysamyabandeh

fbshipit-source-id: bffb87e
2017-02-06 16:39:12 -08:00
Daniel Black 5334d8b44c table/block_based_table_builder.cc: intentional fallthrough - comment to match gcc pattern
Summary:
The gcc-7 code for parsing comments (libcpp/lex.c) didn't match
the intentional fallthough in this comment.

table/block_based_table_builder.cc: In member function 'void rocksdb::BlockBasedTableBuilder::WriteRawBlock(const rocksdb::Slice&, rocksdb::CompressionType, rocksdb::BlockHandle*)':
table/block_based_table_builder.cc:754:22: error: this statement may fall through [-Werror=implicit-fallthrough=]
         assert(false);
                      ^
table/block_based_table_builder.cc:756:7: note: here
       case kCRC32c: {
       ^~~~
cc1plus: all warnings being treated as errors
Closes https://github.com/facebook/rocksdb/pull/1661

Differential Revision: D4318817

Pulled By: yiwu-arbug

fbshipit-source-id: e67d171
2016-12-13 13:54:15 -08:00
Aaron Gao 59a7c0337b Change ioptions to store user_comparator, fix bug
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.

Test Plan: make all check -j64

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65121
2016-10-21 11:31:42 -07:00
Islam AbdelRahman b88f8e87c5 Support SST files with Global sequence numbers [reland]
Summary:
reland https://reviews.facebook.net/D62523

- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: sdong, yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65211
2016-10-18 16:59:37 -07:00
Islam AbdelRahman d062328977 Revert "Support SST files with Global sequence numbers"
This reverts commit ab01da5437.
2016-10-07 14:05:12 -07:00
Islam AbdelRahman ab01da5437 Support SST files with Global sequence numbers
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62523
2016-10-03 16:12:39 -07:00
Andrew Kryczka 6009c473c7 Store range tombstones in memtable
Summary:
- Store range tombstones in a separate MemTableRep instantiated with ColumnFamilyOptions::memtable_factory
- MemTable::NewRangeTombstoneIterator() returns a MemTableIterator over the separate MemTableRep
- Part of the read path is not implemented yet (i.e., MemTable::Get())

Test Plan: see unit tests

Reviewers: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62217
2016-09-30 09:06:43 -07:00
sdong 607628d349 Support ZSTD with finalized format
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.

Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63141
2016-09-06 12:22:16 -07:00
Aaron Gao c7004840d2 store prefix_extractor_name in table
Summary:
Make sure prefix extractor name is stored in SST files and if DB is opened with a prefix extractor of a different name, prefix bloom is skipped when read the file.
Also add unit tests for that.

Test Plan:
before change:
```
Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BlockBasedTableTest
[ RUN      ] BlockBasedTableTest.SkipPrefixBloomFilter
table/table_test.cc:1421: Failure
Value of: db_iter->Valid()
Actual: false
Expected: true
[  FAILED  ] BlockBasedTableTest.SkipPrefixBloomFilter (1 ms)
[----------] 1 test from BlockBasedTableTest (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] BlockBasedTableTest.SkipPrefixBloomFilter

1 FAILED TEST
```
after:
```
Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BlockBasedTableTest
[ RUN      ] BlockBasedTableTest.SkipPrefixBloomFilter
[       OK ] BlockBasedTableTest.SkipPrefixBloomFilter (0 ms)
[----------] 1 test from BlockBasedTableTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.
```

Reviewers: sdong, andrewkr, yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61215
2016-08-26 11:46:32 -07:00
Wanning Jiang 78837f5d61 TableBuilder / TableReader support for range deletion
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader

Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )

Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61473
2016-08-19 15:10:31 -07:00
omegaga d51dc96a79 Experiments on column-aware encodings
Summary:
Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format.

There is still on-going work on this diff. More refactoring is necessary.

Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added.

Reviewers: sdong

Reviewed By: sdong

Subscribers: arahut, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60027
2016-08-01 14:50:19 -07:00
John Alexander 9430333f84 New Statistics to track Compression/Decompression (#1197)
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).

* Prevent incrementing compression counter when compression is turned off in options.

* Prevent incrementing compression counter when compression is turned off in options.

* Added two more supported compression types to test code in db_test.cc

* Prevent incrementing compression counter when compression is turned off in options.

* Added new StatsLevel that excludes compression timing.

* Fixed casting error in coding.h

* Fixed CompressionStatsTest for new StatsLevel.

* Removed unused variable that was breaking the Linux build
2016-07-19 09:44:03 -07:00
Jay Edgar efd013d6d8 Miscellaneous performance improvements
Summary:
I was investigating performance issues in the SstFileWriter and found all of the following:

- The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time.  Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function.
- In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key.
- The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed.  I added the move constructor which also required a copy constructor and assignment operator.
- The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added.  I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value.
- The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function.
- BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key).  It is more efficient just to copy the whole new key over.
- Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on.
- FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time.
- Each PutVarint??() function has a buffer and calls std::string::append().  Two or three calls in a row could share a buffer and a single call to std::string::append().

Some of these will be helpful outside of the SstFileWriter.  I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility?  I tried it on gcc 4.8 and 4.9.

Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added.  The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas.

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59607
2016-07-12 14:15:32 -07:00