Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118
Differential Revision: D6227247
Pulled By: maysamyabandeh
fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
Summary:
Add per-exe execution capability
Add fix parsing of groups/tests
Add timer test exclusion
Fix unit tests
Ifdef threadpool specific tests that do not pass on Vista threadpool.
Remove spurious outout from prefix_test so test case listing works
properly.
Fix not using standard test directories results in file creation errors
in sst_dump_test.
BlobDb fixes:
In C++ end() iterators can not be dereferenced. They are not valid.
When deleting blob_db_ set it to nullptr before any other code executes.
Not fixed:. On Windows you can not delete a file while it is open.
[ RUN ] BlobDBTest.ReadWhileGC
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
write_batch
Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152
Differential Revision: D6293274
Pulled By: sagar0
fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
Summary:
A race condition will happen when:
* a user thread writes a value, but it hits the write stop condition because there are too many un-flushed memtables, while holding blob_db_impl.write_mutex_.
* Flush is triggered and call flush begin listener and try to acquire blob_db_impl.write_mutex_.
Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3149
Differential Revision: D6279805
Pulled By: yiwu-arbug
fbshipit-source-id: 0e3c58afb78795ebe3360a2c69e05651e3908c40
Summary:
`compression` shadow the method name in `BlobFile`. Rename it.
Closes https://github.com/facebook/rocksdb/pull/3148
Differential Revision: D6274498
Pulled By: yiwu-arbug
fbshipit-source-id: 7d293596530998b23b6b8a8940f983f9b6343a98
Summary:
To fix the issue of failing to decompress existing value after reopen DB with a different compression settings.
Closes https://github.com/facebook/rocksdb/pull/3142
Differential Revision: D6267260
Pulled By: yiwu-arbug
fbshipit-source-id: c7cf7f3e33b0cd25520abf4771cdf9180cc02a5f
Summary:
Adds two new counters:
`key_lock_wait_count` counts how many times a lock was blocked by another transaction and had to wait, instead of being granted the lock immediately.
`key_lock_wait_time` counts the time spent acquiring locks.
Closes https://github.com/facebook/rocksdb/pull/3107
Differential Revision: D6217332
Pulled By: lth
fbshipit-source-id: 55d4f46da5550c333e523263422fd61d6a46deb9
Summary:
After move assignment, we need to re-initialized the moved PinnableSlice.
Also update blob_db_impl.cc to not reuse the moved PinnableSlice since it is supposed to be in an undefined state after move.
Closes https://github.com/facebook/rocksdb/pull/3127
Differential Revision: D6238585
Pulled By: yiwu-arbug
fbshipit-source-id: bd99f2e37406c4f7de160c7dee6a2e8126bc224e
Summary:
Fix unreleased snapshot at the end of the test.
Closes https://github.com/facebook/rocksdb/pull/3126
Differential Revision: D6232867
Pulled By: yiwu-arbug
fbshipit-source-id: 651ca3144fc573ea2ab0ab20f0a752fb4a101d26
Summary:
After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries.
Closes https://github.com/facebook/rocksdb/pull/3090
Differential Revision: D6183812
Pulled By: yiwu-arbug
fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83
Summary:
Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is
1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish
2. otherwise, it is the latest sequence of the file
Closes https://github.com/facebook/rocksdb/pull/3087
Differential Revision: D6182519
Pulled By: yiwu-arbug
fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
Summary:
Add an option to enable/disable auto garbage collection, where we keep counting how many keys have been evicted by either deletion or compaction and decide whether to garbage collect a blob file.
Default disable auto garbage collection for now since the whole logic is not fully tested and we plan to make major change to it.
Closes https://github.com/facebook/rocksdb/pull/3117
Differential Revision: D6224756
Pulled By: yiwu-arbug
fbshipit-source-id: cdf53bdccec96a4580a2b3a342110ad9e8864dfe
Summary:
The test intent to wait until key being overwritten until proceed with garbage collection. It failed to wait for `PutUntil` finally finish. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3116
Differential Revision: D6222833
Pulled By: yiwu-arbug
fbshipit-source-id: fa9b57a772b92a66cf250b44e7975c43f62f45c5
Summary:
Evict oldest blob file and put it in obsolete_files list when close to blob db size limit. The file will be delete when the `DeleteObsoleteFiles` background job runs next time.
For now I set `kEvictOldestFileAtSize` constant, which controls when to evict the oldest file, at 90%. It could be tweaked or made into an option if really needed; I didn't want to expose it as an option pre-maturely as there are already too many :) .
Closes https://github.com/facebook/rocksdb/pull/3094
Differential Revision: D6187340
Pulled By: sagar0
fbshipit-source-id: 687f8262101b9301bf964b94025a2fe9d8573421
Summary:
Move WritePreparedTxnDB from pessimistic_transaction_db.h to its own header, write_prepared_txn_db.h
Closes https://github.com/facebook/rocksdb/pull/3114
Differential Revision: D6220987
Pulled By: maysamyabandeh
fbshipit-source-id: 18893fb4fdc6b809fe117dabb544080f9b4a301b
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101
Differential Revision: D6199405
Pulled By: maysamyabandeh
fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071
Differential Revision: D6148023
Pulled By: maysamyabandeh
fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
Summary:
The collapse of duplicate keys in write batch needs to sort the indexes of duplicate keys since it only checks the index in the batch with the head of the list of duplicate keys.
Closes https://github.com/facebook/rocksdb/pull/3093
Differential Revision: D6186800
Pulled By: maysamyabandeh
fbshipit-source-id: abc9ae8c2f1840445a5584f925cf86ecc6f37154
Summary:
* cleanup num_concurrent_simple_blobs. We don't do concurrent writes (by taking write_mutex_) so it doesn't make sense to have multiple non TTL files open. We can revisit later when we want to improve writes.
* cleanup eviction callback. we don't have plan to use it now.
* rename s/open_simple_blob_files_/open_non_ttl_file_/ and s/open_blob_files_/open_ttl_files_/ to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/3088
Differential Revision: D6182598
Pulled By: yiwu-arbug
fbshipit-source-id: 99e6f5e01fa66d31309cdb06ce48502464bac6ad
Summary:
Changing blob file format and some code cleanup around the change. The change with blob log format are:
* Remove timestamp field in blob file header, blob file footer and blob records. The field is not being use and often confuse with expiration field.
* Blob file header now come with column family id, which always equal to default column family id. It leaves room for future support of column family.
* Compression field in blob file header now is a standalone byte (instead of compact encode with flags field)
* Blob file footer now come with its own crc.
* Key length now being uint64_t instead of uint32_t
* Blob CRC now checksum both key and value (instead of value only).
* Some reordering of the fields.
The list of cleanups:
* Better inline comments in blob_log_format.h
* rename ttlrange_t and snrange_t to ExpirationRange and SequenceRange respectively.
* simplify blob_db::Reader
* Move crc checking logic to inside blob_log_format.cc
Closes https://github.com/facebook/rocksdb/pull/3081
Differential Revision: D6171304
Pulled By: yiwu-arbug
fbshipit-source-id: e4373e0d39264441b7e2fbd0caba93ddd99ea2af
Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.
Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.
See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066
Differential Revision: D6142115
Pulled By: yiwu-arbug
fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
Summary:
I found that we continue accepting writes even when the blob db goes beyond the configured blob directory size limit. Now, we return an error for writes on reaching `blob_dir_size` limit and if `is_fifo` is set to false. (We cannot just drop any file when `is_fifo` is true.)
Deleting the oldest file when `is_fifo` is true will be handled in a later PR.
Closes https://github.com/facebook/rocksdb/pull/3060
Differential Revision: D6136156
Pulled By: sagar0
fbshipit-source-id: 2f11cb3f2eedfa94524fbfa2613dd64bfad7a23c
Summary:
A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed.
Closes https://github.com/facebook/rocksdb/pull/3061
Differential Revision: D6138800
Pulled By: ajkr
fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.
My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842
Differential Revision: D5770600
Pulled By: yiwu-arbug
fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048
Differential Revision: D6126272
Pulled By: maysamyabandeh
fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
Summary:
Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config.
Closes https://github.com/facebook/rocksdb/pull/3046
Differential Revision: D6106534
Pulled By: maysamyabandeh
fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c
Summary:
Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db.
Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it.
Closes https://github.com/facebook/rocksdb/pull/3000
Differential Revision: D6050044
Pulled By: yiwu-arbug
fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632
Summary:
Previously each time we write a blob we write blog_record_header + key + value + blob_record_footer to blob log. The footer only contains a sequence and a crc for the sequence number. The sequence number was used in garbage collection to verify the value is recent. After #2703 we moved to use optimistic transaction and no longer use sequence number from the footer. Remove the footer altogether.
There's another usage of sequence number and we are keeping it: Each blob log file keep track of sequence number range of keys in it, and use it to check if it is reference by a snapshot, before being deleted.
Closes https://github.com/facebook/rocksdb/pull/3005
Differential Revision: D6057585
Pulled By: yiwu-arbug
fbshipit-source-id: d6da53c457a316e9723f359a1b47facfc3ffe090
Summary:
On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed.
Closes https://github.com/facebook/rocksdb/pull/2981
Differential Revision: D6001471
Pulled By: yiwu-arbug
fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe
Summary:
On WritePreparedTxnDB destruct there could be running compaction/flush holding a SnapshotChecker, which holds a pointer back to WritePreparedTxnDB. Make sure those jobs finished before destructing WritePreparedTxnDB.
This is caught by TransactionTest::SeqAdvanceTest.
Closes https://github.com/facebook/rocksdb/pull/2982
Differential Revision: D6002957
Pulled By: yiwu-arbug
fbshipit-source-id: f1e70390c9798d1bd7959f5c8e2a1c14100773c3
Summary:
Enabled WAL, during GC, for blob index which is stored on regular RocksDB.
Closes https://github.com/facebook/rocksdb/pull/2975
Differential Revision: D5997384
Pulled By: sagar0
fbshipit-source-id: b76c1487d8b5be0e36c55e8d77ffe3d37d63d85b
Summary:
Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator:
* check if key has been committed. If not, output uncommitted keys AS-IS.
* use SnapshotChecker to check if key is visible to a snapshot when in need.
* do not output key with seq = 0 if the key is not committed.
Closes https://github.com/facebook/rocksdb/pull/2926
Differential Revision: D5902907
Pulled By: yiwu-arbug
fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
Summary:
Compaction will output keys with sequence number 0, if it is visible to
earliest snapshot. Adding a test to make sure IsInSnapshot() report sequence number 0 is
visible to any snapshot.
Closes https://github.com/facebook/rocksdb/pull/2974
Differential Revision: D5990665
Pulled By: yiwu-arbug
fbshipit-source-id: ef50ebc777ff8ca688771f3ab598c7a609b0b65e
Summary:
With WriteCommitted, when the write batch has duplicate keys, the txn db simply inserts them to the db with different seq numbers and let the db ignore/merge the duplicate values at the read time. With WritePrepared all the entries of the batch are inserted with the same seq number which prevents us from benefiting from this simple solution.
This patch applies a hackish solution to unblock the end-to-end testing. The hack is to be replaced with a proper solution soon. The patch simply detects the duplicate key insertions, and mark the previous one as obsolete. Then before writing to the db it rewrites the batch eliminating the obsolete keys. This would incur a memcpy cost. Furthermore handing duplicate merge would require to do FullMerge instead of simply ignoring the previous value, which is not handled by this patch.
Closes https://github.com/facebook/rocksdb/pull/2969
Differential Revision: D5976337
Pulled By: maysamyabandeh
fbshipit-source-id: 114e65b66f137d8454ff2d1d782b8c05da95f989
Summary:
Implement the rollback of WritePrepared txns. For each modified value, it reads the value before the txn and write it back. This would cancel out the effect of transaction. It also remove the rolled back txn from prepared heap.
Closes https://github.com/facebook/rocksdb/pull/2946
Differential Revision: D5937575
Pulled By: maysamyabandeh
fbshipit-source-id: a6d3c47f44db3729f44b287a80f97d08dc4e888d
Summary:
Now that RocksDB supports conditional merging during point lookups (introduced in #2923), Cassandra value merge operator can be updated to pass in a limit. The limit needs to be passed in from the Cassandra code.
Closes https://github.com/facebook/rocksdb/pull/2947
Differential Revision: D5938454
Pulled By: sagar0
fbshipit-source-id: d64a72d53170d8cf202b53bd648475c3952f7d7f
Summary:
It was broken when `NotifyCollectTableCollectorsOnFinish` was introduced. That function called `Finish` on each of the `TablePropertiesCollector`s, and `CompactOnDeletionCollector::Finish()` was resetting all its internal state. Then, when we checked whether compaction is necessary, the flag had already been cleared.
Fixed above issue by avoiding resetting internal state during `Finish()`. Multiple calls to `Finish()` are allowed, but callers cannot invoke `AddUserKey()` on the collector after any finishes.
Closes https://github.com/facebook/rocksdb/pull/2936
Differential Revision: D5918659
Pulled By: ajkr
fbshipit-source-id: 4f05e9d80e50ee762ba1e611d8d22620029dca6b
Summary:
Recover txns from the WAL. Also added some unit tests.
Closes https://github.com/facebook/rocksdb/pull/2901
Differential Revision: D5859596
Pulled By: maysamyabandeh
fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0
Summary:
SUMMARY
Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed.
TEST PLAN
ran make check
all passed
Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value.
Closes https://github.com/facebook/rocksdb/pull/2893
Reviewed By: yiwu-arbug
Differential Revision: D5845814
Pulled By: TheRushingWookie
fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
Summary:
Looks like the API is simply missing. Adding it.
Closes https://github.com/facebook/rocksdb/pull/2937
Differential Revision: D5919955
Pulled By: yiwu-arbug
fbshipit-source-id: 6e2e9c96c29882b0bb4113d1f8efb72bffc57878
Summary:
The test didn't delete txn before creating a new one.
Closes https://github.com/facebook/rocksdb/pull/2913
Differential Revision: D5880236
Pulled By: yiwu-arbug
fbshipit-source-id: 7a4fcaada3d86332292754502cd8f4341143bf4f
Summary:
In our testing cluster, we found large amount tombstone has been promoted to kValue type from kMerge after reaching the top level of compaction. Since we used to only collecting tombstone in merge operator, those tombstones can never be collected.
This PR addresses the issue by adding a GC step in compaction filter, which is only for kValue type records. Since those record already reached the top of compaction (no earlier data exists) we can safely remove them in compaction filter without worrying old data appears.
This PR also removes an old optimization in cassandra merge operator for single merge operands. We need to do GC even on a single operand, so the optimation does not make sense anymore.
Closes https://github.com/facebook/rocksdb/pull/2855
Reviewed By: sagar0
Differential Revision: D5806445
Pulled By: wpc
fbshipit-source-id: 6eb25629d4ce917eb5e8b489f64a6aa78c7d270b
Summary:
By default the seq number in DB is increased once per written key. WritePrepared txns requires the seq to be increased once per the entire batch so that the seq would be used as the prepare timestamp by which the transaction is identified. Also we need to increase seq for the commit marker since it would give a unique id to the commit timestamp of transactions.
Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch.
Closes https://github.com/facebook/rocksdb/pull/2885
Differential Revision: D5837843
Pulled By: maysamyabandeh
fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c
Summary:
We had two proposals for lock-free commit maps. This patch implements the latter one that was simpler. We can later experiment with both proposals.
In this impl each entry is an std::atomic of uint64_t, which are accessed via memory_order_acquire/release. In x86_64 arch this is compiled to simple reads and writes from memory.
Closes https://github.com/facebook/rocksdb/pull/2861
Differential Revision: D5800724
Pulled By: maysamyabandeh
fbshipit-source-id: 41abae9a4a5df050a8eb696c43de11c2770afdda
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868
Differential Revision: D5804152
Pulled By: axxufb
fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
Summary:
There are internal users who open BackupEngine for writing new backups only, and they don't care whether old backups can be read or not. The condition `BackupableDBOptions::max_valid_backups_to_open == 0` should be supported (previously in df74b775e6 I made the mistake of choosing 0 as a special value to disable the limit).
Closes https://github.com/facebook/rocksdb/pull/2819
Differential Revision: D5751599
Pulled By: ajkr
fbshipit-source-id: e73ac19eb5d756d6b68601eae8e43407ee4f2752
Summary:
Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress.
I have to disable the final key word in debug build, as a hack test class depends on overriding the class.
Closes https://github.com/facebook/rocksdb/pull/2860
Differential Revision: D5800461
Pulled By: siying
fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850
Differential Revision: D5787375
Pulled By: maysamyabandeh
fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
Summary:
This patch advances the max_evicted_seq_ is larger granularities to reduce the overhead of updating the relevant data structures.
It also refactor the related code and adds testing to that. As part of this patch some of the TODOs for removing usage of non-static const members are also addressed.
Closes https://github.com/facebook/rocksdb/pull/2844
Differential Revision: D5772928
Pulled By: maysamyabandeh
fbshipit-source-id: f4fcc2948be69c034f10812cf922ce5ab82ef98c
Summary:
TransactionCallback was never used. Remove it to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/2853
Differential Revision: D5787219
Pulled By: maysamyabandeh
fbshipit-source-id: e2b6a89537e3770a269ad38be71c4b0b160a88ac
Summary:
The patch skips write_prepared_transaction_test from travis as they time out there. They are still covered in daily runs of tests.
Closes https://github.com/facebook/rocksdb/pull/2836
Differential Revision: D5767203
Pulled By: maysamyabandeh
fbshipit-source-id: 51045ef98a745197136e14b2ec02fc6f38081b75
Summary:
Add -DPORTABLE=1
port::cacheline_aligned_alloc() has arguments swapped which prevents every single test from running.
Closes https://github.com/facebook/rocksdb/pull/2815
Differential Revision: D5751661
Pulled By: siying
fbshipit-source-id: e0857d6e138ec46035b3c23d7c3c751901a0a4a0
Summary:
Backup engine is intentionally openable even when some backups are corrupt. Previously the engine could write new backups as long as the most recent backup wasn't corrupt. This PR makes the backup engine able to create new backups even when the most recent one is corrupt.
We now maintain two ID instance variables:
- `latest_backup_id_` is used when creating backup to choose the new ID
- `latest_valid_backup_id_` is used when restoring latest backup since we want most recent valid one
Closes https://github.com/facebook/rocksdb/pull/2804
Differential Revision: D5734148
Pulled By: ajkr
fbshipit-source-id: db440707b31df2c7b084188aa5f6368449e10bcf
Summary:
Remove cassandra tombstone when reaching the max compaction level (full merge). if all columns collected key will be removed in next compaction via compaction filter
Closes https://github.com/facebook/rocksdb/pull/2791
Reviewed By: sagar0
Differential Revision: D5722465
Pulled By: wpc
fbshipit-source-id: 61e9898a5686551653a16383255aeaab3197e65e
Summary:
Divide the old snapshots to two lists: a few that fit into a cached array and the rest in a vector, which is expected to be empty in normal cases. The former is to optimize concurrent reads from snapshots without requiring locks. It is done by an array of std::atomic, from which std::memory_order_acquire reads are compiled to simple read instructions in most of the x86_64 architectures.
Closes https://github.com/facebook/rocksdb/pull/2758
Differential Revision: D5660504
Pulled By: maysamyabandeh
fbshipit-source-id: 524fcf9a8e7f90a92324536456912a99aaa6740c
Summary:
Fixing flaky blob_db_test.
To close a blob file, blob db used to add a CloseSeqWrite job to the background thread to close it. Changing file close to be synchronous in order to simplify logic, and fix flaky blob_db_test.
Closes https://github.com/facebook/rocksdb/pull/2787
Differential Revision: D5699387
Pulled By: yiwu-arbug
fbshipit-source-id: dd07a945cd435cd3808fce7ee4ea57817409474a
Summary:
Allow `Slice` holding nullptr as a sentinel value but not in comparisons. This new restriction eliminates the need for the manual checks in 39ef900551, while still conforming to glibc's `memcmp` API. Thanks siying for the idea. Users may need to migrate, so mentioned it in HISTORY.md.
Closes https://github.com/facebook/rocksdb/pull/2777
Differential Revision: D5686016
Pulled By: ajkr
fbshipit-source-id: 03a2ca3fd9a0ebade9d0d5686c81d59a9534f563
Summary:
The ::Get from DB is not augmented with an overload method that takes a PinnableSlice instead of a string. Transactions however are not yet upgraded to use the new API. As a result, transaction users such as MyRocks cannot benefit from it. This patch updates the transactional API with a PinnableSlice overload.
Closes https://github.com/facebook/rocksdb/pull/2736
Differential Revision: D5645770
Pulled By: maysamyabandeh
fbshipit-source-id: f6af520df902f842de1bcf99bed3e8dfc43ad96d
Summary:
If GC kicks in between
* A Get() reads index entry from base db.
* The Get() read from a blob file
The GC can delete the corresponding blob file, making the key not found. Fortunately we have existing logic to avoid deleting a blob file if it is referenced by a snapshot. So the fix is to explicitly create a snapshot before reading index entry from base db.
Closes https://github.com/facebook/rocksdb/pull/2754
Differential Revision: D5655956
Pulled By: yiwu-arbug
fbshipit-source-id: e4ccbc51331362542e7343175bbcbdea5830f544
Summary:
When out of space, blob db should GC the oldest file. The current implementation GC the newest one instead. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/2757
Differential Revision: D5657611
Pulled By: yiwu-arbug
fbshipit-source-id: 56c30a4c52e6ab04551dda8c5c46006d4070b28d
Summary:
Changes:
* extended the wait_txn_map to track additional information
* designed circular buffer to store n latest deadlocks' information
* added test coverage to verify the additional information tracked is accurately stored in the buffer
Closes https://github.com/facebook/rocksdb/pull/2630
Differential Revision: D5478025
Pulled By: armishra
fbshipit-source-id: 2b138de7b5a73f5ca554fc3ff8220a3be49f39e7
Summary:
On initial call to BlobDBImpl::WaStats() `all_periods_write_` would be empty, so it will crash when we call pop_front() at line 1627. Apparently it is mean to pop only when `all_periods_write_.size() > kWriteAmplificationStatsPeriods`.
The whole write amp calculation doesn't seems to be correct and it is not being exposed. Will work on it later.
Test Plan
Change kWriteAmplificationStatsPeriodMillisecs to 1000 (1 second) and run db_bench --use_blob_db for 5 minutes.
Closes https://github.com/facebook/rocksdb/pull/2751
Differential Revision: D5648269
Pulled By: yiwu-arbug
fbshipit-source-id: b843d9a09bb5f9e1b713d101ec7b87e54b5115a4
Summary:
Updating Cassandra merge operator to make use of a single merge operand when needed. Single merge operand support has been introduced in #2721.
Closes https://github.com/facebook/rocksdb/pull/2753
Differential Revision: D5652867
Pulled By: sagar0
fbshipit-source-id: b9fbd3196d3ebd0b752626dbf9bec9aa53e3e26a
Summary:
Implement the main body of WritePrepared pseudo code. This includes PrepareInternal and CommitInternal, as well as AddCommitted which updates the commit map. It also provides a IsInSnapshot method that could be later called form the read path to decide if a version is in the read snapshot or it should other be skipped.
This patch lacks unit tests and does not attempt to offer an efficient implementation. The idea is that to have the API specified so that we can work on related tasks in parallel.
Closes https://github.com/facebook/rocksdb/pull/2713
Differential Revision: D5640021
Pulled By: maysamyabandeh
fbshipit-source-id: bfa7a05e8d8498811fab714ce4b9c21530514e1c
Summary:
`PartialMergeMulti` implementation is enough for Cassandra, and `PartialMerge` is not required. Implementing both will just duplicate the code.
As per https://github.com/facebook/rocksdb/blob/master/include/rocksdb/merge_operator.h#L130-L135 :
```
// The default implementation of PartialMergeMulti will use this function
// as a helper, for backward compatibility. Any successor class of
// MergeOperator should either implement PartialMerge or PartialMergeMulti,
// although implementing PartialMergeMulti is suggested as it is in general
// more effective to merge multiple operands at a time instead of two
// operands at a time.
```
Closes https://github.com/facebook/rocksdb/pull/2737
Reviewed By: scv119
Differential Revision: D5633073
Pulled By: sagar0
fbshipit-source-id: ef4fa102c22fec6a0175ed12f5c44c15afe3c8ca
Summary:
While GC, blob DB use optimistic transaction to delete or replace the index entry in LSM, to guarantee correctness if there's a normal write writing to the same key. However, the previous implementation doesn't call SetSnapshot() nor use GetForUpdate() of transaction API, instead it do its own sequence number checking before beginning the transaction. A normal write can sneak in after the sequence number check and overwrite the key, and the GC will delete or relocate the old version of the key by mistake. Update the code to property use GetForUpdate() to check the existing index entry.
After the patch the sequence number store with each blob record is useless, So I'm considering remove the sequence number from blob record, in another patch.
Closes https://github.com/facebook/rocksdb/pull/2703
Differential Revision: D5589178
Pulled By: yiwu-arbug
fbshipit-source-id: 8dc960cd5f4e61b36024ba7c32d05584ce149c24
Summary:
This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
Closes https://github.com/facebook/rocksdb/pull/2691
Differential Revision: D5569464
Pulled By: maysamyabandeh
fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325
Summary:
This opens space for the new implementations of TransactionDBImpl such as WritePreparedTxnDBImpl that has a different policy of how to write to DB.
Closes https://github.com/facebook/rocksdb/pull/2689
Differential Revision: D5568918
Pulled By: maysamyabandeh
fbshipit-source-id: f7eac866e175daf3793ae79da108f65cc7dc7b25
Summary:
The FsyncFiles background job call Fsync() periodically for blob files. However it can access WritableFileWriter concurrently with a Put() or Write(). And WritableFileWriter does not support concurrent access. It will lead to WritableFileWriter buffer being flush with same content twice, and blob file end up corrupted. Fixing by simply let FsyncFiles hold write_mutex_.
Closes https://github.com/facebook/rocksdb/pull/2685
Differential Revision: D5561908
Pulled By: yiwu-arbug
fbshipit-source-id: f0bb5bcab0e05694e053b8c49eab43640721e872
Summary:
The current blob db implementation use mix of int32_t, uint32_t and uint64_t for TTL and expiration. Update all timestamps to uint64_t for consistency.
Closes https://github.com/facebook/rocksdb/pull/2683
Differential Revision: D5557103
Pulled By: yiwu-arbug
fbshipit-source-id: e4eab2691629a755e614e8cf1eed9c3a681d0c42
Summary:
I'm going with brute-force solution, just letting Put() and Write() holding a mutex before writing. May improve concurrent writing with finer granularity locking later.
Closes https://github.com/facebook/rocksdb/pull/2682
Differential Revision: D5552690
Pulled By: yiwu-arbug
fbshipit-source-id: 039abd675b5d274a7af6428198d1733cafecef4c
Summary:
Fix the bug where if blob db garbage collection revmoe keys with newer version. It shouldn't delete the key from base db when sequence number in base db is not equal to the one in blob log.
Closes https://github.com/facebook/rocksdb/pull/2678
Differential Revision: D5549752
Pulled By: yiwu-arbug
fbshipit-source-id: abb8649260963b5c389748023970fd746279d227
Summary:
This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation.
Closes https://github.com/facebook/rocksdb/pull/2676
Differential Revision: D5549998
Pulled By: maysamyabandeh
fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
Summary:
* Dump blob db options to info log
* Remove BlobDBOptionsImpl to disallow dynamic cast *BlobDBOptions into *BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645
* Change some of the default options
* Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon.
Closes https://github.com/facebook/rocksdb/pull/2671
Differential Revision: D5529912
Pulled By: yiwu-arbug
fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645
Differential Revision: D5502723
Pulled By: siying
fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
Summary:
I might have missed these while doing some recent cassandra code reviews.
Closes https://github.com/facebook/rocksdb/pull/2663
Differential Revision: D5520138
Pulled By: sagar0
fbshipit-source-id: 340930afe9efe03c75f535a1da1f89bd3e53c1f9
Summary:
Simple component that will add a new entry in a log file every time we lookup/insert a key in SimCache.
API:
```
SimCache::StartActivityLogging(<file_name>, <env>, <optional_max_size>)
SimCache::StopActivityLogging()
```
Sending for review, Still need to add more comments.
I was thinking about a better approach, but I ended up deciding I will use a mutex to sync the writes to the file, since this feature should not be heavily used and only used to collect info that will be analyzed offline. I think it's okay to hold the mutex every time we lookup/add to the SimCache.
Closes https://github.com/facebook/rocksdb/pull/2295
Differential Revision: D5063826
Pulled By: IslamAbdelRahman
fbshipit-source-id: f3b5daed8b201987c9a071146ddd5c5740a2dd8c
Summary:
Introducing blob_db::TTLExtractor to replace extract_ttl_fn. The TTL
extractor can be use to extract TTL from keys insert with Put or
WriteBatch. Change over existing extract_ttl_fn are:
* If value is changed, it will be return via std::string* (rather than Slice*). With Slice* the new value has to be part of the existing value. With std::string* the limitation is removed.
* It can optionally return TTL or expiration.
Other changes in this PR:
* replace `std::chrono::system_clock` with `Env::NowMicros` so that I can mock time in tests.
* add several TTL tests.
* other minor naming change.
Closes https://github.com/facebook/rocksdb/pull/2659
Differential Revision: D5512627
Pulled By: yiwu-arbug
fbshipit-source-id: 0dfcb00d74d060b8534c6130c808e4d5d0a54440
Summary:
Currently this test times out with tsan. This is likely due to decreased speed with tsan. By lowering the number of iterations we can still catch a bug as the test is run regularly and multiple runs of the test is equivalent with running the test with more iterations.
Closes https://github.com/facebook/rocksdb/pull/2639
Differential Revision: D5490549
Pulled By: maysamyabandeh
fbshipit-source-id: bd69c42a9728d337ac95a06a401088384e51731a
Summary:
Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily.
Closes https://github.com/facebook/rocksdb/pull/2621
Differential Revision: D5464500
Pulled By: siying
fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06