Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.
Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument
This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.
DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.
Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.
### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
As you can see, especially once warmed up, xxh3 is fastest.
### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
Test
for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
Results (ops/sec)
for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3
Reviewed By: mrambacher
Differential Revision: D31905249
Pulled By: pdillinger
fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
Summary:
Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050
Test Plan: added new tests
Reviewed By: anand1976
Differential Revision: D31744769
Pulled By: zhichao-cao
fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.
In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)
This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
Reviewed By: riversand963
Differential Revision: D31489850
Pulled By: ltamasi
fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
Summary:
Made SliceTransform into a Customizable class.
Would be nice to write a test that stored and used a custom transform in an SST table.
There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter. Is this expected?
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641
Reviewed By: zhichao-cao
Differential Revision: D31142793
Pulled By: mrambacher
fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
Summary:
- Fix issue with OptionType::Vector when the nested item is a Customizable with no names
- Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector;
- Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options;
- Fix/Add tests for null/empty customizable objects
- Move the RegisterTestObjects from customizable_test into testutil.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566
Reviewed By: zhichao-cao
Differential Revision: D30303724
Pulled By: mrambacher
fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
- Added Options/Configurable/Object Registration for TTL and Cassandra variants
- Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option
- Added tests for new functionality
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481
Reviewed By: zhichao-cao
Differential Revision: D30136050
Pulled By: mrambacher
fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
Summary:
Some URIs for creating instances (ala SecondaryCache) use complex URIs like (cache://name;prop=value). These URIs were treated as name-value properties. With this change, if the URI does not contain an "id=XX" setting, it will be treated as a single string value (and not an ID and map of name-value properties).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8547
Reviewed By: anand1976
Differential Revision: D29741386
Pulled By: mrambacher
fbshipit-source-id: 0621f62bec3a6699a7b66c7c0b5634b2856653aa
Summary:
If a Customizable option was not mutable, it would still appear in the list of mutable options when serialized. This meant that when the immutable options were used to configure another immutable object, an "option not changeable" status would be returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8457
Reviewed By: zhichao-cao
Differential Revision: D29428298
Pulled By: mrambacher
fbshipit-source-id: 3945b0b822f8e5955a7c5590fe64dfd5bc1fe6a0
Summary:
Makes the Comparator class into a Customizable object. Added/Updated the CreateFromString method to create Comparators. Added test for using the ObjectRegistry to create one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8336
Reviewed By: jay-zhuang
Differential Revision: D28999612
Pulled By: mrambacher
fbshipit-source-id: bff2cb2814eeb9fef6a00fddc61d6e34b6fbcf2e
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288
Test Plan: Pass all existing tests
Reviewed By: ajkr
Differential Revision: D28369194
fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
Summary:
The functions will be used for remote compaction parameter
input and result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8247
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D28104680
Pulled By: jay-zhuang
fbshipit-source-id: c0a5178e6277125118384278efea2acbf90aa6cb
Summary:
Added ParseType, SerializeType, and TypesAreEqual methods to OptionTypeInfo. These methods can be used for serialization and deserialization of basic types.
Change the MutableCF/DB Options to use this format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8249
Reviewed By: jay-zhuang
Differential Revision: D28351190
Pulled By: mrambacher
fbshipit-source-id: 72a78643b804f2f0bf59c32ffefa63346672ad16
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used
The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166
Reviewed By: zhichao-cao
Differential Revision: D27657952
Pulled By: mrambacher
fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.
readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).
There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176
Reviewed By: pdillinger
Differential Revision: D27954339
Pulled By: mrambacher
fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
Test Plan: add new unit test, pass make check/ make asan_check
Reviewed By: pdillinger
Differential Revision: D24313271
Pulled By: zhichao-cao
fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
Summary:
"allow_data_in_errors" is not updated in BuildOptions. So it
would assume default value when BuildOptions is called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7665
Test Plan: make check -j64
Reviewed By: zhichao-cao
Differential Revision: D24929100
Pulled By: akankshamahajan15
fbshipit-source-id: dd6225a6c9f13b20027ff1b6de8e79801b57b3f7
Summary:
This patch simply adds a couple of options that will enable users to
configure garbage collection when using the integrated BlobDB
implementation. The actual GC logic will be added in a separate step.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7661
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24906544
Pulled By: ltamasi
fbshipit-source-id: ee0e056a712a4b4475cd90de8b27d969bd61b7e1
Summary:
The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID. Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and a method to instantiate them (TableFactory::CreateFromString). Customizable objects can be registered with the ObjectRegistry and created dynamically.
Future PRs will make more types of objects extend Customizable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590
Reviewed By: cheng-chang
Differential Revision: D24841553
Pulled By: zhichao-cao
fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155
Summary:
In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it
for bottom level under the assumption that the vast majority of data is
stored there. At that time, there was no
such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just
hardcoded disabling dictionary compression in flush and compactions to
non-bottommost level. Now, we have users who generate all their files
through flush and are considering using dictionary compression.
To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to
additionally include flushed files and files generated by compaction to
a non-bottommost level. Users can still get the old behavior by moving
their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts`
and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619
Reviewed By: ltamasi
Differential Revision: D24665610
Pulled By: ajkr
fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c
Summary:
Further refinement of the earlier PR. Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563
Reviewed By: zhichao-cao
Differential Revision: D24422491
Pulled By: ajkr
fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67
Summary:
This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479
Test Plan: Add new unit tests
Reviewed By: pdillinger
Differential Revision: D24340671
Pulled By: anand1976
fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2
Summary:
This option determines whether WALs will be tracked in MANIFEST and verified on recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275
Test Plan:
db_options_test
options_test
Reviewed By: pdillinger
Differential Revision: D23181418
Pulled By: cheng-chang
fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547
Summary:
Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated.
Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467
Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled.
Reviewed By: riversand963
Differential Revision: D24010683
fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2),
it's not needed to append a colon explicitly to 'msg'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041
Reviewed By: ajkr
Differential Revision: D22292801
fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
Reviewed By: anand1976
Differential Revision: D21916789
Pulled By: zhichao-cao
fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
Summary:
It's useful to build RocksDB using a more recent clang version in CI. Add a CircleCI build and fix some issues with it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7025
Test Plan: See all tests pass.
Reviewed By: pdillinger
Differential Revision: D22215700
fbshipit-source-id: 914a729c2cd3f3ac4a627cc0ac58d4691dca2168
Summary:
The OptionTypeInfo::Vector method allows a vector<T> to be converted to/from strings via the options.
The kVectorInt and kVectorCompressionType vectors were replaced with this methodology.
As part of this change, the NextToken method was added to the OptionTypeInfo. This method was refactored from code within the StringToMap function.
Future types that could use this functionality include the EventListener vectors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6424
Reviewed By: cheng-chang
Differential Revision: D21832368
Pulled By: zhichao-cao
fbshipit-source-id: e1ca766faff139d54e6e8407a9ec09ece6517439
Summary:
* Print stack trace on status checked failure
* Make folly_synchronization_distributed_mutex_test a parallel test
* Disable ldb_test.py and rocksdb_dump_test.sh with
ASSERT_STATUS_CHECKED (broken)
* Fix shadow warning in random_access_file_reader.h reported by gcc
4.8.5 (ROCKSDB_NO_FBCODE), also https://github.com/facebook/rocksdb/issues/6866
* Work around compiler bug on max_align_t for gcc < 4.9
* Remove an apparently wrong comment in status.h
* Use check_some in Travis config (for proper diagnostic output)
* Fix ignored Status in loop in options_helper.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6871
Test Plan: manual, CI
Reviewed By: ajkr
Differential Revision: D21706619
Pulled By: pdillinger
fbshipit-source-id: daf6364173d6689904eb394461a69a11f5bee2cb
Summary:
Fixed some option handling code that recently broke the
ASSERT_STATUS_CHECKED build for options_test.
Added all other existing tests that pass under ASSERT_STATUS_CHECKED to
the whitelist.
Added a Travis configuration to run all whitelisted tests with
ASSERT_STATUS_CHECKED. (Someday we might enable this check by default in
debug builds.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6870
Test Plan: ASSERT_STATUS_CHECKED=1 make check, Travis
Reviewed By: ajkr
Differential Revision: D21704374
Pulled By: pdillinger
fbshipit-source-id: 15daef98136a19d7a6843fa0c9ec08738c2ac693
Summary:
Added code for generically handing structs to OptionTypeInfo. A struct is a collection of variables handled by their own map of OptionTypeInfos. Examples of structs include Compaction and Cache options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6425
Reviewed By: siying
Differential Revision: D21668789
Pulled By: zhichao-cao
fbshipit-source-id: 064b110de39dadf82361ed4663f7ac1a535b0b07
Summary:
Tried making Status object enforce that it is checked in some way. In cases it is not checked, `PermitUncheckedError()` must be called explicitly.
Added a way to run tests (`ASSERT_STATUS_CHECKED=1 make -j48 check`) on a
whitelist. The effort appears significant to get each test to pass with
this assertion, so I only fixed up enough to get one test (`options_test`)
working and added it to the whitelist.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6798
Reviewed By: pdillinger
Differential Revision: D21377404
Pulled By: ajkr
fbshipit-source-id: 73236f9c8df38f01cf24ecac4a6d1661b72d077e
Summary:
Add methods and constructors for handling enums to the OptionTypeInfo. This change allows enums to be converted/compared without adding a special "type" to the OptionType.
This change addresses a couple of issues:
- It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods)
- It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes
- It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values
- It reduces the number of types and cases that need to be handled in the various OptionType methods
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423
Reviewed By: siying
Differential Revision: D21408713
Pulled By: zhichao-cao
fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626
Summary:
Added functions for parsing, serializing, and comparing elements to OptionTypeInfo. These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map. Using these functions, every type can be handled via the map rather than special cased.
By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422
Test Plan: pass make check
Reviewed By: siying
Differential Revision: D21269005
Pulled By: zhichao-cao
fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
Summary:
GetSupportedCompressions needs to be defined under LITE.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6758
Test Plan: build under LITE
Reviewed By: zhichao-cao
Differential Revision: D21247937
Pulled By: cheng-chang
fbshipit-source-id: 880e59d3e107cdd736d16427a68c5641d1318fb4
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
Reviewed By: siying
Differential Revision: D21163707
Pulled By: zhichao-cao
fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
Summary:
1. Add changes so that max_background_flushes can be set dynamically.
2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the
max_background_flushes dynamically using SetDBOptions.
TestPlan: 1. make -j64 check
2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701
Reviewed By: ajkr
Differential Revision: D21028010
Pulled By: akankshamahajan15
fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1
Summary:
… to CFOptions
https://github.com/facebook/rocksdb/pull/6615 made several compression related options dynamically changeable. They are moved to MutableCFOptions. However, they are not copied back to ColumnFamilyOptions, so the changed values are not written to option files and for some other uses. Fix it by copying them back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6668
Test Plan: Add a unit test to make sure that when a MutableCFOptions is converted to CFOptions and back to MutableCFOptions, they stay the same. This test would fail without the fix.
Reviewed By: ajkr
Differential Revision: D20923999
fbshipit-source-id: c3bccd6923b00d677764e2269bed6a95ad7ed780
Summary:
This is a predecessor to the Configurable PR. This change moves the OptionTypeInfo maps closer to where they will be used.
When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198
Reviewed By: siying
Differential Revision: D20778108
Pulled By: zhichao-cao
fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
Reviewed By: ajkr
Differential Revision: D20651306
fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
Summary:
These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615
Test Plan: Add a unit test to make sure that SetOptions() can change the options.
Reviewed By: riversand963
Differential Revision: D20755951
fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
Test Plan: tested with make asan_check
Reviewed By: riversand963
Differential Revision: D20717670
Pulled By: zhichao-cao
fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Summary:
The current Env/FileSystem API separation has a couple of issues -
1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation.
2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.
This PR solves the above issues and simplifies the migration in the following ways -
1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```.
1a. - This also makes it more robust by ensuring that even if RocksDB
has some stray calls to Env APIs rather than FileSystem, they will go
through the same object and thus there is no risk of getting out of
sync.
2. Provide a ```NewCompositeEnv()``` API that can be used to construct a
PosixEnv with a custom FileSystem implementation. This eliminates an
indirection to call Env APIs, and relieves the FileSystem developer of
the burden of having to implement wrappers for the Env APIs.
3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and
```NewLogger()```
Tests:
1. New unit tests
2. make check and make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552
Reviewed By: riversand963
Differential Revision: D20592038
Pulled By: anand1976
fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.
Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334
Reviewed By: anand1976
Differential Revision: D19778960
Pulled By: riversand963
fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.
Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
Differential Revision: D19171461
Pulled By: zhichao-cao
fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.
If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
Differential Revision: D19656425
Pulled By: al13n321
fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
The max batch size that we can write to the WAL is controlled by a static manner. So if the leader write is less than 128 KB we will have the batch size as leader write size + 128 KB else the limit will be 1 MB. Both of them are statically defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5759
Differential Revision: D17329298
fbshipit-source-id: a3d910629d8d8ca84ea39ad89c2b2d284571ded5
Summary:
Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
The goal of this PR is solve this problem by moving the DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725
Test Plan: Added unit tests.
Differential Revision: D16963840
Pulled By: vjnadimpalli
fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
Differential Revision: D14394062
Pulled By: miasantreble
fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method).
This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not.
When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.
Test plan (on riversand963's devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293
Differential Revision: D16363396
Pulled By: riversand963
fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
Summary:
Added log_readahead_size option to control prefetching for Log::Reader.
This is mostly useful for reading a remotely located log, as it can save the number of round-trips when reading it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5592
Differential Revision: D16362989
Pulled By: elipoz
fbshipit-source-id: c5d4d5245a44008cd59879640efff70c091ad3e8
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046
Differential Revision: D15863138
Pulled By: miasantreble
fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
Summary:
When dynamically setting options, we check the option type info and skip options that are marked deprecated. However this check is only done at top level, which results in bugs where SetOptions will corrupt option values and cause unexpected system behavior iff a deprecated second level option is set dynamically.
For exmaple, the following call:
```
dbfull()->SetOptions(
{{"compaction_options_fifo",
"{allow_compaction=true;max_table_files_size=1024;ttl=731;}"}});
```
was from pre 6.0 release when `ttl` was part of `compaction_options_fifo`. Now that it got moved out of `compaction_options_fifo`, this call will incorrectly set `compaction_options_fifo.max_table_files_size` to 731 (as `max_table_files_size` is the first one in `OptionsHelper::fifo_compaction_options_type_info` struct) and cause files to gett evicted much faster than expected.
This PR adds verification to second level options like `compaction_options_fifo.ttl` or `compaction_options_fifo.max_table_files_size` when set dynamically, and filter out those marked as deprecated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5367
Differential Revision: D15530998
Pulled By: miasantreble
fbshipit-source-id: 818258be5c3abe09cd82d62f3c083572d70fecdd
Summary:
In options_helper.cc various functions take a const unordered_map of
string -> TypeInfo for options handling. These functions pass by-value
the (const) maps, resulting in unnecessary copies.
Change to pass by reference.
This results in a noticable reduction in the amount of time spent
parsing options - in my case a set of unit tests using RocksDB which
call SetOptions() to modify options see a ~25% runtime reduction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5295
Differential Revision: D15296334
Pulled By: riversand963
fbshipit-source-id: 4d4be3db635264943607911b296dda27fd7ce1a7
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.
Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
Differential Revision: D15219029
Pulled By: maysamyabandeh
fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list.
This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
Differential Revision: D15203291
Pulled By: maysamyabandeh
fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
Summary:
Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269
Differential Revision: D15151285
Pulled By: maysamyabandeh
fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
Differential Revision: D15086710
Pulled By: maysamyabandeh
fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
Summary:
- By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object.
- Currently factory functions for `Env` registered with object registry should only return pointer to static `Env` objects. That's because `DBOptions::env` is a raw pointer so we cannot easily delegate cleanup.
- Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237
Differential Revision: D15056360
Pulled By: siying
fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e
Summary:
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.
My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
Differential Revision: D14953553
Pulled By: siying
fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
Summary:
Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174
Differential Revision: D14884185
Pulled By: al13n321
fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
Summary:
Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.
In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.
(EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043
Differential Revision: D14339233
Pulled By: al13n321
fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
Summary:
Allow customized merge operator to be loaded from option file/map/string
by allowing users to pre-regiester merge operators to object registry.
Also update HISTORY.md and header files for the same feature for comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5123
Differential Revision: D14658488
Pulled By: siying
fbshipit-source-id: 86ea2fbd2a0a04632d8ea9fceaffefd041f6ae61
Summary:
Even customized ldb may not be able to read data from some databases if
comparator is not standard. We modify option helper to get comparator from
object registry so that we can use customized ldb to read non-standard
comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5106
Differential Revision: D14622107
Pulled By: siying
fbshipit-source-id: 151dcb295a35a4c7d54f919cd4e322a89dc601c9
Summary:
This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.
The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.
The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842
Differential Revision: D13629011
Pulled By: shobhitdayal
fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
Summary:
This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.
(This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748
Differential Revision: D13961471
Pulled By: miasantreble
fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
Summary:
MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985
Differential Revision: D14118257
Pulled By: miasantreble
fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965
Differential Revision: D14072960
Pulled By: sagar0
fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
Summary:
In the past, both `DBImpl::atomic_flush_` and
`DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set
`immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is
set correctly. This does not lead to incorrect behavior, but is a duplicate of
information.
Since `immutable_db_options_` is always there and has `atomic_flush`, we should
use it as source of truth and remove `DBImpl::atomic_flush_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631
Differential Revision: D12928371
Pulled By: riversand963
fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`
The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174
Differential Revision: D8965914
Pulled By: fgwu
fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
Summary:
Added DataBlockIndexType option in BlockBasedTableOptions.
```
enum DataBlockIndexType : char {
kDataBlockBinarySearch = 0, // traditional block type
kDataBlockHashIndex = 1, // additional hash index appended to the end.
};
```
The default type is the traditional binary seek option: `kDataBlockBinarySearch`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150
Differential Revision: D8895958
Pulled By: fgwu
fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10
Summary:
…ression
For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts
Closes https://github.com/facebook/rocksdb/pull/3985
Reviewed By: riversand963
Differential Revision: D8385911
Pulled By: zhichao-cao
fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
Closes https://github.com/facebook/rocksdb/pull/3899
Differential Revision: D8157459
Pulled By: miasantreble
fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601
Differential Revision: D7253114
Pulled By: miasantreble
fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
Summary:
Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests.
This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
Closes https://github.com/facebook/rocksdb/pull/3824
Differential Revision: D7909153
Pulled By: maysamyabandeh
fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce
Summary:
In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:
```
db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
```
We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
Closes https://github.com/facebook/rocksdb/pull/3700
Differential Revision: D7572596
Pulled By: ajkr
fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102
Differential Revision: D6951697
Pulled By: ajkr
fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
Summary:
Level Compaction with TTL.
As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
This functionality can be controlled by the newly introduced column family option -- ttl.
TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591
Differential Revision: D7275442
Pulled By: sagar0
fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
Summary:
enable_pipelined_write was not set in BuildDBOptions() causing its default
value to be dumped in the OPTIONS file
Closes https://github.com/facebook/rocksdb/pull/3585
Differential Revision: D7226395
Pulled By: yiwu-arbug
fbshipit-source-id: 45a659a48d18103ac9ee74bb8805dd0a6ec12474