Summary:
Add `ReserveThreads` and `ReleaseThreads` functions in thread pool to support reservation in for a specific thread pool. With this feature, a thread will be blocked if the number of waiting threads (noted by `num_waiting_threads_`) equals the number of reserved threads (noted by `reserved_threads_`), normally `reserved_threads_` is upper bounded by `num_waiting_threads_`; in rare cases (e.g. `SetBackgroundThreadsInternal` is called when some threads are already reserved), `num_waiting_threads_` can be less than `reserved_threads`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10278
Test Plan: Add `ReserveThreads` unit test in `env_test`. Update the unit test `SimpleColumnFamilyInfoTest` in `thread_list_test` with adding `ReserveThreads` related assertions.
Reviewed By: hx235
Differential Revision: D37640946
Pulled By: littlepig2013
fbshipit-source-id: 4d691f6b9a433569f96ab52d52c3defe5b065367
Summary:
With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251
Test Plan: Current unit tests should pass.
Reviewed By: pdillinger
Differential Revision: D37427667
Pulled By: gitbw95
fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433
Summary:
folly DistributedMutex is faster than standard mutexes though
imposes some static obligations on usage. See
https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h
for details. Here we use this alternative for our Cache implementations
(especially LRUCache) for better locking performance, when RocksDB is
compiled with folly.
Also added information about which distributed mutex implementation is
being used to cache_bench output and to DB LOG.
Intended follow-up:
* Use DMutex in more places, perhaps improving API to support non-scoped
locking
* Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently)
Credit: Thanks Siying for reminding me about this line of work that was previously
left unfinished.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179
Test Plan:
for correctness, existing tests. CircleCI config updated.
Also Meta-internal buck build updated.
For performance, ran simultaneous before & after cache_bench. Out of three
comparison runs, the middle improvement to ops/sec was +21%:
Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode
compiler)
```
Complete in 20.201 s; Rough parallel ops/sec = 1584062
Thread ops/sec = 107176
Operation latency (ns):
Count: 32000000 Average: 9257.9421 StdDev: 122412.04
Min: 134 Median: 3623.0493 Max: 56918500
Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63
```
New: (add USE_FOLLY=1)
```
Complete in 16.674 s; Rough parallel ops/sec = 1919135 (+21%)
Thread ops/sec = 135487
Operation latency (ns):
Count: 32000000 Average: 7304.9294 StdDev: 108530.28
Min: 132 Median: 3777.6012 Max: 91030902
Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83
```
Reviewed By: anand1976
Differential Revision: D37182983
Pulled By: pdillinger
fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).
This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).
Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB
Reviewed By: riversand963
Differential Revision: D37212647
Pulled By: pdillinger
fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
Summary:
Fix existing usage of non-ASCII and add a check to prevent
future use. Added `-n` option to greps to provide line numbers.
Alternative to https://github.com/facebook/rocksdb/issues/10147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164
Test Plan:
used new checker to find & fix cases, manually check
db_bench output is preserved
Reviewed By: akankshamahajan15
Differential Revision: D37148792
Pulled By: pdillinger
fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7
Summary:
auto_prefix_mode is designed to use prefix filtering in a
particular "safe" set of cases where the upper bound and the seek key
have different prefixes: where the upper bound is the "same length
immediate successor". These conditions are not sufficient to guarantee
the same iteration results as total_order_seek if the DB contains
"short" keys, less than the "full" (maximum) prefix length.
We are not simply disabling the optimization in these successor cases
because it is likely that users are essentially getting what they want
out of existing usage. Especially if users are constructing successor
bounds with the intention of doing a prefix-bounded seek, the existing
behavior is more expected than the total_order_seek behavior.
Consequently, for now we reconcile the bad specification of behavior by
documenting the existing mismatch with total_order_seek.
A closely related issue affects hypothetical comparators like
ReverseBytewiseComparator: if they "correctly" implement
IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more
entries (other than "short" keys noted above). Luckily, the built-in
ReverseBytewiseComparator has an "incorrect" implementation of
IsSameLengthImmediateSuccessor that effectively prevents prefix
optimization and, thus, the bug. This is now documented as a new
constraint on IsSameLengthImmediateSuccessor, and the implementation
tweaked to be simply "safe" rather than "incorrect".
This change also includes unit test updates to demonstrate the above
issues. (Test was cleaned up for readability and simplicity.)
Intended follow-up:
* Tweak documented axioms for prefix_extractor (more details then)
* Consider some sort of fix for this case. I don't know what that would
look like without breaking the performance of existing code. Perhaps
if all keys in an SST file have prefixes that are "full length," we can track
that fact and use it to allow optimization with the "same length
immediate successor", but that would only apply to new files.
* Consider a better system of specifying prefix bounds
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144
Test Plan: test updates included
Reviewed By: siying
Differential Revision: D37052710
Pulled By: pdillinger
fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db
Summary:
The patch adds some low-level logic that can be used to serialize/deserialize
a sorted vector of wide columns to/from a simple binary searchable string
representation. Currently, there is no user-facing API; this will be implemented in
subsequent stages.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9915
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D35978076
Pulled By: ltamasi
fbshipit-source-id: 33f5f6628ec3bcd8c8beab363b1978ac047a8788
Summary:
There are currently some preprocessor checks that assume support for Visual Studio versions older than 2015 (i.e., 0 < _MSC_VER < 1900), although we don't support them any more.
We removed all code that only compiles on those older versions, except third-party/ files.
The ROCKSDB_NOEXCEPT symbol is now obsolete, since it now always gets replaced by noexcept. We removed it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10065
Reviewed By: pdillinger
Differential Revision: D36721901
Pulled By: guidotag
fbshipit-source-id: a2892d365ef53cce44a0a7d90dd6b72ee9b5e5f2
Summary:
There are some time-related POSIX APIs that are not available on Windows
(e.g. `localtime_r`), which we have worked around by providing our own
implementations in `port/sys_time.h`. This workaround actually relies on
some ambiguity: on Windows, a call to `localtime_r` calls
`ROCKSDB_NAMESPACE::port::localtime_r` (which is pulled into
`ROCKSDB_NAMESPACE` by a using-declaration), while on other platforms
it calls the global `localtime_r`. This works fine as long as there is only one
candidate function; however, it breaks down when there is more than one
`localtime_r` visible in a scope.
The patch fixes this by introducing `ROCKSDB_NAMESPACE::port::{TimeVal, GetTimeOfDay, LocalTimeR}`
to eliminate any ambiguity.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10045
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D36639372
Pulled By: ltamasi
fbshipit-source-id: fc13dbfa421b7c8918111a6d9e24ce77e91a7c50
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
- Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
- Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
- backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
- restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
# Restore
#./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```
Reviewed By: hx235
Differential Revision: D36327418
Pulled By: cbi42
fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
Summary:
The build failed due to different namespaces for coroutines (std::experimental vs std) based on compiler version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10041
Reviewed By: ltamasi
Differential Revision: D36617212
Pulled By: anand1976
fbshipit-source-id: dfb25320788d32969317d5651173059e2cbd8bd5
Summary:
This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code.
A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest.
TODO:
1. Figure out how to build it in CircleCI (requires some dependencies to be installed)
2. Do some stress testing with coroutines enabled
No regression in synchronous MultiGet between this branch and main -
```
./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics
```
Branch - ```multireadrandom : 4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)```
Main - ```multireadrandom : 3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)```
More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file.
1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) -
No coroutines - ```multireadrandom : 831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations; 0.6 MB/s (72136 of 72136 found)```
Using coroutines - ```multireadrandom : 318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations; 1.6 MB/s (188248 of 188248 found)```
2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file -
No coroutines - ```multireadrandom : 4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations; 125.7 MB/s (14539384 of 14539384 found)```
Using coroutines - ```multireadrandom : 4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations; 109.4 MB/s (12656176 of 12656176 found)```
3. Single thread CPU bound workload with ~2 key overlap/file -
No coroutines - ```multireadrandom : 3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations; 139.6 MB/s (16140024 of 16140024 found)```
Using coroutines - ```multireadrandom : 4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations; 125.1 MB/s (14472296 of 14472296 found)```
4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file -
No coroutines - ```multireadrandom : 4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ```
Using coroutines - ```multireadrandom : 4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968
Reviewed By: akankshamahajan15
Differential Revision: D36348563
Pulled By: anand1976
fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696
Summary:
ROCKSDB_SUPPORT_THREAD_LOCAL definition has been removed.
`__thread`(#define) has been replaced with `thread_local`(C++ keyword) across the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10015
Reviewed By: siying
Differential Revision: D36485491
Pulled By: pdillinger
fbshipit-source-id: 6522d212514ee190b90b4e2750c80c7e34013c78
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
From the RocksDB perspective, there can be two kinds of rate limiters, the internal (native) one and the external one.
- The internal (native) rate limiter is introduced in [the wiki](https://github.com/facebook/rocksdb/wiki/Rate-Limiter). Currently, only IO_LOW and IO_HIGH are used and they are set statically.
- For the external rate limiter, in FSWritableFile functions, IOOptions is open for end users to set and get rate_limiter_priority for their own rate limiter. Currently, RocksDB doesn’t pass the rate_limiter_priority through IOOptions to the file system.
### Solution
During the User Read, Flush write, Compaction read/write, the WriteController is used to determine whether DB writes are stalled or slowed down. The rate limiter priority (Env::IOPriority) can be determined accordingly. We decided to always pass the priority in IOOptions. What the file system does with it should be a contract between the user and the file system. We would like to set the rate limiter priority at file level, since the Flush/Compaction job level may be too coarse with multiple files and block IO level is too granular.
**This PR is for the Write path.** The **Write:** dynamic priority for different state are listed as follows:
| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
| Flush | IO_HIGH | IO_USER | IO_USER |
| Compaction | IO_LOW | IO_USER | IO_USER |
Flush and Compaction writes share the same call path through BlockBaseTableWriter, WritableFileWriter, and FSWritableFile. When a new FSWritableFile object is created, its io_priority_ can be set dynamically based on the state of the WriteController. In WritableFileWriter, before the call sites of FSWritableFile functions, WritableFileWriter::DecideRateLimiterPriority() determines the rate_limiter_priority. The options (IOOptions) argument of FSWritableFile functions will be updated with the rate_limiter_priority.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9988
Test Plan: Add unit tests.
Reviewed By: anand1976
Differential Revision: D36395159
Pulled By: gitbw95
fbshipit-source-id: a7c82fc29759139a1a07ec46c37dbf7e753474cf
Summary:
**Context:**
Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.
Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.
**Summary:**
- Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that
- Added missing db bench/stress test for some memory charging features
- Renamed related test suite to indicate they are under the same theme of memory charging
- Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
- Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926
Test Plan:
- New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
- New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
- CI
- db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**
- db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal
Reviewed By: ajkr
Differential Revision: D36054712
Pulled By: hx235
fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
Summary:
Changed the static objects that had non-trivial destructors to use the STATIC_AVOID_DESTRUCTION construct.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9958
Reviewed By: pdillinger
Differential Revision: D36442982
Pulled By: mrambacher
fbshipit-source-id: 029d47b1374d30d198bfede369a4c0ae7a4eb519
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
When MultiGet() determines that multiple query keys can be
served by examining the same data block in block cache (one Lookup()),
each PinnableSlice referring to data in that data block needs to hold
on to the block in cache so that they can be released at arbitrary
times by the API user. Historically this is accomplished with extra
calls to Ref() on the Handle from Lookup(), with each PinnableSlice
cleanup calling Release() on the Handle, but this creates extra
contention on the block cache for the extra Ref()s and Release()es,
especially because they hit the same cache shard repeatedly.
In the case of merge operands (possibly more cases?), the problem was
compounded by doing an extra Ref()+eventual Release() for each merge
operand for a key reusing a block (which could be the same key!), rather
than one Ref() per key. (Note: the non-shared case with `biter` was
already one per key.)
This change optimizes MultiGet not to rely on these extra, contentious
Ref()+Release() calls by instead, in the shared block case, wrapping
the cache Release() cleanup in a refcounted object referenced by the
PinnableSlices, such that after the last wrapped reference is released,
the cache entry is Release()ed. Relaxed atomic refcounts should be
much faster than mutex-guarded Ref() and Release(), and much less prone
to a performance cliff when MultiGet() does a lot of block sharing.
Note that I did not use std::shared_ptr, because that would require an
extra indirection object (shared_ptr itself new/delete) in order to
associate a ref increment/decrement with a Cleanable cleanup entry. (If
I assumed it was the size of two pointers, I could do some hackery to
make it work without the extra indirection, but that's too fragile.)
Some details:
* Fixed (removed) extra block cache tracing entries in cases of cache
entry reuse in MultiGet, but it's likely that in some other cases traces
are missing (XXX comment inserted)
* Moved existing implementations for cleanable.h from iterator.cc to
new cleanable.cc
* Improved API comments on Cleanable
* Added a public SharedCleanablePtr class to cleanable.h in case others
could benefit from the same pattern (potentially many Cleanables and/or
smart pointers referencing a shared Cleanable)
* Add a typedef for MultiGetContext::Mask
* Some variable renaming for clarity
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899
Test Plan:
Added unit tests for SharedCleanablePtr.
Greatly enhanced ability of existing tests to detect cache use-after-free.
* Release PinnableSlices from MultiGet as they are read rather than in
bulk (in db_test_util wrapper).
* In ASAN build, default to using a trivially small LRUCache for block_cache
so that entries are immediately erased when unreferenced. (Updated two
tests that depend on caching.) New ASAN testsuite running time seems
OK to me.
If I introduce a bug into my implementation where we skip the shared
cleanups on block reuse, ASAN detects the bug in
`db_basic_test *MultiGet*`. If I remove either of the above testing
enhancements, the bug is not detected.
Consider for follow-up work: manipulate or randomize ordering of
PinnableSlice use and release from MultiGet db_test_util wrapper. But in
typical cases, natural ordering gives pretty good functional coverage.
Performance test:
In the extreme (but possible) case of MultiGetting the same or adjacent keys
in a batch, throughput can improve by an order of magnitude.
`./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200`
Before ops/sec, num=5: 1,384,394
Before ops/sec, num=500: 6,423,720
After ops/sec, num=500: 10,658,794
After ops/sec, num=5: 16,027,257
Also note that previously, with high parallelism, having query keys
concentrated in a single block was worse than spreading them out a bit. Now
concentrated in a single block is faster than spread out, which is hopefully
consistent with natural expectation.
Random query performance: with num=1000000, over 999 x 10s runs running before & after simultaneously (each -threads=12):
Before: multireadrandom [AVG 999 runs] : 1088699 (± 7344) ops/sec; 120.4 (± 0.8 ) MB/sec
After: multireadrandom [AVG 999 runs] : 1090402 (± 7230) ops/sec; 120.6 (± 0.8 ) MB/sec
Possibly better, possibly in the noise.
Reviewed By: anand1976
Differential Revision: D35907003
Pulled By: pdillinger
fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e
Summary:
The minimum libzstd version that has `ZSTD_compressStream2` is
1.4.0 so only define ZSTD_STREAMING in that case.
Fixes building on Ubuntu 18.04 which has libzstd 1.3.3 as its
repository version.
Fixes https://github.com/facebook/rocksdb/issues/9795
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9841
Test Plan:
Build and test on Ubuntu 18.04 with:
apt-get install libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev \
libzstd-dev libgflags-dev g++ make curl
Reviewed By: ajkr
Differential Revision: D35648738
Pulled By: jay-zhuang
fbshipit-source-id: 2a9e969bcc17a7dc10172f3817283409de885811
Summary:
Especially after updating to C++17, I don't see a compelling case for
*requiring* any folly components in RocksDB. I was able to purge the existing
hard dependencies, and it can be quite difficult to strip out non-trivial components
from folly for use in RocksDB. (The prospect of doing that on F14 has changed
my mind on the best approach here.)
But this change creates an optional integration where we can plug in
components from folly at compile time, starting here with F14FastMap to replace
std::unordered_map when possible (probably no public APIs for example). I have
replaced the biggest CPU users of std::unordered_map with compile-time
pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
that is in the Makefile for public CI testing. A full folly build is not needed, but
checking out the full folly repo is much simpler for getting the dependency,
and anything else we might want to optionally integrate in the future.
Some picky details:
* I don't think the distributed mutex stuff is actually used, so it was easy to remove.
* I implemented an alternative to `folly::constexpr_log2` (which is much easier
in C++17 than C++11) so that I could pull out the hard dependencies on
`ConstexprMath.h`
* I had to add noexcept move constructors/operators to some types to make
F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
macro to make that easier in some common cases.
* Updated Meta-internal buck build to use folly F14Map (always)
No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
production integration for open source users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
Test Plan:
CircleCI tests updated so that a couple of them use folly.
Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
(Note: they should probably use buck but they currently use Makefile.)
Example performance improvement: when filter partitions are pinned in cache,
they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
a test that exercises that heavily. Build DB with
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
```
and test with (simultaneous runs with & without folly, ~20 times each to see
convergence)
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
```
Average ops/s no folly: 26229.2
Average ops/s with folly: 26853.3 (+2.4%)
Reviewed By: ajkr
Differential Revision: D34181736
Pulled By: pdillinger
fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
Summary:
Added a Plugin class to the ObjectRegistry. Enabled compile-time and program-time addition of plugins to the Registry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7949
Reviewed By: mrambacher
Differential Revision: D33517674
Pulled By: pdillinger
fbshipit-source-id: c3e3270aab76a489bfa9e85d78cdfca951912557
Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. In part because of
modular code factoring, it doesn't seem easy to do this automagically,
where opening an SST file and observing current Temperature different
from expected would trigger a change in metadata and DB manifest write
(essentially giving the deep read path access to the write path). It is also
difficult to do this while the DB is open because of the limitations of
LogAndApply.
This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).
Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)
Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is used and want existing DB metadata updated
for it.
* It's possible that for some repair scenarios, lighter weight than
full repair, we might want to support UpdateManifestForFilesState() to
modify critical file details like size or checksum using same
algorithm. But let's make sure these are differentiated from modifying
file details in ways that don't suspect corruption (or require extreme
trust).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683
Test Plan: unit tests added
Reviewed By: jay-zhuang
Differential Revision: D34798828
Pulled By: pdillinger
fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
Summary:
Timer crash when multiple DB instances doing heavy DB open and close
operations concurrently. Which is caused by adding a timer task with
smaller timestamp than the current running task. Fix it by moving the
getting new task timestamp part within timer mutex protection.
And other fixes:
- Disallow adding duplicated function name to timer
- Fix a minor memory leak in timer when a running task is cancelled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9656
Reviewed By: ajkr
Differential Revision: D34626296
Pulled By: jay-zhuang
fbshipit-source-id: 6b6d96a5149746bf503546244912a9e41a0c5f6b
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9560. Only use popcnt intrinsic when HAVE_SSE42 is set. Also avoid setting it based on compiler test in portable builds because such test will pass on MSVC even without proper arch flags (ref: https://devblogs.microsoft.com/oldnewthing/20201026-00/?p=104397).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9680
Test Plan: verified the combinations of -DPORTABLE and -DFORCE_SSE42 produce expected compiler flags on Linux. Verified MSVC build using PORTABLE=1 (in CircleCI) does not set HAVE_SSE42.
Reviewed By: pdillinger
Differential Revision: D34739033
Pulled By: ajkr
fbshipit-source-id: d10456f3392945fc3e59430a1777840f7b60b276
Summary:
Integrate the streaming compress/uncompress API into WAL compression.
The streaming compression object is stored in the log_writer along with a reusable output buffer to store the compressed buffer(s).
The streaming uncompress object is stored in the log_reader along with a reusable output buffer to store the uncompressed buffer(s).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9642
Test Plan:
Added unit tests to verify different scenarios - large buffers, split compressed buffers, etc.
Future optimizations:
The overhead for small records is quite high, so it makes sense to compress only buffers above a certain threshold and use a separate record type to indicate that those records are compressed.
Reviewed By: anand1976
Differential Revision: D34709167
Pulled By: sidroyc
fbshipit-source-id: a37a3cd1301adff6152fb3fcd23726106af07dd4
Summary:
The plain data length may not be big enough if the compression actually expands data. So use deflateBound() to get the upper limit on the compressed output before deflate().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9572
Reviewed By: riversand963
Differential Revision: D34326475
Pulled By: ajkr
fbshipit-source-id: 4b679cb7a83a62782a127785b4d5eb9aa4646449
Summary:
Implement a streaming compression API (compress/uncompress) to use for WAL compression. The log_writer would use the compress class/API to compress a record before writing it out in chunks. The log_reader would use the uncompress class/API to uncompress the chunks and combine into a single record.
Added unit test to verify the API for different sizes/compression types.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9619
Test Plan: make -j24 check
Reviewed By: anand1976
Differential Revision: D34437346
Pulled By: sidroyc
fbshipit-source-id: b180569ad2ddcf3106380f8758b556cc0ad18382
Summary:
Make FilterPolicy into a Customizable class. Allow new FilterPolicy to be discovered through the ObjectRegistry
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590
Reviewed By: pdillinger
Differential Revision: D34327367
Pulled By: mrambacher
fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190
Summary:
When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs.
Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556
Reviewed By: anand1976
Differential Revision: D34308216
Pulled By: sidroyc
fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
Summary:
Some changes to make it easier to make FilterPolicy
customizable. Especially, create distinct classes for the different
testing-only and user-facing built-in FilterPolicy modes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9567
Test Plan:
tests updated, with no intended difference in functionality
tested. No difference in test performance seen as a result of moving to
string-based filter type configuration.
Reviewed By: mrambacher
Differential Revision: D34234694
Pulled By: pdillinger
fbshipit-source-id: 8a94931a9e04c3bcca863a4f524cfd064aaf0122
Summary:
This fix addresses https://github.com/facebook/rocksdb/issues/9299.
If attempting to create a new object via the ObjectRegistry and a factory is not found, the ObjectRegistry will return a "NotSupported" status. This is the same behavior as previously.
If the factory is found but could not successfully create the object, an "InvalidArgument" status is returned. If the factory returned a reason why (in the errmsg), this message will be in the returned status.
In practice, there are two options in the ConfigOptions that control how these errors are propagated:
- If "ignore_unknown_options=true", then both InvalidArgument and NotSupported status codes will be swallowed internally. Both cases will return success
- If "ignore_unsupported_options=true", then having no factory will return success but a failing factory will return an error
- If both options are false, both cases (no and failing factory) will return errors.
In practice this likely only changes Customizable that may be partially available. For example, the JEMallocMemoryAllocator is a built-in allocator that is registered with the system but may not be compiled in. In this case, the status code for this allocator changed from NotSupported("JEMalloc not available") to InvalidArgumen("JEMalloc not available"). Other Customizable builtins/plugins would have the same semantics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9333
Reviewed By: pdillinger
Differential Revision: D33517681
Pulled By: mrambacher
fbshipit-source-id: 8033052d4a4a7b88c2d9f90147b1b4467e51f6fd
Summary:
* Inefficient block-based filter is no longer customizable in the public
API, though (for now) can still be enabled.
* Removed deprecated FilterPolicy::CreateFilter() and
FilterPolicy::KeyMayMatch()
* Removed `rocksdb_filterpolicy_create()` from C API
* Change meaning of nullptr return from GetBuilderWithContext() from "use
block-based filter" to "generate no filter in this case." This is a
cleaner solution to the proposal in https://github.com/facebook/rocksdb/issues/8250.
* Also, when user specifies bits_per_key < 0.5, we now round this down
to "no filter" because we expect a filter with >= 80% FP rate is
unlikely to be worth the CPU cost of accessing it (esp with
cache_index_and_filter_blocks=1 or partition_filters=1).
* bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP
rate)
* This also gives us some support for configuring filters from OPTIONS
file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`.
Opening from such an options file will enable reading filters (an
improvement) but not writing new ones. (See Customizable follow-up
below.)
* Also removed deprecated functions
* FilterBitsBuilder::CalculateNumEntry()
* FilterPolicy::GetFilterBitsBuilder()
* NewExperimentalRibbonFilterPolicy()
* Remove default implementations of
* FilterBitsBuilder::EstimateEntriesAdded()
* FilterBitsBuilder::ApproximateNumEntries()
* FilterPolicy::GetBuilderWithContext()
* Remove support for "filter_policy=experimental_ribbon" configuration
string.
* Allow "filter_policy=bloomfilter:n" without bool to discourage use of
block-based filter.
Some pieces for https://github.com/facebook/rocksdb/issues/9389
Likely follow-up (later PRs):
* Refactoring toward FilterPolicy Customizable, so that we can generate
filters with same configuration as before when configuring from options
file.
* Remove support for user enabling block-based filter (ignore `bool
use_block_based_builder`)
* Some months after this change, we could even remove read support for
block-based filter, because it is not critical to DB data
preservation.
* Make FilterBitsBuilder::FinishV2 to avoid `using
FilterBitsBuilder::Finish` mess and add support for specifying a
MemoryAllocator (for cache warming)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9501
Test Plan:
A number of obsolete tests deleted and new tests or test
cases added or updated.
Reviewed By: hx235
Differential Revision: D34008011
Pulled By: pdillinger
fbshipit-source-id: a39a720457c354e00d5b59166b686f7f59e392aa
Summary:
... seen only in internal clang-analyze runs after https://github.com/facebook/rocksdb/issues/9481
* Mostly, this works around falsely reported leaks by using
std::unique_ptr in some places where clang-analyze was getting
confused. (I didn't see any changes in C++17 that could make our Status
implementation leak memory.)
* Also fixed SetBGError returning address of a stack variable.
* Also fixed another false null deref report by adding an assert.
Also, use SKIP_LINK=1 to speed up `make analyze`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9515
Test Plan:
Was able to reproduce the reported errors locally and verify
they're fixed (except SetBGError). Otherwise, existing tests
Reviewed By: hx235
Differential Revision: D34054630
Pulled By: pdillinger
fbshipit-source-id: 38600ef3da75ddca307dff96b7a1a523c2885c2e
Summary:
Drop support for some old compilers by requiring C++17 standard
(or higher). See https://github.com/facebook/rocksdb/issues/9388
First modification based on this is to remove some conditional compilation in slice.h (also
better for ODR)
Also in this PR:
* Fix some Makefile formatting that seems to affect ASSERT_STATUS_CHECKED config in
some cases
* Add c_test to NON_PARALLEL_TEST in Makefile
* Fix a clang-analyze reported "potential leak" in lru_cache_test
* Better "compatibility" definition of DEFINE_uint32 for old versions of gflags
* Fix a linking problem with shared libraries in Makefile (`./random_test: error while loading shared libraries: librocksdb.so.6.29: cannot open shared object file: No such file or directory`)
* Always set ROCKSDB_SUPPORT_THREAD_LOCAL and use thread_local (from C++11)
* TODO in later PR: clean up that obsolete flag
* Fix a cosmetic typo in c.h (https://github.com/facebook/rocksdb/issues/9488)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9481
Test Plan:
CircleCI config substantially updated.
* Upgrade to latest Ubuntu images for each release
* Generally prefer Ubuntu 20, but keep a couple Ubuntu 16 builds with oldest supported
compilers, to ensure compatibility
* Remove .circleci/cat_ignore_eagain except for Ubuntu 16 builds, because this is to work
around a kernel bug that should not affect anything but Ubuntu 16.
* Remove designated gcc-9 build, because the default linux build now uses GCC 9 from
Ubuntu 20.
* Add some `apt-key add` to fix some apt "couldn't be verified" errors
* Generally drop SKIP_LINK=1; work-around no longer needed
* Generally `add-apt-repository` before `apt-get update` as manual testing indicated the
reverse might not work.
Travis:
* Use gcc-7 by default (remove specific gcc-7 and gcc-4.8 builds)
* TODO in later PR: fix s390x "Assembler messages: Error: invalid switch -march=z14" failure
AppVeyor:
* Completely dropped because we are dropping VS2015 support and CircleCI covers
VS >= 2017
Also local testing with old gflags (out of necessity when using ROCKSDB_NO_FBCODE=1).
Reviewed By: mrambacher
Differential Revision: D33946377
Pulled By: pdillinger
fbshipit-source-id: ae077c823905b45370a26c0103ada119459da6c1
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum
This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.
Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.
**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
- See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
- Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342
Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
- For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
- FastLocalBloom
- Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
- After change:
- `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
- Standard128Ribbon
- Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
- After change:
- `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.
Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
Summary:
Disallow `immutable_db_opts.use_direct_io_for_flush_and_compaction == true` and
`mutable_db_opts.writable_file_max_buffer_size == 0`, since it causes `WritableFileWriter::Append()`
to loop forever and does not make much sense in direct IO.
This combination of options itself does not make much sense: asking RocksDB to do direct IO but not allowing
RocksDB to allocate a buffer. We should detect this false combination and warn user early, no matter whether
the application is running on a platform that supports direct IO or not. In the case of platform **not** supporting
direct IO, it's ok if the user learns about this and then finds that direct IO is not supported.
One tricky thing: the constructor of `WritableFileWriter` is being used in our unit tests, and it's impossible
to return status code from constructor. Since we do not throw, I put an assertion for now. Fortunately,
the constructor is not exposed to external applications.
Closing https://github.com/facebook/rocksdb/issues/7109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9348
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D33371924
Pulled By: riversand963
fbshipit-source-id: 2a3701ab541cee23bffda8a36cdf37b2d235edfa
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.
This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)
Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453
Test Plan:
new unit test added. Added memtable Bloom to crash test.
Performance testing
--------------------
Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```
Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)
Before & after run simultaneously, average over 10 iterations per data point, ops/sec.
MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066
MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990
MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before: 12081715
After: 12117603
MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591
Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.
Reviewed By: ajkr
Differential Revision: D33805025
Pulled By: pdillinger
fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
Summary:
1. Removed the options from the Capped/Fixed SliceTransforms. Instead these classes are created with id.number. This allows the GetID() id to be calculated and stored at class construction time. This change puts the construction back to similar to how it was prior to the Customizable changes for SliceTransform.
2. Improve the performance of AsString by using the ID only if there are no option properties (which is the case for all of the builtin transforms).
Ran tests of calling AsString in a loop 5M times and found approximately a 10x performance increase vs the original code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9401
Reviewed By: pdillinger
Differential Revision: D33668672
Pulled By: mrambacher
fbshipit-source-id: d0075912c6ece8ed754ee543bc6b0b49a169b309
Summary:
Regexes are considered potentially problematic for use in
registering RocksDB extensions, so we are removing
ObjectLibrary::Register() and the Regex public API it depended on (now
unused).
In reference to https://github.com/facebook/rocksdb/issues/9389
Why?
* The power of Regexes can make it hard to reason about which extension
will match what. (The replacement API isn't perfect, but we are at least
"holding the line" on patterns we have seen in practice.)
* It is easy to make regexes that don't quite mean what you think they
mean, such as forgetting that the `.` in `foo.bar` can match any character
or that matching is nondeterministic, as in `a🅱️42` matching `.*:[0-9]+`.
* Some regexes and implementations can have disastrously bad
performance. This might not be much practical concern for ObjectLibray
here, but we don't want to encourage potentially dangerous further use
in production code. (Testing code is fine. See TestRegex.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9439
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33792342
Pulled By: pdillinger
fbshipit-source-id: 4f64dcb04764e639162c8977a5fa196f67754cec
Summary:
Fixes a major performance regression in 6.26, where
extra CPU is spent in SliceTransform::AsString when reads involve
a prefix_extractor (Get, MultiGet, Seek). Common case performance
is now better than 6.25.
This change creates a "fast path" for verifying that the current prefix
extractor is unchanged and compatible with what was used to
generate a table file. This fast path detects the common case by
pointer comparison on the current prefix_extractor and a "known
good" prefix extractor (if applicable) that is saved at the time the
table reader is opened. The "known good" prefix extractor is saved
as another shared_ptr copy (in an existing field, however) to ensure
the pointer is not recycled.
When the prefix_extractor has changed to a different instance but
same compatible configuration (rare, odd), performance is still a
regression compared to 6.25, but this is likely acceptable because
of the oddity of such a case. The performance of incompatible
prefix_extractor is essentially unchanged.
Also fixed a minor case (ForwardIterator) where a prefix_extractor
could be used via a raw pointer after being freed as a shared_ptr,
if replaced via SetOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407
Test Plan:
## Performance
Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load)
v6.26: 4833 vs. 6698 (<- major regression!)
v6.27: 4737 vs. 6397 (still)
New: 6704 vs. 6461 (better than baseline in common case)
Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible)
Changed prefix size (no usable filter) in new: 787 vs. 5927
Changed prefix size (no usable filter) in new & baseline: 773 vs. 784
Reviewed By: mrambacher
Differential Revision: D33677812
Pulled By: pdillinger
fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369
Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all
Reviewed By: pdillinger
Differential Revision: D33517260
Pulled By: riversand963
fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions. The PatternEntry methods were left in place but renamed to AddFactory. The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.
Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362
Reviewed By: pdillinger
Differential Revision: D33432562
Pulled By: mrambacher
fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, as part of https://github.com/facebook/rocksdb/pull/9342
**Context:**
https://github.com/facebook/rocksdb/pull/9073 charged the hash entries' memory in block cache with `CacheReservationHandle`. However, in the edge case where Ribbon Filter falls back to Bloom Filter and swaps its hash entries to the embedded bloom filter object, the handles associated with those entries are not swapped and thus not released as soon as those entries are cleared during Bloom Filter's finish process.
Although this is a minor issue since RocksDB internal calls `FilterBitsBuilder->Reset()` right after `FilterBitsBuilder->Finish()` on the main path, which releases all the cache reservation related to both the Ribbon Filter and its embedded Bloom Filter, it still worths this fix to avoid confusion.
**Summary:**
- Swapped the `CacheReservationHandle` associated with the hash entries on Ribbon Filter's fallback
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9345
Test Plan: - Added a unit test to verify the number of cache reservation after clearing hash entries, which failed before the change and now succeeds
Reviewed By: pdillinger
Differential Revision: D33377225
Pulled By: hx235
fbshipit-source-id: 7487f4c40dfb6ee7928232021f93ef2c5329cffa
Summary:
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches.
Added tests for the new derivative classes and got all unit tests to pass.
Resolves https://github.com/facebook/rocksdb/issues/9225.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264
Reviewed By: pdillinger
Differential Revision: D33062001
Pulled By: mrambacher
fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03
Summary:
in hope to get rockdb compiled with GCC-11 without warning
* util/bloom_test: init a variable before using it
to silence the GCC warning like
```
util/bloom_test.cc:1253:31: error: ‘<anonymous>’ may be used uninitialized [-Werror=maybe-uninitialized]
1253 | Slice key_slice{key_bytes, 8};
| ^
...
include/rocksdb/slice.h:41:3: note: by argument 2 of type ‘const char*’ to ‘rocksdb::Slice::Slice(const char*, size_t)’ declared here
41 | Slice(const char* d, size_t n) : data_(d), size_(n) {}
| ^~~~~
util/bloom_test.cc:1249:3: note: ‘<anonymous>’ declared here
1249 | };
| ^
cc1plus: all warnings being treated as errors
```
* cmake: add find_package(uring ...)
find liburing in a more consistent way. also it is the encouraged way for finding a library.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9286
Reviewed By: mrambacher
Differential Revision: D33165241
Pulled By: jay-zhuang
fbshipit-source-id: 9f3487e11b4e40fd8f1c97c8facb24a190e5ce31
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
Summary:
I'm working on a new format_version=6 to support context
checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
updates to support that change.
Test coverage data and manual inspection agree on dead code in
block_based_table_reader.cc (removed).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240
Test Plan:
tests enhanced to cover more cases etc.
Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}
(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Before w/ wal: 454512
After w/ wal: 444820 (-2.1%)
Before w/o wal: 1004560
After w/o wal: 998897 (-0.6%)
Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.
This regression will be corrected in a follow-up PR.
Reviewed By: ajkr
Differential Revision: D32813769
Pulled By: pdillinger
fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
Summary:
Add a new API in listener.h that notifies about IOErrors on
Read/Write/Append/Flush etc. The API reports about IOStatus, filename, Operation
name, offset and length.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9177
Test Plan: Added new unit tests
Reviewed By: anand1976
Differential Revision: D32470627
Pulled By: akankshamahajan15
fbshipit-source-id: 189a717033590ae227b3beae8b1e7e185e4cdc12
Summary:
Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.
**Context:**
Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`.
The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option.
**Summary:**
- Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction:
- hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance),
- banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`),
- final filter (i.e, `mutable_buf`'s size).
- Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table.
- Added option fo run `filter_bench` with this memory reservation feature
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073
Test Plan:
- Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory`
- To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally
- Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter
- Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below:
- FastLocalBloom
- baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`:
- **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0)
- Ribbon128
- baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `:
- **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0)
- And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console.
Reviewed By: pdillinger
Differential Revision: D31991348
Pulled By: hx235
fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e
Summary:
`pthread_setname_np()` fails on attempts to assign oversized names like
"rocksdb:bottom10", which resulted in some thread name updates being
lost. We do not need the ID suffix so I removed it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9165
Test Plan:
```
$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -max_background_flushes=123 -max_background_compactions=456 -num_bottom_pri_threads=789 -duration=60
```
While above is running:
```
$ ps -o 'comm' -Lp `pidof db_bench` | grep '^rocksdb:' | sort | uniq -c
789 rocksdb:bottom
123 rocksdb:high
456 rocksdb:low
```
Reviewed By: pdillinger
Differential Revision: D32415077
Pulled By: ajkr
fbshipit-source-id: a0e013101e26a78bc5eca73509293ef4bf22254f
Summary:
**Context:**
Some existing internal calls of `GenericRateLimiter::Request()` in backupable_db.cc and newly added internal calls in https://github.com/facebook/rocksdb/pull/8722/ do not make sure `bytes <= GetSingleBurstBytes()` as required by rate_limiter https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L47.
**Impacts of this bug include:**
(1) In debug build, when `GenericRateLimiter::Request()` requests bytes greater than `GenericRateLimiter:: kMinRefillBytesPerPeriod = 100` byte, process will crash due to assertion failure. See https://github.com/facebook/rocksdb/pull/9063#discussion_r737034133 and for possible scenario
(2) In production build, although there will not be the above crash due to disabled assertion, the bug can lead to a request of small bytes being blocked for a long time by a request of same priority with insanely large bytes from a different thread. See updated https://github.com/facebook/rocksdb/wiki/Rate-Limiter ("Notice that although....the maximum bytes that can be granted in a single request have to be bounded...") for more info.
There is an on-going effort to move rate-limiting to file wrapper level so rate limiting in `BackupEngine` and this PR might be made obsolete in the future.
**Summary:**
- Implemented loop-calling `GenericRateLimiter::Request()` with `bytes <= GetSingleBurstBytes()` as a static private helper function `BackupEngineImpl::LoopRateLimitRequestHelper`
-- Considering make this a util function in `RateLimiter` later or do something with `RateLimiter::RequestToken()`
- Replaced buggy internal callers with this helper function wherever requested byte is not pre-limited by `GetSingleBurstBytes()`
- Removed the minimum refill bytes per period enforced by `GenericRateLimiter` since it is useless and prevents testing `GenericRateLimiter` for extreme case with small refill bytes per period.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9063
Test Plan:
- Added a new test that failed the assertion before this change and now passes
- It exposed bugs in [the write during creation in `CopyOrCreateFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2034-L2043)), [the read of table properties in `GetFileDbIdentities()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2372-L2378)), [some read of metadata in `BackupMeta::LoadFromFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2726))
- Passing Existing tests
Reviewed By: ajkr
Differential Revision: D31824535
Pulled By: hx235
fbshipit-source-id: d2b3dea7a64e2a4b1e6a59fca322f0800a4fcbcc
Summary:
Context:
Surprisingly, there isn't any sanitization against negative `int64_t bytes` in `GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri, Statistics* stats)`. A negative `bytes` can be passed in and incorrectly increases `available_bytes_` by subtracting the negative `bytes` from `available_bytes_`, such as [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L138) and [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L283), which are incorrect behaviors.
- Sanitized negative request bytes by rounding it up to 0
- Added notes to public and internal API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9112
Test Plan: - Rely on existing tests
Reviewed By: ajkr
Differential Revision: D32085364
Pulled By: hx235
fbshipit-source-id: b1b6066b2dd5ffc7bcbfb07069ca65a33578251b
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.
There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?
1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image
Not sure if there is more required for travis. Happy to help in any way I can.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962
Reviewed By: mrambacher
Differential Revision: D31802198
Pulled By: pdillinger
fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
Summary:
Adds changes to DBOptions (comparable to ColumnFamilyOptions) to allow some option values to be ignored on rehydration from the Options file. This is necessary for some customizable classes that were not registered with the ObjectRegistry but are saved/restored from the Options file.
All tests pass. Will run check_format_compatible.sh shortly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9045
Reviewed By: zhichao-cao
Differential Revision: D31761664
Pulled By: mrambacher
fbshipit-source-id: 300c2251639cce2b223481c3bb2a63877b1f3766
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).
Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990
Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D31582865
Pulled By: pdillinger
fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
Summary:
There were three implementations of VectorIterator (util/vector_iterator, test_util/testutil.h and LoggingForwardVectorIterator). Merged them into one class to increase code coverage/testing and reduce duplication.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8901
Reviewed By: pdillinger
Differential Revision: D31022673
Pulled By: mrambacher
fbshipit-source-id: 8e3acbd2dfd60b4df609d02cc72846de2389d531
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
Summary:
Made SliceTransform into a Customizable class.
Would be nice to write a test that stored and used a custom transform in an SST table.
There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter. Is this expected?
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641
Reviewed By: zhichao-cao
Differential Revision: D31142793
Pulled By: mrambacher
fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1
Summary:
Context:
After more discussion, a fix in https://github.com/facebook/rocksdb/issues/8938 might turn out to be too restrictive for the case where `GetTotalPendingRequests` might be invoked on RateLimiter classes that does not support the recently added API `RateLimiter::GetTotalPendingRequests` (https://github.com/facebook/rocksdb/issues/8890) due to the `assert(false)` in https://github.com/facebook/rocksdb/issues/8938. Furthermore, sentinel value like `-1` proposed in https://github.com/facebook/rocksdb/issues/8938 is easy to be ignored and unchecked. Therefore we decided to adopt `Status::NotSupported()`, which is also a convention of adding new API to public header in RocksDB.
- Changed return value type of `RateLimiter::GetTotalPendingRequests` in related declaration/definition
- Passed in pointer argument to hold the output instead of returning it as before
- Adapted to the changes above in calling `RateLimiter::GetTotalPendingRequests` in test
- Minor improvement to `TEST_F(RateLimiterTest, GetTotalPendingRequests)`: added failure message for assertion and replaced repetitive statements with a loop
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8950
Reviewed By: ajkr, pdillinger
Differential Revision: D31128450
Pulled By: hx235
fbshipit-source-id: 282ac9c4f3dacaa0aec6d0a993161f77ad47a040
Summary:
Made SystemClock into a Customizable class, complete with CreateFromString.
Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv).
Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636
Reviewed By: zhichao-cao
Differential Revision: D30483360
Pulled By: mrambacher
fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1
Summary:
- Fixed a bug in `RateLimiterTest.GeneratePriorityIterationOrder` that the callbacks in this test were not called starting from `i = 1`. Fix by increasing `rate_bytes_per_sec` and requested bytes.
- The bug is due to the previous `rate_bytes_per_sec` was set too small, resulting in `refill_bytes_per_period` less than `kMinRefillBytesPerPeriod`. Hence the actual `refill_bytes_per_period` was equal to `kMinRefillBytesPerPeriod` due to the logic [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L302-L303) and it ended up being greater than the previously set requested bytes. Therefore starting from `i = 1`, `RefillBytesAndGrantRequests()` and `GeneratePriorityIterationOrder` won't be called and the test callbacks was not triggered to execute the assertion.
- Added internal flag to assert callbacks are called in `RateLimiterTest.GeneratePriorityIterationOrder` to prevent any future changes defeat the purpose of the test [as suggested](https://github.com/facebook/rocksdb/pull/8890#discussion_r704915134)
- Increased `rate_bytes_per_sec` and bytes of each request in `RateLimiterTest.GetTotalBytesThrough`, `RateLimiterTest.GetTotalRequests`, `RateLimiterTest.GetTotalPendingRequests` to trigger the "long path" of execution (i.e, the one trigger RefillBytesAndGrantRequests()) to increase test coverage
- This increased the running time of the three tests, see test plan for time difference running locally
- Cleared up sync point effects after each test by calling `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearAllCallBacks();` in `~RateLimiterTest()` [as suggested](https://github.com/facebook/rocksdb/pull/8595/files#r697534279)
- It's fine to call these two methods even when `EnableProcessing()` or `SetCallBack()` is not called in the test or is already cleaned up. In those cases, calling these two functions in destructor is effectively no-op.
- This will allow cleaning up sync point effects of previous test even when the previous test failed in assertion.
- Added missing `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearCallBacks(..);` in existing tests for completeness
- Called `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearCallBacks(..);` in loop in `RateLimiterTest.GeneratePriorityIterationOrder` for completeness
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8904
Test Plan:
- Passing existing tests
- To verify the 1st change, run `RateLimiterTest.GeneratePriorityIterationOrder` with assertions of callbacks are indeed called under original `rate_bytes_per_sec` and request byte and under updated `rate_bytes_per_sec` and request byte. The former will fail the assertion while the latter succeeds.
- Here is the increased test time due to the 3rd change mentioned above in the summary. The relevant 3 tests mentioned in total increase the test time by 6s (~6000/33848 = 17.7% of the original total test time), which IMO is acceptable for better test coverage through running the "long path".
- current (run on branch rate_limiter_ut_improve locally)
[ RUN ] RateLimiterTest.GetTotalBytesThrough
[ OK ] RateLimiterTest.GetTotalBytesThrough (3000 ms)
[ RUN ] RateLimiterTest.GetTotalRequests
[ OK ] RateLimiterTest.GetTotalRequests (3001 ms)
[ RUN ] RateLimiterTest.GetTotalPendingRequests
[ OK ] RateLimiterTest.GetTotalPendingRequests (0 ms)
...
[----------] 10 tests from RateLimiterTest (43349 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test case ran. (43349 ms total)
[ PASSED ] 10 tests.
- previous (run on branch main locally)
[ RUN ] RateLimiterTest.GetTotalBytesThrough
[ OK ] RateLimiterTest.GetTotalBytesThrough (0 ms)
[ RUN ] RateLimiterTest.GetTotalRequests
[ OK ] RateLimiterTest.GetTotalRequests (0 ms)
[ RUN ] RateLimiterTest.GetTotalPendingRequests
[ OK ] RateLimiterTest.GetTotalPendingRequests (0 ms)
...
[----------] 10 tests from RateLimiterTest (33848 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test case ran. (33848 ms total)
[ PASSED ] 10 tests.
Reviewed By: ajkr
Differential Revision: D30872544
Pulled By: hx235
fbshipit-source-id: ff894f5c1a4bef70e8e407d53b00be45f776b3e4
Summary:
It's always annoying to find a header does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.
rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)
Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.
This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893
Test Plan: existing tests and resolving issues detected by new check
Reviewed By: mrambacher
Differential Revision: D30823300
Pulled By: pdillinger
fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572
Summary:
Context/Summary:
As users requested, a public API RateLimiter::GetTotalPendingRequests() is added to expose the total number of pending requests for bytes in the rate limiter, which is the size of the request queue of that priority (or of all priorities, if IO_TOTAL is interested) at the time when this API is called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8890
Test Plan:
- Passing added new unit tests
- Passing existing unit tests
Reviewed By: ajkr
Differential Revision: D30815500
Pulled By: hx235
fbshipit-source-id: 2dfa990f651c1c47378b6215c751ad76a5824300
Summary:
* Consolidate use of std::regex for testing to testharness.cc, to
minimize Facebook linters constantly flagging uses in non-production
code.
* Improve syntax and error messages for asserting some string matches a
regex in tests.
* Add a public Regex wrapper class to encapsulate existing usage in
ObjectRegistry.
* Remove unnecessary include <regex>
* Put warnings that use of Regex in production code could cause bad
performance or stack overflow.
Intended follow-up work:
* Replace std::regex with another underlying implementation like RE2
* Improve ObjectRegistry interface in terms of possibly confusing literal
string matching vs. regex and in terms of reporting invalid regex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740
Test Plan:
tests updated, basic unit test for public Regex, and some manual
testing of temporary changes to see example error messages:
utilities/backupable/backupable_db_test.cc:917: Failure
000010_1162373755_138626.blob (child.name)
does not match regex
[0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern)
db/db_basic_test.cc:74: Failure
R3SHSBA8C4U0CIMV2ZB0 (sid3)
does not match regex [0-9A-Z]{20}HAHAHA
Reviewed By: mrambacher
Differential Revision: D30706246
Pulled By: pdillinger
fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd
Summary:
Old typedef syntax is confusing
Most but not all changes with
perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
make format
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751
Test Plan: existing
Reviewed By: zhichao-cao
Differential Revision: D30745277
Pulled By: pdillinger
fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
Test Plan: existing tests
Reviewed By: mrambacher
Differential Revision: D30700039
Pulled By: pdillinger
fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
Summary:
Context:
An extra IO_USER priority in rate limiter allows users to optionally charge WAL writes / SST reads to rate limiter at this priority level, which then has higher priority than IO_HIGH and IO_LOW. With an extra IO_USER priority, it allows users to better specify the relative urgency/importance among different requests in rate limiter. As a consequence, IO resource management can better prioritize and limit resource based on user's need.
The IO_USER is implemented as superior priority in GenericRateLimiter, in the sense that its request queue will always be iterated first without being constrained to fairness. The reason is that the notion of fairness is only meaningful in helping lower priorities in background IO (i.e, IO_HIGH/MID/LOW) to gain some fair chance to run so that it does not block foreground IO (i.e, the ones that are charged at the level of IO_USER). As we can see, the ultimate goal here is to not blocking foreground IO at IO_USER level, which justifies the superiority of IO_USER.
Similar benefits exist for IO_MID priority.
- Rewrote the logic of deciding the order of iterating request queues of high/low priorities to include the extra user/mid priority w/o affecting the existing behavior (see PR's [comment](https://github.com/facebook/rocksdb/pull/8595/files#r678749331))
- Included the request queue of user-pri/mid-pri in the code path of next-leader-candidate signaling and GenericRateLimiter's destructor
- Included the extra user/mid-pri in bookkeeping data structures: total_bytes_through_ and total_requests_
- Re-written the previous impl of explicitly iterating priorities with a loop from Env::IO_LOW to Env::IO_TOTAL
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8595
Test Plan:
- passed existing rate_limiter_test.cc
- passed added unit tests in rate_limiter_test.cc
- run performance test to verify performance with only high/low requests is not affected by this change
- Set-up command:
`TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --duration=5 --compression_type=none --num=100000000 --disable_auto_compactions=true --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1))`
- Test command:
`TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=overwrite --use_existing_db=true --disable_wal=true --duration=30 --compression_type=none --num=100000000 --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1)) --statistics=true --rate_limiter_bytes_per_sec=1048576 --rate_limiter_refill_period_us=1000 --threads=32 |& grep -E '(flush|compact)\.write\.bytes'`
- Before (on branch upstream/master):
`rocksdb.compact.write.bytes COUNT : 4014162`
`rocksdb.flush.write.bytes COUNT : 26715832`
rocksdb.flush.write.bytes/rocksdb.compact.write.bytes ~= 6.66
- After (on branch rate_limiter_user_pri):
`rocksdb.compact.write.bytes COUNT : 3807822`
`rocksdb.flush.write.bytes COUNT : 26098659`
rocksdb.flush.write.bytes/rocksdb.compact.write.bytes ~= 6.85
Reviewed By: ajkr
Differential Revision: D30577783
Pulled By: hx235
fbshipit-source-id: 0881f2705ffd13ecd331256bde7e8ec874a353f4
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.
This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.
Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
* Lots of info from default Env like time, process id, and hostname.
* std::random_device
* port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.
DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.
GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.
Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708
Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D30563780
Pulled By: pdillinger
fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
Summary:
With expected use for a 128-bit hash, xxhash library is
upgraded to current dev (2c611a76f914828bed675f0f342d6c4199ffee1e)
as of Aug 6 so that we can use production version of XXH3_128bits
as new Hash128 function (added in hash128.h).
To make this work, however, we have to carve out the "preview" version
of XXH3 that is used in new SST Bloom and Ribbon filters, since that
will not get maintenance in xxhash releases. I have consolidated all the
relevant code into xxph3.h and made it "inline only" (no .cc file). The
working name for this hash function is changed from XXH3p to XXPH3
(XX Preview Hash) because the latter is easier to get working with no
symbol name conflicts between the headers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8634
Test Plan:
no expected change in existing functionality. For Hash128,
added some unit tests based on those for Hash64 to ensure some basic
properties and that the values do not change accidentally.
Reviewed By: zhichao-cao
Differential Revision: D30173490
Pulled By: pdillinger
fbshipit-source-id: 06aa542a7a28b353bc2c865b9b2f8bdfe44158e4
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.
So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)
I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.
C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.
BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
Test Plan: new + updated tests, including crash test
Reviewed By: jay-zhuang
Differential Revision: D30445797
Pulled By: pdillinger
fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
Summary:
Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even
the same file ingested into different DBs sharing a block cache.
Note: These new cache keys are currently only enabled when FileSystem
does not provide GetUniqueId. For now, they are typically larger,
so slightly less efficient.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669
Test Plan: Extended unit test
Reviewed By: zhichao-cao
Differential Revision: D30398532
Pulled By: pdillinger
fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7
Summary:
`GenericRateLimiter` slow path handles requests that cannot be satisfied
immediately. Such requests enter a queue, and their thread stays in `Request()`
until they are granted or the rate limiter is stopped. These threads are
responsible for unblocking themselves. The work to do so is split into two main
duties.
(1) Waiting for the next refill time.
(2) Refilling the bytes and granting requests.
Prior to this PR, the slow path logic involved a leader election algorithm to
pick one thread to perform (1) followed by (2). It elected the thread whose
request was at the front of the highest priority non-empty queue since that
request was most likely to be granted. This algorithm was efficient in terms of
reducing intermediate wakeups, which is a thread waking up only to resume
waiting after finding its request is not granted. However, the conceptual
complexity of this algorithm was too high. It took me a long time to draw a
timeline to understand how it works for just one edge case yet there were so
many.
This PR drops the leader election to reduce conceptual complexity. Now, the two
duties can be performed by whichever thread acquires the lock first. The risk
of this change is increasing the number of intermediate wakeups, however, we
took steps to mitigate that.
- `wait_until_refill_pending_` flag ensures only one thread performs (1). This\
prevents the thundering herd problem at the next refill time. The remaining\
threads wait on their condition variable with an unbounded duration -- thus we\
must remember to notify them to ensure forward progress.
- (1) is typically done by a thread at the front of a queue. This is trivial\
when the queues are initially empty as the first choice that arrives must be\
the only entry in its queue. When queues are initially non-empty, we achieve\
this by having (2) notify a thread at the front of a queue (preferring higher\
priority) to perform the next duty.
- We do not require any additional wakeup for (2). Typically it will just be\
done by the thread that finished (1).
Combined, the second and third bullet points above suggest the refill/granting
will typically be done by a request at the front of its queue. This is
important because one wakeup is saved when a granted request happens to be in an
already running thread.
Note there are a few cases that still lead to intermediate wakeup, however. The
first two are existing issues that also apply to the old algorithm, however, the
third (including both subpoints) is new.
- No request may be granted (only possible when rate limit dynamically\
decreases).
- Requests from a different queue may be granted.
- (2) may be run by a non-front request thread causing it to not be granted even\
if some requests in that same queue are granted. It can happen for a couple\
(unlikely) reasons.
- A new request may sneak in and grab the lock at the refill time, before the\
thread finishing (1) can wake up and grab it.
- A new request may sneak in and grab the lock and execute (1) before (2)'s\
chosen candidate can wake up and grab the lock. Then that non-front request\
thread performing (1) can carry over to perform (2).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602
Test Plan:
- Use existing tests. The edge cases listed in the comment are all performance\
related; I could not really think of any related to correctness. The logic\
looks the same whether a thread wakes up/finishes its work early/on-time/late,\
or whether the thread is chosen vs. "steals" the work.
- Verified write throughput and CPU overhead are basically the same with and\
without this change, even in a rate limiter heavy workload:
Test command:
```
$ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
```
Results before this PR:
```
fillrandom : 108.463 micros/op 9219 ops/sec; 9.0 MB/s
7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
```
Results after this PR:
```
fillrandom : 108.108 micros/op 9250 ops/sec; 9.0 MB/s
7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
```
Reviewed By: hx235
Differential Revision: D30048013
Pulled By: ajkr
fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b
Summary: UBSAN revealed a pointer underflow when `LZ4HC_init_internal` is called with a null `start`.
Reviewed By: ajkr
Differential Revision: D30181874
fbshipit-source-id: ca9bbac1a85c58782871d7f153af733b000cc66c
Summary:
This draining mechanism should not be run during `JoinThreads()` because it can detach threads that will be joined. Joining detached threads would throw an exception.
With this PR, we skip draining when `JoinThreads()` has already decided what threads to `join()`, so the threads will exit naturally once the work queue empties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8635
Test Plan: verified it unblocked using `WaitForJobsAndJoinAllThreads()` in https://github.com/facebook/rocksdb/issues/8611.
Reviewed By: riversand963
Differential Revision: D30174587
Pulled By: ajkr
fbshipit-source-id: 144966398a607987e0763c7152a0f653fdbf3c8b
Summary:
Context:
As need for new feature of resource management using RocksDB's rate limiter like [https://github.com/facebook/rocksdb/issues/8595](https://github.com/facebook/rocksdb/pull/8595) arises, it is about time to re-learn our rate limiter and make this learning process easier for others by improving its readability. The comment/assertion/one extra else-branch are added based on my best understanding toward the rate_limiter.cc and rate_limiter_test.cc up to date after giving it a hard read.
- Add code comments/assertion/one extra else-branch (that is not affecting existing behavior, see PR comment) to describe how leader-election works under multi-thread settings in GenericRateLimiter::Request()
- Add code comments to describe a non-obvious trick during clean-up of rate limiter destructor
- Add code comments to explain more about the starvation being fixed in GenericRateLimiter::Refill() through partial byte-granting
- Add code comments to the rate limiter's setup in a complicated unit test in rate_limiter_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8596
Test Plan: - passed existing rate_limiter_test.cc
Reviewed By: ajkr
Differential Revision: D29982590
Pulled By: hx235
fbshipit-source-id: c3592986bb5b0c90d8229fe44f425251ec7e8a0a
Summary:
Calling the GetImpl function could leave reference to a local
callback function in a field of a parameter struct. As this is
performance-critical code, I'm not going to attempt to sanitize this
code too much, but make the existing hack a bit cleaner by reverting
what it overwrites in the input struct.
Added SaveAndRestore utility class to make that easier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590
Test Plan:
added unit test for SaveAndRestore; existing tests for
GetImpl
Reviewed By: riversand963
Differential Revision: D29947983
Pulled By: pdillinger
fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555
Reviewed By: ajkr
Differential Revision: D29758399
Pulled By: jay-zhuang
fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
Summary:
Rare TSAN and valgrind failures are caused by unnecessary
reading of a field on the TaskLimiterToken::limiter_ for an assertion
after the token has been released and the limiter destroyed. To simplify
we can simply destroy the token before triggering DB shutdown
(potentially destroying the limiter). This makes the ReleaseOnce logic
unnecessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567
Test Plan: watch for more failures in CI
Reviewed By: ajkr
Differential Revision: D29811795
Pulled By: pdillinger
fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3
Summary:
Fixed a few MSVC (VCToolsVersion=14.0) build errors and warnings
* `DEFINE_string` is a macro and VC compiler complains that it cannot put [ifdef-inside-define](https://stackoverflow.com/questions/5586429/ifdef-inside-define)
* `sleep()` is not a recognizable function. Use `FLAGS_env->SleepForMicroseconds` instead
* Define precise type in comparison to avoid mismatch warning
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8519
Reviewed By: jay-zhuang
Differential Revision: D29683086
fbshipit-source-id: 8c80941472089f8daba84ae29597e75e603850e4
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead.
In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412
Test Plan: make check, add new testing cases.
Reviewed By: anand1976
Differential Revision: D29151545
Pulled By: zhichao-cao
fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
Summary:
Implement a function to generate the crc32c of two combined strings. Suppose we have the string 1 (s1) with crc32c checksum crc32c_1 and string 2 (s2) with crc32c checksum crc32c_2, the new string is s1+s2 and its checksum is crc32c_new=Crc32cCombine(crc32c_1, crc32c_2, s2.size).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8305
Test Plan: make check, added new testing case
Reviewed By: pdillinger
Differential Revision: D28651665
Pulled By: zhichao-cao
fbshipit-source-id: c84116108388f11a81f6a217b49f99c70d4ffacf
Summary:
Makes the Comparator class into a Customizable object. Added/Updated the CreateFromString method to create Comparators. Added test for using the ObjectRegistry to create one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8336
Reviewed By: jay-zhuang
Differential Revision: D28999612
Pulled By: mrambacher
fbshipit-source-id: bff2cb2814eeb9fef6a00fddc61d6e34b6fbcf2e