Commit Graph

12530 Commits

Author SHA1 Message Date
Levi Tamasi 81765866c4 Update HISTORY/version/format compatibility script for the 8.10 release (#12154)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12154

Reviewed By: jaykorean, akankshamahajan15

Differential Revision: D52216271

Pulled By: ltamasi

fbshipit-source-id: 13bab72802eeec8f6e3544be9ebcd7f725a64d2e
2023-12-15 14:44:23 -08:00
anand76 cc069f25b3 Add some compressed and tiered secondary cache stats (#12150)
Summary:
Add statistics for more visibility.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12150

Reviewed By: akankshamahajan15

Differential Revision: D52184633

Pulled By: anand1976

fbshipit-source-id: 9969e05d65223811cd12627102b020bb6d229352
2023-12-15 11:34:08 -08:00
Peter Dillinger 88bc91f3cc Cap eviction effort (CPU under stress) in HyperClockCache (#12141)
Summary:
HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list.  All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem.

However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system.

To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141

Test Plan:
unit test included

 ## Performance test

We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```

```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident

Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```

Now let us sample a range of values in the solution space:

```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident

After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident

After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident

After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident

After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```

I looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so is chosen as the default.

```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident

Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```

The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress.

```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```

No measurable difference under normal circumstances.

Some tests repeated with FixedHCC, with similar results.

Reviewed By: anand1976

Differential Revision: D52174755

Pulled By: pdillinger

fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8
2023-12-14 22:13:32 -08:00
Akanksha Mahajan cd577f6059 Fix WRITE_STALL start_time (#12147)
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147

Test Plan: Existing CircleCI

Reviewed By: cbi42

Differential Revision: D52173481

Pulled By: akankshamahajan15

fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
2023-12-14 13:45:06 -08:00
Ludovic Henry 5502f06729 Add support for linux-riscv64 (#12139)
Summary:
Following https://github.com/evolvedbinary/docker-rocksjava/pull/2, we can now build rocksdb on riscv64.

I've verified this works as expected with `make rocksdbjavastaticdockerriscv64`.

Also fixes https://github.com/facebook/rocksdb/issues/10500 https://github.com/facebook/rocksdb/issues/11994

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12139

Reviewed By: jaykorean

Differential Revision: D52128098

Pulled By: akankshamahajan15

fbshipit-source-id: 706d36a3f8a9e990b76f426bc450937a0cd1a537
2023-12-14 11:27:17 -08:00
akankshamahajan e7c6259447 Make auto_readahead_size default true (#12080)
Summary:
Make auto_readahead_size option default true

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12080

Test Plan: benchmarks and exisiting tests

Reviewed By: anand1976

Differential Revision: D52152132

Pulled By: akankshamahajan15

fbshipit-source-id: f1515563564e77df457dff2e865e4ede8c3ddf44
2023-12-14 11:25:51 -08:00
Levi Tamasi cd21e4e69d Some further cleanup in WriteBatchWithIndex::MultiGetFromBatchAndDB (#12143)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12143

https://github.com/facebook/rocksdb/pull/11982 changed `WriteBatchWithIndex::MultiGetFromBatchDB` to preallocate space in the `autovector`s `key_contexts` and `merges` in order to prevent any reallocations, both as an optimization and in order to prevent pointers into the container from being invalidated during subsequent insertions. On second thought, this preallocation can actually be a pessimization in cases when only a small subset of keys require querying the underlying database. To prevent any memory regressions, the PR reverts this preallocation. In addition, it makes some small code hygiene improvements like incorporating the `PinnableWideColumns` object into `MergeTuple`.

Reviewed By: jaykorean

Differential Revision: D52136513

fbshipit-source-id: 21aa835084433feab27b501d9d1fc5434acea609
2023-12-13 17:34:18 -08:00
Peter Dillinger c74531b1d2 Fix a nuisance compiler warning from clang (#12144)
Summary:
Example:

```
cache/clock_cache.cc:56:7: error: fallthrough annotation in unreachable code [-Werror,-Wimplicit-fallthrough]
      FALLTHROUGH_INTENDED;
      ^
./port/lang.h:10:30: note: expanded from macro 'FALLTHROUGH_INTENDED'
                             ^
```

In clang < 14, this is annoyingly generated from -Wimplicit-fallthrough, but was changed to -Wunreachable-code-fallthrough (implied by -Wunreachable-code) in clang 14. See https://reviews.llvm.org/D107933 for how this nuisance pattern generated false positives similar to ours in the Linux kernel.

Just to underscore the ridiculousness of this warning, here an error is reported on the annotation, not the call to do_something(), depending on the constexpr value (https://godbolt.org/z/EvxqdPTdr):

```
#include <atomic>
void do_something();
void test(int v) {
    switch (v) {
        case 1:
            if constexpr (std::atomic<long>::is_always_lock_free) {
                return;
            } else {
                do_something();
                [[fallthrough]];
            }
        case 2:
            return;
    }
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12144

Test Plan: Added the warning to our Makefile for USE_CLANG, which reproduced the warning-as-error as shown above, but is now fixed.

Reviewed By: jaykorean

Differential Revision: D52139615

Pulled By: pdillinger

fbshipit-source-id: ba967ae700c0916d1a478bc465cf917633e337d9
2023-12-13 15:58:46 -08:00
akankshamahajan d926593df5 Fix stress tests failure for auto_readahead_size (#12131)
Summary:
When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter  so that
- BlockBasedTableIterator can point index_iter_ to the right block.
- disable readahead_cache_lookup.
However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case.

So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131

Test Plan:
- Tested on db_stress command that was failing -
`./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35`

 - make crash_test -j32

Reviewed By: anand1976

Differential Revision: D51986326

Pulled By: akankshamahajan15

fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9
2023-12-13 12:15:04 -08:00
Andrew Kryczka d8e47620d7 Speedup based on pending compaction bytes relative to data size (#12130)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.

This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130

Reviewed By: hx235

Differential Revision: D52000746

Pulled By: ajkr

fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7
2023-12-13 10:37:27 -08:00
anand76 ebb5242d55 Sanitize the secondary_cache option in TieredCacheOptions (#12137)
Summary:
Sanitize the `secondary_cache` field in the `cache_opts` option of `TieredCacheOptions` to `nullptr` if set by the user. The nvm secondary cache should be directly set in `TieredCacheOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12137

Reviewed By: akankshamahajan15

Differential Revision: D52063817

Pulled By: anand1976

fbshipit-source-id: 255116c665a9b908c8f44109a2d331d4b73e7591
2023-12-12 10:58:00 -08:00
Yu Zhang c2ab4e754b Add initial support to stress test persist_user_defined_timestamps (#12124)
Summary:
This PR adds initial stress testing for the user-defined timestamps in memtable only feature. Each flavor of the `*_ts` crash test get a 1 in 3 chance to run with timestamps not persisted, this setting is initialized once and kept consistent across the following re-runs.

This initial stress test included these things besides disabling incompatible feature combinations to make the test run more stably:
1) It currently only run test methods that validates db state with expected state. Not the ones that validate db state by comparing result from one API to another API. Such as `TestMultiGet` (compared with `Get`), similarly `TestMultiGetEntity`, `TestIterate` (compare src iterator to a control iterator). Due to timestamps being removed, results from one API to another API is not directly comparable as it is now. More test logic to handle that need to be added, will do that in a follow up.

2) Even when comparing db state to expected state, sometimes the db can receive `InvalidArgument` too due to timestamps getting flushed and removed. Added some logic to handle that.

3) When timestamps are not persisted, we don't try to read with older timestamp. Since that's making it easier to get `InvalidArgument`. And this capability is not yet needed by our customer so it's disabled for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12124

Test Plan: running multiple flavor of this test on continuous run for sometime before checkin

Reviewed By: ltamasi

Differential Revision: D51916267

Pulled By: jowlyzhang

fbshipit-source-id: 3f3eb5f9618d05d296062820e0ef5cb8edc7c2b2
2023-12-12 09:35:29 -08:00
anand76 c1b84d0437 Fix false negative in TieredSecondaryCache nvm cache lookup (#12134)
Summary:
There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134

Test Plan: Add a new unit test in tiered_secondary_cache_test

Reviewed By: akankshamahajan15

Differential Revision: D52023309

Pulled By: anand1976

fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b
2023-12-11 16:59:59 -08:00
Peter Dillinger c96d9a0fbb Allow TablePropertiesCollectorFactory to return null collector (#12129)
Summary:
As part of building another feature, I wanted this:
* Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases.
* Polished, clarified some related API comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129

Test Plan: unit test added

Reviewed By: ltamasi

Differential Revision: D51966667

Pulled By: pdillinger

fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770
2023-12-11 12:02:56 -08:00
Radek Hubner 5c5e018943 Fix JNI lazy load regression. (#12133)
Summary:
A small regression that conflicted with PR https://github.com/facebook/rocksdb/pull/12133 was later merged in commit 2296c624fa (diff-26d3ab8a3d764183d0ea3aea834fe481eec2347c334b918ebd7bdce4bcabcc19R35)
This PR addresses that regression. Closes https://github.com/facebook/rocksdb/issues/12132

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12133

Reviewed By: jowlyzhang

Differential Revision: D52041736

Pulled By: ltamasi

fbshipit-source-id: 33db57035154c833ae00b5d921b17b3be80c8dd7
2023-12-11 11:21:52 -08:00
Alan Paxton 5a063ecd34 Java API consistency between RocksDB.put() , .merge() and Transaction.put() , .merge() (#11019)
Summary:
### Implement new Java API get()/put()/merge() methods, and transactional variants.

The Java API methods are very inconsistent in terms of how they pass parameters (byte[], ByteBuffer), and what variants and defaulted parameters they support. We try to bring some consistency to this.
 * All APIs should support calls with ByteBuffer parameters.
 * Similar methods (RocksDB.get() vs Transaction.get()) should support as similar as possible sets of parameters for predictability.
 * get()-like methods should provide variants where the caller supplies the target buffer, for the sake of efficiency. Allocation costs in Java can be significant when large buffers are repeatedly allocated and freed.

### API Additions

 1. RockDB.get implement indirect ByteBuffers. Added indirect ByteBuffers and supporting native methods for get().
 2. RocksDB.Iterator implement missing (byte[], offset, length) variants for key() and value() parameters.
 3. Transaction.get() implement missing methods, based on RocksDB.get. Added ByteBuffer.get with and without column family. Added byte[]-as-target get.
 4. Transaction.iterator() implement a getIterator() which defaults ReadOptions; as per RocksDB.iterator(). Rationalize support API for this and RocksDB.iterator()
 5. RocksDB.merge implement ByteBuffer methods; both direct and indirect buffers. Shadow the methods of RocksDB.put; RocksDB.put only offers ByteBuffer API with explicit WriteOptions. Duplicated this with RocksDB.merge
 6. Transaction.merge implement methods as per RocksDB.merge methods. Transaction is already constructed with WriteOptions, so no explicit WriteOptions methods required.
 7. Transaction.mergeUntracked implement the same API methods as Transaction.merge except the ones that use assumeTracked, because that’s not a feature of merge untracked.

### Support Changes (C++)

The current JNI code in C++ supports multiple variants of methods through a number of helper functions. There are numerous TODO suggestions in the code proposing that the helpers be re-factored/shared.

We have taken a different approach for the new methods; we have created wrapper classes `JDirectBufferSlice`, `JDirectBufferPinnableSlice`, `JByteArraySlice` and `JByteArrayPinnableSlice` RAII classes which construct slices from JNI parameters and can then be passed directly to RocksDB methods. For instance, the `Java_org_rocksdb_Transaction_getDirect` method is implemented like this:

```
  try {
    ROCKSDB_NAMESPACE::JDirectBufferSlice key(env, jkey_bb, jkey_off,
                                              jkey_part_len);
    ROCKSDB_NAMESPACE::JDirectBufferPinnableSlice value(env, jval_bb, jval_off,
                                                        jval_part_len);
    ROCKSDB_NAMESPACE::KVException::ThrowOnError(
        env, txn->Get(*read_options, column_family_handle, key.slice(),
                      &value.pinnable_slice()));
    return value.Fetch();
  } catch (const ROCKSDB_NAMESPACE::KVException& e) {
    return e.Code();
  }
```
Notice the try/catch mechanism with the `KVException` class, which combined with RAII and the wrapper classes means that there is no ad-hoc cleanup necessary in the JNI methods.

We propose to extend this mechanism to existing JNI methods as further work.

### Support Changes (Java)

Where there are multiple parameter-variant versions of the same method, we use fewer or just one supporting native method for all of them. This makes maintenance a bit easier and reduces the opportunity for coding errors mixing up (untyped) object handles.

In  order to support this efficiently, some classes need to have default values for column families and read options added and cached so that they are not re-constructed on every method call.

This PR closes https://github.com/facebook/rocksdb/issues/9776

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11019

Reviewed By: ajkr

Differential Revision: D52039446

Pulled By: jowlyzhang

fbshipit-source-id: 45d0140a4887e42134d2e56520e9b8efbd349660
2023-12-11 11:03:17 -08:00
Richard Barnes 4f04f96742 Remove extra semi colon from infrasec/authorization/audit/AclAuditor.cpp
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D51995065

fbshipit-source-id: 9b55a0d8abd0927b76376cb7751bf0fcab10518c
2023-12-08 17:21:52 -08:00
Kevin Mingtarja 44fd914128 Fix double counting of BYTES_WRITTEN ticker (#12111)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12061.

We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable.

The cause of this bug is previously during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so it is still added to during the `Prepare()` phase even though we're not writing to the Memtable, which was why we saw the value to be double of what's written to WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111

Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes.

Reviewed By: jaykorean

Differential Revision: D51954327

Pulled By: jowlyzhang

fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c
2023-12-08 17:12:11 -08:00
Levi Tamasi a143f93236 Turn the default Timer in PeriodicTaskScheduler into a leaky Meyers singleton (#12128)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12128

The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues.

Reviewed By: akankshamahajan15

Differential Revision: D51963950

fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948
2023-12-08 10:34:07 -08:00
Hui Xiao 179d2c7646 Intensify "xxx_one_in"'s default value in crash test (#12127)
Summary:
**Context/Summary:**
My experimental stress runs with more frequent "xxx_one_in" surfaced a couple interesting bugs/issues with RocksDB or crash test framework in the past. We now consider changing the default value so they are run more frequently in production testing environment.

Increase frequency by 2 orders of magnitude for most parameters, except for error-prone features e.g, manual compaction and file ingestion (increased by 3 orders) and expensive features e.g, checksum verification (increased by 1 order)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12127

Test Plan: Monitor CI to see if it did surface more interesting bugs/issues. If not, we may consider intensify even more.

Reviewed By: pdillinger

Differential Revision: D51954235

Pulled By: hx235

fbshipit-source-id: 92046cb7c52a37212f19ab7965b40f77b90b08b1
2023-12-08 10:22:14 -08:00
akankshamahajan c77b50a4fd Add AsyncIO support for tuning readahead_size by block cache lookup (#11936)
Summary:
Add support for tuning of readahead_size by block cache lookup for async_io.

**Design/ Implementation** -

**BlockBasedTableIterator.cc** -

`BlockCacheLookupForReadAheadSize` callback API lookups in the block cache and tries to reduce the start
and end offset passed. This function looks into the block cache for the blocks between `start_offset`
and `end_offset` and add all the handles in the queue.

It then iterates from the end in the handles to find first miss block and update the end offset to that block.
It also iterates from the start and find first miss block and update the start offset to that block.

```
_read_curr_block_ argument : True if this call was due to miss in the cache and caller wants to read that block
                             synchronously.
                             False if current call is to prefetch additional data in extra buffers
                            (due to ReadAsync call in FilePrefetchBuffer)
```
In case there is no data to be read in that callback (because of upper_bound or all blocks are in cache),
it updates start and end offset to be equal and that `FilePrefetchBuffer` interprets that as 0 length to be read.

**FilePrefetchBuffer.cc** -

FilePrefetchBuffer calls the callback - `ReadAheadSizeTuning` and pass the start and end offset to that
callback to get updated start and end offset to read based on cache hits/misses.

1. In case of Read calls (when offset passed to FilePrefetchBuffer is on cache miss and that data needs to be read), _read_curr_block_ is passed true.
2. In case of ReadAsync calls, when buffer is all consumed and can go for additional prefetching,  the start offset passed is the initial end offset of prev buffer (without any updated offset based on cache hit/miss).

Foreg. if following are the data blocks with cache hit/miss and start offset
and Read API found miss on DB1 and based on readahead_size (50)  it passes end offset to be 50.
 [DB1 - miss- 0 ] [DB2 - hit -10] [DB3 - miss -20] [DB4 - miss-30] [DB5 - hit-40]
 [DB6 - hit-50] [DB7 - miss-60] [DB8 - miss - 70] [DB9 - hit - 80] [DB6 - hit 90]

- For Read call - updated start offset remains 0 but end offset updates to DB4, as DB5 is in cache.
- Read calls saves initial end offset 50 as that was meant to be prefetched.
- Now for next ReadAsync call - the start offset will be 50 (previous buffer initial end offset) and based on readahead_size, end offset will be 100
- On callback, because of cache hits - callback will update the start offset to 60 and end offset to 80 to read only 2 data blocks (DB7 and DB8).
- And for that ReadAsync call - initial end offset will be set to 100 which will again used by next ReadAsync call as start offset.
-  `initial_end_offset_` in `BufferInfo` is used to save the initial end offset of that buffer.

- If let's say DB5 and DB6 overlaps in 2 buffers (because of alignment), `prev_buf_end_offset` is passed to make sure already prefetched data is not prefetched again in second buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11936

Test Plan:
- Ran crash_test several times.
-  New unit tests added.

Reviewed By: anand1976

Differential Revision: D50906217

Pulled By: akankshamahajan15

fbshipit-source-id: 0d75d3c98274e98aa34901b201b8fb05232139cf
2023-12-06 13:48:15 -08:00
Levi Tamasi 0ebe1614cb Eliminate some code duplication in MergeHelper (#12121)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12121

The patch eliminates some code duplication by unifying the two sets of `MergeHelper::TimedFullMerge` overloads using variadic templates. It also brings the order of parameters into sync when it comes to the various `TimedFullMerge*` methods.

Reviewed By: jaykorean

Differential Revision: D51862483

fbshipit-source-id: e3f832a6ff89ba34591451655cf11025d0a0d018
2023-12-05 14:07:42 -08:00
Levi Tamasi 2045fe4693 Mention PR 11892 in the changelog (#12118)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12118

Reviewed By: jaykorean

Differential Revision: D51820703

fbshipit-source-id: d2a86a4781618747c6b7c71971862d510a25e103
2023-12-04 13:20:28 -08:00
Yu Zhang ba8fa0f546 internal_repo_rocksdb (4372117296613874540) (#12117)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12117

Reviewed By: ajkr

Differential Revision: D51745846

Pulled By: jowlyzhang

fbshipit-source-id: 51c806a484b3b43d174b06d2cfe9499191d09914
2023-12-04 11:17:32 -08:00
Richard Barnes dce3ca5ab8 Remove extra semi colon from internal_repo_rocksdb/repo/monitoring/perf_context_imp.h
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: dmm-fb

Differential Revision: D51778007

fbshipit-source-id: 5d1b20a3acc4bcc7cd7c204f2f73a14fc8f81883
2023-12-01 22:35:34 -08:00
Andrew Kryczka 06dc32ef25 internal_repo_rocksdb (435146444452818992) (#12115)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12115

Reviewed By: jowlyzhang

Differential Revision: D51745742

Pulled By: ajkr

fbshipit-source-id: 67000d07783b413924798dd9c1751da27e119d53
2023-12-01 11:15:17 -08:00
Andrew Kryczka be3bc36811 internal_repo_rocksdb (-8794174668376270091) (#12114)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12114

Reviewed By: jowlyzhang

Differential Revision: D51745613

Pulled By: ajkr

fbshipit-source-id: 27ca4bda275cab057d3a3ec99f0f92cdb9be5177
2023-12-01 11:10:30 -08:00
Yu Zhang 7eca51dfc3 Refactor crash test stderr parsing logic into a function (#12109)
Summary:
This is a simple refactor for the crash test script to put shared logic for parsing stderr into a function. There is no functional change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12109

Test Plan: manually tested the script

Reviewed By: ajkr

Differential Revision: D51692172

Pulled By: jowlyzhang

fbshipit-source-id: d346d64e981d9c489c380ff6ce33296a224b5877
2023-12-01 11:01:29 -08:00
Levi Tamasi b760af321f Initial support for wide columns in WriteBatchWithIndex (#11982)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11982

The patch constitutes the first phase of adding wide-column support to `WriteBatchWithIndex`. Namely, it implements the `PutEntity` API in `WriteBatchWithIndex` on the write path, and the `Iterator::columns()` API in `BaseDeltaIterator` on the read path. In addition, it updates all existing read APIs (`GetFromBatch`, `GetFromBatchAndDB`, `MultiGetFromBatchAndDB`, and `BaseDeltaIterator`) so that they handle wide-column entities correctly. This includes returning the value of the default column of entities as appropriate and correctly applying merges to wide-column base values. I plan to add the wide-column specific point lookup APIs (`GetEntityFromBatch`, `GetEntityFromBatchAndDB`, and `MultiGetEntityFromBatchAndDB`) in subsequent patches.

Reviewed By: jaykorean

Differential Revision: D50439231

fbshipit-source-id: 59fd0f12c45249fecde8af249c5d3f509ba58bbe
2023-11-30 14:10:13 -08:00
raffertyyu a7779458bd sst_dump support cuckoo table (#12098)
Summary:
https://github.com/facebook/rocksdb/issues/11446

Support Cuckoo Table format in sst_dump.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12098

Reviewed By: jowlyzhang

Differential Revision: D51594094

Pulled By: ajkr

fbshipit-source-id: ba9092818bc3cc0f207b000391aa21d564570df2
2023-11-30 08:06:37 -08:00
Yu Zhang d68f45e777 Flush buffered logs when FlushRequest is rescheduled (#12105)
Summary:
The optimization to not find and delete obsolete files when FlushRequest is re-scheduled also inadvertently skipped flushing the `LogBuffer`, resulting in missed logs. This PR fixes the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12105

Test Plan:
manually check this test has the correct info log after the fix
`./column_family_test --gtest_filter=ColumnFamilyRetainUDTTest.NotAllKeysExpiredFlushRescheduled`

Reviewed By: ajkr

Differential Revision: D51671079

Pulled By: jowlyzhang

fbshipit-source-id: da0640e07e35c69c08988772ed611ec9e67f2e92
2023-11-29 11:35:59 -08:00
anand76 acc078f878 Add tiered cache options to db_bench (#12104)
Summary:
Add the option to have a 3-tier block cache (uncompressed RAM, compressed RAM, and local flash) in db_bench, as well as specifying secondary cache admission policy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12104

Reviewed By: ajkr

Differential Revision: D51629092

Pulled By: anand1976

fbshipit-source-id: 6a208f853bc85d3d8b437d91cb1b0142d9a99e53
2023-11-28 14:54:08 -08:00
anand76 4d04138512 Add dynamic disabling of compressed cache to db_stress (#12102)
Summary:
We now support re-enabling the compressed portion of the `TieredCache` after dynamically disabling it. Add it to db_stress for testing purposes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12102

Reviewed By: akankshamahajan15

Differential Revision: D51594259

Pulled By: anand1976

fbshipit-source-id: ea544e30a5ebd6290fc9ed46a241f09634764d2a
2023-11-27 13:00:15 -08:00
Alexander Kiel 6e7701d49b Fix JavaDoc of setCompactionReadaheadSize (#12090)
Summary:
Recently in https://github.com/facebook/rocksdb/issues/11762 the default of `compaction_readahead_size` changed from 0 to 2 MB.

Closes: https://github.com/facebook/rocksdb/issues/12088

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12090

Reviewed By: jaykorean

Differential Revision: D51531762

Pulled By: ajkr

fbshipit-source-id: a0b7145a1dca95ee90ffa3553f6eeacce6424aee
2023-11-27 11:50:53 -08:00
Peter Dillinger 4dd2bb8f70 Fix stack trace trimming with LLDB (#12101)
Summary:
I must have chosen trimming before frame 8 based on assertion failures, but that trims too many frame for a general segfault. So this changes to start printing at frame 4, as in this example where I've seeded a null deref:

```
Received signal 11 (Segmentation fault)
Invoking LLDB for stack trace...
Process 873208 stopped
* thread #1, name = 'db_stress', stop reason = signal SIGSTOP
    frame #0: 0x00007fb1fe8f1033 libc.so.6`__GI___wait4(pid=873478, stat_loc=0x00007fb1fb114030, options=0, usage=0x0000000000000000) at wait4.c:30:10
  thread #2, name = 'rocksdb:low', stop reason = signal SIGSTOP
    frame #0: 0x00007fb1fe8972a1 libc.so.6`__GI___futex_abstimed_wait_cancelable64 at futex-internal.c:57:12
Executable module set to "/data/users/peterd/rocksdb/db_stress".
Architecture set to: x86_64-unknown-linux-gnu.
True
frame #4: 0x00007fb1fe844540 libc.so.6`__restore_rt at libc_sigaction.c:13
frame #5: 0x0000000000608514 db_stress`rocksdb::StressTest::InitDb(rocksdb::SharedState*) at db_stress_test_base.cc:345:18
frame #6: 0x0000000000585d62 db_stress`rocksdb::RunStressTestImpl(rocksdb::SharedState*) at db_stress_driver.cc:84:17
frame #7: 0x000000000058dd69 db_stress`rocksdb::RunStressTest(shared=0x00006120000001c0) at db_stress_driver.cc:266:34
frame #8: 0x0000000000453b34 db_stress`rocksdb::db_stress_tool(int, char**) at db_stress_tool.cc:370:20
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12101

Test Plan: manual (see above)

Reviewed By: ajkr

Differential Revision: D51593217

Pulled By: pdillinger

fbshipit-source-id: 4a71eb8e516edbc32e682f9537bc77d073a7b4ed
2023-11-27 11:49:52 -08:00
Peter Dillinger f6fd4b9dbd Print stack traces more reliably with concurrency (#12086)
Summary:
It's been relatively easy to break our stack trace printer:
* If another thread reaches a signal condition such as a related SEGV or assertion failure while one is trying to print a stack trace from the signal handler, it seems to end the process abruptly without a stack trace.
* If the process exits normally in one thread (such as main finishing) while another is trying to print a stack trace from the signal handler, it seems the process will often end normally without a stack trace.

This change attempts to fix these issues, with
* Keep the custom signal handler installed as long as possible, so that other threads will most likely re-enter our custom handler. (We only switch back to default for triggering core dump or whatever after stack trace.)
* Use atomics and sleeps to implement a crude recursive mutex for ensuring all threads hitting the custom signal handler wait on the first that is trying to print a stack trace, while recursive signals in the same thread can still be handled cleanly.
* Use an atexit handler to hook into normal exit to (a) wait on a pending printing of stack trace when detectable and applicable, and (b) detect and warn when printing a stack trace might be interrupted by a process exit in progress. (I don't know how to pause that *after* our atexit handler has been called; the best I know how to do is warn, "In a race with process already exiting...".)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12086

Test Plan:
manual, including with TSAN. I added this code to the end of a unit test file:
```
  for (size_t i = 0; i < 3; ++i) {
    std::thread t([]() { assert(false); });
    t.detach();
  }
```
Followed by either `sleep(100)` or `usleep(100)` or usual process exit. And for recursive signal testing, inject `abort()` at various places in the handler.

Reviewed By: cbi42

Differential Revision: D51531882

Pulled By: pdillinger

fbshipit-source-id: 3473b863a43e61b722dfb7a2ed12a8120949b09c
2023-11-22 11:55:10 -08:00
Peter Dillinger a140b519b1 Convert all but one windows job to nightly (#12089)
Summary:
... because they are expensive and rarely disagree with each other. Historical data indicates that the 2019 job is most sensitive to failure.

https://fburl.com/scuba/opensource_ci_jobs/ntq3ue3p https://fburl.com/scuba/opensource_ci_jobs/0xo91j5f

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12089

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D51530386

Pulled By: pdillinger

fbshipit-source-id: 8b676d6e01096e359a0f465b59d81ac10f4f7969
2023-11-22 10:40:52 -08:00
cz2h 324453e579 Fix rowcache get returning incorrect timestamp (#11952)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7930.

When there is a timestamp associated with stored records, get from row cache will return the timestamp provided in query instead of the timestamp associated with the stored record.

## Cause of error:
Currently a row_handle is fetched using row_cache_key(contains a timestamp provided by user query) and the row_handle itself does not persist timestamp associated with the object. Hence the [GetContext::SaveValue()
](6e3429b8a6/table/get_context.cc (L257)) function will fetch the timestamp in row_cache_key and may return the incorrect timestamp value.

## Proposed Solution
If current cf enables ts, append a timestamp associated with stored records after the value in replay_log (equivalently the value of row cache entry).

When read, `replayGetContextLog()` will update parsed_key with the correct timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11952

Reviewed By: ajkr

Differential Revision: D51501176

Pulled By: jowlyzhang

fbshipit-source-id: 808fc943a8ae95de56ae0e82ec59a2573a031f28
2023-11-21 20:39:33 -08:00
Jay Huh ddb7df10ef Update HISTORY.md and version.h for 8.9.fb release (#12074)
Summary:
Creating cut for 8.9 release

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12074

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D51435289

Pulled By: jaykorean

fbshipit-source-id: 3918a8250032839e5b71f67f26c8ba01cbc17a41
2023-11-21 18:07:19 -08:00
Yu Zhang 84a54e1e28 Fix some bugs in index builder and reader for the UDT in memtable only feature (#12062)
Summary:
These bugs surfaced while I was trying to add the stress test for the feature:

Bug 1) On the index building path: the optimization to use user key instead of internal key as separator needed a bit tweak for when user defined timestamps can be removed. Because even though the user key look different now and eligible to be used as separator, when their user-defined timestamps are removed, they could be equal and that invariant no longer stands.

Bug 2) On the index reading path: one path that builds the second level index iterator for `PartitionedIndexReader` are not passing the corresponding `user_defined_timestamps_persisted` flag. As a result, the default `true` value be used leading to no minimum timestamps padded when they should be.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12062

Test Plan:
For bug 1): added separate unit test `BlockBasedTableReaderTest::Get` to exercise the `Get` API. It's a different code path from `MultiGet` so worth having its own test. Also in order to cover the bug, the test is modified to generate key values with the same user provided key, different timestamps and different sequence numbers. The test reads back different versions of the same user provided key.  `MultiGet` takes one `ReadOptions` with one read timestamp so we cannot test retrieving different versions of the same key easily.

For bug 2): simply added options `BlockBasedTableOptions.metadata_cache_options.partition_pinning = PinningTier::kAll` to exercise all the index iterator creating paths.

Reviewed By: ltamasi

Differential Revision: D51508280

Pulled By: jowlyzhang

fbshipit-source-id: 8b174d3d70373c0599266ac1f467f2bd4d7ea6e5
2023-11-21 14:05:02 -08:00
songqing d3e015fe06 Fix compact_files_example (#12084)
Summary:
The option "write_buffer_size" has changed from 4MB for 64MB by default, and the compact_files_example will not work as expected, as the test data written is only about 50MB and will not trigger compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12084

Reviewed By: cbi42

Differential Revision: D51499959

Pulled By: ajkr

fbshipit-source-id: 4f4b25ebc4b6bb568501adc8e97813edcddceea8
2023-11-21 09:34:59 -08:00
Andrew Kryczka 04cbc77b90 Add missing license to source files (#12083)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12079.

Fixed missing licenses in "\*.h" and "\*.cc" files

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12083

Reviewed By: cbi42

Differential Revision: D51489634

Pulled By: ajkr

fbshipit-source-id: 764bfee257b9d6603fd7606a55664b7537e1898f
2023-11-21 08:36:30 -08:00
anand76 336a74db60 Add some asserts in ~CacheWithSecondaryAdapter (#12082)
Summary:
Add some asserts in the `CacheWithSecondaryAdapter` destructor to help debug a crash test failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12082

Reviewed By: cbi42

Differential Revision: D51486041

Pulled By: anand1976

fbshipit-source-id: 76537beed31ba27ab9ac8b4ce6deb775629e3be5
2023-11-20 17:48:17 -08:00
Changyu Bi fb5c8c7ea3 Do not compare op_type in `WithinPenultimateLevelOutputRange()` (#12081)
Summary:
`WithinPenultimateLevelOutputRange()` is updated in https://github.com/facebook/rocksdb/issues/12063 to check internal key range. However, op_type of a key can change during compaction, e.g. MERGE -> PUT, which makes a key larger and becomes out of penultimate output range. This has caused stress test failures with error message "Unsafe to store Seq later than snapshot in the last level if per_key_placement is enabled". So update `WithinPenultimateLevelOutputRange()` to only check user key and sequence number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12081

Test Plan:
* This repro can produce the corruption within a few runs. Ran it a few times after the fix and did not see Corruption failure.
```
python3 ./tools/db_crashtest.py whitebox --test_tiered_storage --random_kill_odd=888887 --use_merge=1 --writepercent=100 --readpercent=0 --prefixpercent=0 --delpercent=0 --delrangepercent=0 --iterpercent=0 --write_buffer_size=419430 --column_families=1 --read_fault_one_in=0 --write_fault_one_in=0
```

Reviewed By: ajkr

Differential Revision: D51481202

Pulled By: cbi42

fbshipit-source-id: cad6b65099733e03071b496e752bbdb09cf4db82
2023-11-20 17:07:28 -08:00
Timo Riski 39d33475da Fix build on FreeBSD (#11218) (#12078)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11218

Changes from https://github.com/facebook/rocksdb/issues/10881 broke FreeBSD builds with:

    env/io_posix.h:39:9: error: 'POSIX_MADV_NORMAL' macro redefined [-Werror,-Wmacro-redefined]

This commit fixes FreeBSD builds by ignoring MADV defines.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12078

Reviewed By: cbi42

Differential Revision: D51452802

Pulled By: ajkr

fbshipit-source-id: 0a1f5a90954e7d257a95794277a843ac77f3a709
2023-11-20 10:11:16 -08:00
Changyu Bi b059c5680e Add missing copyright header (#12076)
Summary:
Required for open source repo.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12076

Reviewed By: ajkr

Differential Revision: D51449839

Pulled By: cbi42

fbshipit-source-id: 4a25a3422880db3f28a2834d966341935db32530
2023-11-19 09:50:59 -08:00
Benoît Mériaux 7780e98268 add write_buffer_manager setter into options and tests in c bindings, (#12007)
Summary:
following https://github.com/facebook/rocksdb/pull/11710
 - add test on wbm c api
- add a setter for WBM in `DBOptions`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12007

Reviewed By: cbi42

Differential Revision: D51430042

Pulled By: ajkr

fbshipit-source-id: 608bc4d3ed35a84200459d0230b35be64b3475f7
2023-11-17 11:34:05 -08:00
Changyu Bi 4e58cc6437 Check internal key range when compacting from last level to penultimate level (#12063)
Summary:
The test failure in https://github.com/facebook/rocksdb/issues/11909 shows that we may compact keys outside of internal key range of penultimate level input files from last level to penultimate level, which can potentially cause overlapping files in the penultimate level. This PR updates the  `Compaction::WithinPenultimateLevelOutputRange()` to check internal key range instead of user key.

Other fixes:
* skip range del sentinels when deciding output level for tiered compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12063

Test Plan:
- existing unit tests
- apply the fix to https://github.com/facebook/rocksdb/issues/11905 and run `./tiered_compaction_test --gtest_filter="*RangeDelsCauseFileEndpointsToOverlap*"`

Reviewed By: ajkr

Differential Revision: D51288985

Pulled By: cbi42

fbshipit-source-id: 70085db5f5c3b15300bcbc39057d57b83fd9902a
2023-11-17 10:50:40 -08:00
Radek Hubner 2f9ea8193f Add HyperClockCache Java API. (#12065)
Summary:
Fix https://github.com/facebook/rocksdb/issues/11510

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12065

Reviewed By: ajkr

Differential Revision: D51406695

Pulled By: cbi42

fbshipit-source-id: b9e32da5f9bcafb5365e4349f7295be90d5aa7ba
2023-11-16 15:46:31 -08:00
nccx a9bd525b52 Add Qdrant to USERS.md (#12072)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12072

Reviewed By: cbi42

Differential Revision: D51398080

Pulled By: ajkr

fbshipit-source-id: 1043f2b012bd744e9c53c638e1ba56a3e0392e11
2023-11-16 10:35:08 -08:00