Commit Graph

284 Commits

Author SHA1 Message Date
Akanksha Mahajan 2acbf386a3 Provide support for direct_reads with async_io (#10197)
Summary:
Provide support for use_direct_reads with async_io.

TestPlan:
-  Updated unit tests
-  db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420
- db_stress
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1"
make crash_test -j
```
- Ran db_bench on previous RocksDB version before any async_io implementation (as there have many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197

Reviewed By: anand1976

Differential Revision: D37255646

Pulled By: akankshamahajan15

fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920
2022-07-06 11:42:59 -07:00
Bo Wang c073ed7601 Fix typo in comments and code (#10233)
Summary:
Fix typo in comments and code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233

Test Plan: Existing unit tests should pass.

Reviewed By: jay-zhuang, anand1976

Differential Revision: D37356702

Pulled By: gitbw95

fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51
2022-06-22 15:45:21 -07:00
Yanqin Jin 36fefd7e22 Continue to deflake BackupEngineTest.Concurrency (#10228)
Summary:
Even after https://github.com/facebook/rocksdb/issues/10069, `BackupEngineTest.Concurrency` is still flaky with decreased probability of failure.

Repro steps as follows
```bash
make backup_engine_test
gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
```

The first two commits of this PR demonstrate how the test is flaky. https://github.com/facebook/rocksdb/issues/10069 handles the case in which
`Rename()` file returns `IOError` with subcode `PathNotFound`, and `CreateLoggerFromOptions()`
allows the operation to succeed, as expected by the test. However, `BackupEngineTest` uses
`RemapFileSystem` on top of `ChrootFileSystem` which can return `NotFound` instead of `IOError`.

This behavior is different from `Env::Default()` which returns PathNotFound if the src of `rename()`
does not exist. We should make the behaviors of the test Env/FS match a real Env/FS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10228

Test Plan:
```bash
make check
gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
```

Reviewed By: pdillinger

Differential Revision: D37337241

Pulled By: riversand963

fbshipit-source-id: 07a53115e424467b55a731866e571f0ad4c6635d
2022-06-22 08:50:05 -07:00
sdong 69a32eecab Use madvise() for mmaped file advise (#10170)
Summary:
A recent PR https://github.com/facebook/rocksdb/pull/10142 enabled fadvise for mmaped file. However, we were told that it might not take effective and madvise() should be used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10170

Test Plan:
Run existing tests
Run a benchmark using mmap with advise random and see I/O size is indeed small.

Reviewed By: anand1976

Differential Revision: D37158582

fbshipit-source-id: 8b3a74f0e89d2e16aac78ee4124c05841d4135c3
2022-06-15 13:05:58 -07:00
Akanksha Mahajan 19345de60d fix cancel argument for latest liburing (#10168)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10168

the arg changed to u64

Reviewed By: ajkr

Differential Revision: D37155407

fbshipit-source-id: 464eab2806675f148fce075a6fea369fa3d7a9bb
2022-06-15 09:10:19 -07:00
Akanksha Mahajan 40d19bc12c Fix the failure related to io_uring_prep_cancel (#10165)
Summary:
Fix for Internal jobs are failing with
```
 error: no matching function for call to 'io_uring_prep_cancel'
      io_uring_prep_cancel(sqe, posix_handle, 0);
      ^~~~~~~~~~~~~~~~~~~~
note: candidate function not viable: no known conversion from 'rocksdb::Posix_IOHandle *' to '__u64' (aka 'unsigned long long') for 2nd argument
static inline void io_uring_prep_cancel(struct io_uring_sqe *sqe,
```

User data is set using `io_uring_set_data` API so no need to pass posix_handle here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10165

Test Plan: CircleCI jobs

Reviewed By: jay-zhuang

Differential Revision: D37145233

Pulled By: akankshamahajan15

fbshipit-source-id: 05da650e1240e9c6fcc8aed5f0067308dccb164a
2022-06-14 12:35:11 -07:00
Akanksha Mahajan c0e0f30667 Implement AbortIO using io_uring (#10125)
Summary:
Implement AbortIO in posix using io_uring to cancel any pending read requests submitted. Its cancelled using io_uring_prep_cancel which sets the IORING_OP_ASYNC_CANCEL flag.

To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT.

Reference: https://kernel.dk/io_uring-whatsnew.pdf,
https://lore.kernel.org/io-uring/d9a8d76d23690842f666c326631ecc2d85b6c1bc.1615566409.git.asml.silence@gmail.com/

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10125

Test Plan: Existing Posix tests.

Reviewed By: anand1976

Differential Revision: D36946970

Pulled By: akankshamahajan15

fbshipit-source-id: 3bc1f1521b3151d01a348fc6431eb3fc85db3a14
2022-06-13 18:07:24 -07:00
sdong 80afa77660 Run fadvise with mmap file (#10142)
Summary:
Right now with mmap file, we don't run fadvise following users' requests. There is no reason for that so this diff does that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10142

Test Plan:
A simple readrandom against files with page cache dropped shows latency improvement from 7.8 us to 2.8:

./db_bench -use_existing_db --benchmarks=readrandom --num=100

Reviewed By: anand1976

Differential Revision: D37074975

fbshipit-source-id: ccc72bcac1b5fd634eb8fa2b6a5d9afe332e0bf6
2022-06-10 16:34:01 -07:00
zczhu e88d8935ae Add comments/permit unchecked error to close_db_dir pull requests (#10093)
Summary:
In [close_db_dir](https://github.com/facebook/rocksdb/pull/10049) pull request, some merging conflicts occurred (some comments and one line `s.PermitUncheckedError()` are missing). This pull request aims to put them back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10093

Reviewed By: ajkr

Differential Revision: D36884117

Pulled By: littlepig2013

fbshipit-source-id: 8c0e2a8793fc52804067c511843bd1ff4912c1c3
2022-06-02 21:52:35 -07:00
Zichen Zhu 65893ad959 Explicitly closing all directory file descriptors (#10049)
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049

Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.

Reviewed By: ajkr, hx235

Differential Revision: D36722135

Pulled By: littlepig2013

fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
2022-06-01 18:03:34 -07:00
Gang Liao e228515740 Pass the size of blob files to SstFileManager during DB open (#10062)
Summary:
RocksDB uses the (no longer aptly named) SST file manager (see https://github.com/facebook/rocksdb/wiki/Managing-Disk-Space-Utilization) to track and potentially limit the space used by SST and blob files (as well as to rate-limit the deletion of these data files). The SST file manager tracks the SST and blob file sizes in an in-memory hash map, which has to be rebuilt during DB open. File sizes can be generally obtained by querying the file system; however, there is a performance optimization possibility here since the sizes of SST and blob files are also tracked in the RocksDB MANIFEST, so we can simply pass the file sizes stored there instead of consulting the file system for each file. Currently, this optimization is only implemented for SST files; we would like to extend it to blob files as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10062

Test Plan:
Add unit tests for the change to the test suite
ltamasi riversand963  akankshamahajan15

Reviewed By: ltamasi

Differential Revision: D36726621

Pulled By: gangliao

fbshipit-source-id: 4010dc46ef7306142f1c2e0d1c3bf75b196ef82a
2022-05-27 05:58:43 -07:00
Akanksha Mahajan 28ea1fb44a Provide support for IOTracing for ReadAsync API (#9833)
Summary:
Same as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9833

Test Plan:
Add unit test and manually check the output of tracing logs
For fixed readahead_size it logs as:
```
Access Time : 193352113447923     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 15075     , IO Status: OK, Length: 12288, Offset: 659456
Access Time : 193352113465232     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 14425     , IO Status: OK, Length: 12288, Offset: 671744
Access Time : 193352113481539     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 13062     , IO Status: OK, Length: 12288, Offset: 684032
Access Time : 193352113497692     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 13649     , IO Status: OK, Length: 12288, Offset: 696320
Access Time : 193352113520043     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 19384     , IO Status: OK, Length: 12288, Offset: 708608
Access Time : 193352113538401     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 15406     , IO Status: OK, Length: 12288, Offset: 720896
Access Time : 193352113554855     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 13670     , IO Status: OK, Length: 12288, Offset: 733184
Access Time : 193352113571624     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 13855     , IO Status: OK, Length: 12288, Offset: 745472
Access Time : 193352113587924     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 13953     , IO Status: OK, Length: 12288, Offset: 757760
Access Time : 193352113603285     , File Name: 000026.sst          , File Operation: Prefetch          , Latency: 59        , IO Status: Not implemented: Prefetch not supported, Length: 8868, Offset: 898349
```

For implicit readahead:
```
Access Time : 193351865156587     , File Name: 000026.sst          , File Operation: Prefetch          , Latency: 48        , IO Status: Not implemented: Prefetch not supported, Length: 12266, Offset: 391174
Access Time : 193351865160354     , File Name: 000026.sst          , File Operation: Prefetch          , Latency: 51        , IO Status: Not implemented: Prefetch not supported, Length: 12266, Offset: 395248
Access Time : 193351865164253     , File Name: 000026.sst          , File Operation: Prefetch          , Latency: 49        , IO Status: Not implemented: Prefetch not supported, Length: 12266, Offset: 399322
Access Time : 193351865165461     , File Name: 000026.sst          , File Operation: ReadAsync         , Latency: 222871    , IO Status: OK, Length: 135168, Offset: 401408
```

Reviewed By: anand1976

Differential Revision: D35601634

Pulled By: akankshamahajan15

fbshipit-source-id: 5a4f32a850af878efa0767bd5706380152a1f26e
2022-05-25 19:47:03 -07:00
Levi Tamasi af7ae912e2 Fix potential ambiguities in/around port/sys_time.h (#10045)
Summary:
There are some time-related POSIX APIs that are not available on Windows
(e.g. `localtime_r`), which we have worked around by providing our own
implementations in `port/sys_time.h`. This workaround actually relies on
some ambiguity: on Windows, a call to `localtime_r` calls
`ROCKSDB_NAMESPACE::port::localtime_r` (which is pulled into
`ROCKSDB_NAMESPACE` by a using-declaration), while on other platforms
it calls the global `localtime_r`. This works fine as long as there is only one
candidate function; however, it breaks down when there is more than one
`localtime_r` visible in a scope.

The patch fixes this by introducing `ROCKSDB_NAMESPACE::port::{TimeVal, GetTimeOfDay, LocalTimeR}`
to eliminate any ambiguity.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10045

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D36639372

Pulled By: ltamasi

fbshipit-source-id: fc13dbfa421b7c8918111a6d9e24ce77e91a7c50
2022-05-24 18:20:17 -07:00
Changyu Bi 8515bd50c9 Support read rate-limiting in SequentialFileReader (#9973)
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).

The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973

Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
  - Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
  - Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
  - backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
  - restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)

```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000

# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
   ./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
  # Restore
  #./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```

Reviewed By: hx235

Differential Revision: D36327418

Pulled By: cbi42

fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
2022-05-24 10:28:57 -07:00
mrambacher b11ff347b4 Use STATIC_AVOID_DESTRUCTION for static objects with non-trivial destructors (#9958)
Summary:
Changed the static objects that had non-trivial destructors to use the STATIC_AVOID_DESTRUCTION construct.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9958

Reviewed By: pdillinger

Differential Revision: D36442982

Pulled By: mrambacher

fbshipit-source-id: 029d47b1374d30d198bfede369a4c0ae7a4eb519
2022-05-17 09:39:22 -07:00
mrambacher 204a42ca97 Added GetFactoryCount/Names/Types to ObjectRegistry (#9358)
Summary:
These methods allow for more thorough testing of the ObjectRegistry and Customizable infrastructure in a simpler manner.  With this change, the Customizable tests can now check what factories are registered and attempt to create each of them in a systematic fashion.

With this change, I think all of the factories registered with the ObjectRegistry/CreateFromString are now tested via the customizable_test classes.

Note that there were a few other minor changes.  There was a "posix://*" register with the ObjectRegistry which was missed during the PatternEntry conversion -- these changes found that.  The nickname and default names for the FileSystem classes was also inverted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9358

Reviewed By: pdillinger

Differential Revision: D33433542

Pulled By: mrambacher

fbshipit-source-id: 9a32da74e6620745b4eeffb2712be70eeeadfa7e
2022-05-16 09:44:43 -07:00
mrambacher bfc6a8ee4a Option type info functions (#9411)
Summary:
Add methods to set the various functions (Parse, Serialize, Equals) to the OptionTypeInfo.  These methods simplify the number of constructors required for OptionTypeInfo and make the code a little clearer.

Add functions to the OptionTypeInfo for Prepare and Validate.  These methods allow types other than Configurable and Customizable to have Prepare and Validate logic.  These methods could be used by an option to guarantee that its settings were in a range or that a value was initialized.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9411

Reviewed By: pdillinger

Differential Revision: D36174849

Pulled By: mrambacher

fbshipit-source-id: 72517d8c6bab4723788a4c1a9e16590bff870125
2022-05-13 04:57:08 -07:00
sdong 736a7b5433 Remove own ToString() (#9955)
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955

Test Plan: Watch CI tests

Reviewed By: riversand963

Differential Revision: D36176799

fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
2022-05-06 13:03:58 -07:00
Akanksha Mahajan fce65e7e4f Fix bug in async_io path which reads incorrect length (#9916)
Summary:
In FilePrefetchBuffer, in case data is overlapping between two
buffers and more data is required to read and copy that to third buffer,
incorrect length was updated resulting in
```
Iterator diverged from control iterator which has value 00000000000310C3000000000000012B0000000000000274 total_order_seek: 1 auto_prefix_mode: 0 S 000000000002C37F000000000000012B000000000000001C NNNPPPPPNN; total_order_seek: 1 auto_prefix_mode: 0 S 000000000002F10B00000000000000BF78787878787878 NNNPNNNNPN; total_order_seek: 1 auto_prefix_mode: 0 S 00000000000310C3000000000000012B000000000000026B
iterator is not valid
Control CF default
db_stress: db_stress_tool/db_stress_test_base.cc:1388: void rocksdb::StressTest::VerifyIterator(rocksdb::ThreadState*, rocksdb::ColumnFamilyHandle*, const rocksdb::ReadOptions&, rocksdb::Iterator*, rocksdb::Iterator*, rocksdb::StressTest::LastIterateOp, const rocksdb::Slice&, const string&, bool*): Assertion `false' failed.
Aborted (core dumped)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9916

Test Plan:
```
- CircleCI jobs
- Ran db_stress with OPTIONS file which caught the bug
 ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=42.26248932628998 --bottommost_compression_type=disable --cache_index_and_filter_blocks=0 --cache_size=8388608 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_buffer_bytes=1073741823 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/ --db_write_buffer_size=134217728 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --fail_if_options_file_error=0 --file_checksum_impl=none --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=12 --index_type=2 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=0 --mark_for_compaction_one_file_in=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_prefix_bloom_size_ratio=0.001 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --read_only=0 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --reserve_table_reader_memory=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_catch_up_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_blob_db=0 --use_block_based_filter=0 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=0 --use_txn=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35 --options_file=/home/akankshamahajan/OPTIONS.orig -column_families=1

db_bench with async_io enabled to make sure db_bench completes successfully without any failure.
- ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1
```

crash_test in progress

Reviewed By: anand1976

Differential Revision: D35985789

Pulled By: akankshamahajan15

fbshipit-source-id: 5abe185f34caa99ca587d4bdc8954bd0802b1bf9
2022-04-27 22:33:29 -07:00
Akanksha Mahajan 5bd374b392 Add experimental new FS API AbortIO to cancel read request (#9901)
Summary:
Add experimental new API AbortIO in FileSystem to abort the
read requests submitted asynchronously through ReadAsync API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9901

Test Plan: Existing tests

Reviewed By: anand1976

Differential Revision: D35885591

Pulled By: akankshamahajan15

fbshipit-source-id: df3944e6e9e6e487af1fa688376b4abb6837fb02
2022-04-25 14:20:03 -07:00
Akanksha Mahajan 36bc3da97f Fix segfault in FilePrefetchBuffer with async_io enabled (#9777)
Summary:
If FilePrefetchBuffer object is destroyed and then later Poll() calls callback on object which has been destroyed, it gives segfault on accessing destroyed object. It was caught after adding unit tests that tests Posix implementation of ReadAsync and Poll APIs.
This PR also updates and fixes existing IOURing tests which were not running locally because RocksDbIOUringEnable function wasn't defined and IOUring was disabled for those tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9777

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D35254002

Pulled By: akankshamahajan15

fbshipit-source-id: 68e80054ffb14ae25c255920ebc6548ca5f130a1
2022-04-04 15:35:43 -07:00
KNOEEE cb4d188a34 Fix a bug in PosixClock (#9695)
Summary:
Multiplier here should be 1e6 to get microseconds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9695

Reviewed By: ajkr

Differential Revision: D34897086

Pulled By: jay-zhuang

fbshipit-source-id: 9c1d0811ea740ba0a007edc2da199edbd000b88b
2022-03-21 16:11:02 -07:00
Yanqin Jin 3bdbf67e1a Fix race condition caused by concurrent accesses to forceMmapOff_ when opening Posix WritableFile (#9685)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9685

Our TSAN reports a race condition as follows when running test
```
gtest-parallel -r 100 ./external_sst_file_test --gtest_filter=ExternalSSTFileTest.MultiThreaded
```
leads to the following

```
WARNING: ThreadSanitizer: data race (pid=2683148)
  Write of size 1 at 0x556fede63340 by thread T7:
    #0 rocksdb::(anonymous namespace)::PosixFileSystem::OpenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, bool, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/env/fs_posix.cc:334 (external_sst_file_test+0xb61ac4)
    #1 rocksdb::(anonymous namespace)::PosixFileSystem::ReopenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/env/fs_posix.cc:382 (external_sst_file_test+0xb5ba96)
    #2 rocksdb::CompositeEnv::ReopenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::WritableFile, std::default_delete<rocksdb::WritableFile> >*, rocksdb::EnvOptions const&) internal_repo_rocksdb/repo/env/composite_env.cc:334 (external_sst_file_test+0xa6ab7f)
    #3 rocksdb::EnvWrapper::ReopenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::WritableFile, std::default_delete<rocksdb::WritableFile> >*, rocksdb::EnvOptions const&) internal_repo_rocksdb/repo/include/rocksdb/env.h:1428 (external_sst_file_test+0x561f3e)
Previous read of size 1 at 0x556fede63340 by thread T4:
    #0 rocksdb::(anonymous namespace)::PosixFileSystem::OpenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, bool, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/env/fs_posix.cc:328 (external_sst_file_test+0xb61a70)
    #1 rocksdb::(anonymous namespace)::PosixFileSystem::ReopenWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator
...
```

Fix by making sure the following block gets executed only once:
```
      if (!checkedDiskForMmap_) {
        // this will be executed once in the program's lifetime.
        // do not use mmapWrite on non ext-3/xfs/tmpfs systems.
        if (!SupportsFastAllocate(fname)) {
          forceMmapOff_ = true;
        }
        checkedDiskForMmap_ = true;
      }
```

Reviewed By: pdillinger

Differential Revision: D34780308

fbshipit-source-id: b761f66b24c8b5b8389d86ea371c8542b8d869d5
2022-03-17 19:50:30 -07:00
Jay Zhuang efd767d14a Fix build for io_uring (#9690)
Summary:
Minor fix for build failure:
```
./env/io_posix.h:68:33: error: unused parameter 'len' [-Werror=unused-parameter]
   68 |                          size_t len, size_t iov_len, bool async_read,
      |                          ~~~~~~~^~~
```
Only happens for release build with io_uring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9690

Test Plan: build pass with io_uring

Reviewed By: akankshamahajan15

Differential Revision: D34846347

Pulled By: jay-zhuang

fbshipit-source-id: 2d7afb585097262d7722ef1beac486fc8ef28419
2022-03-12 22:12:18 -08:00
Akanksha Mahajan 8465cccde2 Posix API support for Async Read and Poll APIs (#9578)
Summary:
Provide support for Async Read and Poll in Posix file system using IOUring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9578

Test Plan: In progress

Reviewed By: anand1976

Differential Revision: D34690256

Pulled By: akankshamahajan15

fbshipit-source-id: 291cbd1380a3cb904b726c34c0560d1b2ce44a2e
2022-03-10 18:28:31 -08:00
anand76 7574841aac Fix issue #9627 (#9657)
Summary:
SMB mounts do not support hard links. The ENOTSUP error code is
returned, which should be interpreted by PosixFileSystem as
IOStatus::NotSupported().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9657

Reviewed By: mrambacher

Differential Revision: D34634783

Pulled By: anand1976

fbshipit-source-id: 0d57f5b2e6118e4c20e9ed1a293327428c3aecac
2022-03-07 11:39:31 -08:00
Peter Dillinger ce60d0cbe5 Test refactoring for Backups+Temperatures (#9655)
Summary:
In preparation for more support for file Temperatures in BackupEngine,
this change does some test refactoring:
* Move DBTest2::BackupFileTemperature test to
BackupEngineTest::FileTemperatures, with some updates to make it work
in the new home. This test will soon be expanded for deeper backup work.
* Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
support sharing because of above moved test, but split off the "no link"
part to the test needing it.
* Use custom FileSystems in backupable_db_test rather than custom Envs,
because going through Env file interfaces doesn't support temperatures.
* Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
parameter to FsyncWithDirOptions, which was required because this
limitation caused a crash only after moving to higher fidelity of
FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
parameter details)
* `backupable_options_` -> `engine_options_` as part of the ongoing
work to get rid of the obsolete "backupable" naming.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655

Test Plan: test code updates only

Reviewed By: jay-zhuang

Differential Revision: D34622183

Pulled By: pdillinger

fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
2022-03-04 12:32:30 -08:00
Akanksha Mahajan d74468e348 Update Poll and ReadAsync API in File System (#9623)
Summary:
Update the signature of Poll and ReadAsync APIs in filesystem.
Instead of unique_ptr, void** will be passed as io_handle and the delete function.
io_handle and delete function should be provided by underlying
FileSystem and its lifetime will be maintained by RocksDB. io_handle
will be deleted by RocksDB once callback is made to update the results or Poll is
called to get the results.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9623

Test Plan: Add a new unit test.

Reviewed By: anand1976

Differential Revision: D34403529

Pulled By: akankshamahajan15

fbshipit-source-id: ea185a5f4c7bec334631e4f781ea7ba4135645f0
2022-03-01 17:11:42 -08:00
mrambacher fe9d495112 Return different Status based on ObjectRegistry::NewObject calls (#9333)
Summary:
This fix addresses https://github.com/facebook/rocksdb/issues/9299.

If attempting to create a new object via the ObjectRegistry and a factory is not found, the ObjectRegistry will return a "NotSupported" status.  This is the same behavior as previously.

If the factory is found but could not successfully create the object, an "InvalidArgument" status is returned.  If the factory returned a reason why (in the errmsg), this message will be in the returned status.

In practice, there are two options in the ConfigOptions that control how these errors are propagated:
- If "ignore_unknown_options=true", then both InvalidArgument and NotSupported status codes will be swallowed internally.  Both cases will return success
- If "ignore_unsupported_options=true", then having no factory will return success but a failing factory will return an error
- If both options are false, both cases (no and failing factory) will return errors.

In practice this likely only changes Customizable that may be partially available.  For example, the JEMallocMemoryAllocator is a built-in allocator that is registered with the system but may not be compiled in.  In this case, the status code for this allocator changed from NotSupported("JEMalloc not available") to InvalidArgumen("JEMalloc not available").  Other Customizable builtins/plugins would have the same semantics.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9333

Reviewed By: pdillinger

Differential Revision: D33517681

Pulled By: mrambacher

fbshipit-source-id: 8033052d4a4a7b88c2d9f90147b1b4467e51f6fd
2022-02-11 05:11:24 -08:00
Jay Zhuang b69f4360ea Fix flaky test EnvPosixTestWithParam.RunMany (#9502)
Summary:
Thread-pool pops a thread function and then run the function,
which may cause thread-pool is empty but the last function is still
running.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9502

Test Plan:
`gtest-parallel ./env_test
--gtest_filter=DefaultEnvWithoutDirectIO/EnvPosixTestWithParam.RunMany/0
-r 10000 -w 1000`

Reviewed By: ajkr

Differential Revision: D34011184

Pulled By: jay-zhuang

fbshipit-source-id: 8c38bef155205bef96fd1c988dcc643a6b2ac270
2022-02-07 09:07:59 -08:00
Peter Dillinger fd3e0f43b3 Require C++17 (#9481)
Summary:
Drop support for some old compilers by requiring C++17 standard
(or higher). See https://github.com/facebook/rocksdb/issues/9388

First modification based on this is to remove some conditional compilation in slice.h (also
better for ODR)

Also in this PR:
* Fix some Makefile formatting that seems to affect ASSERT_STATUS_CHECKED config in
some cases
* Add c_test to NON_PARALLEL_TEST in Makefile
* Fix a clang-analyze reported "potential leak" in lru_cache_test
* Better "compatibility" definition of DEFINE_uint32 for old versions of gflags
* Fix a linking problem with shared libraries in Makefile (`./random_test: error while loading shared libraries: librocksdb.so.6.29: cannot open shared object file: No such file or directory`)
* Always set ROCKSDB_SUPPORT_THREAD_LOCAL and use thread_local (from C++11)
  * TODO in later PR: clean up that obsolete flag
* Fix a cosmetic typo in c.h (https://github.com/facebook/rocksdb/issues/9488)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9481

Test Plan:
CircleCI config substantially updated.

* Upgrade to latest Ubuntu images for each release
* Generally prefer Ubuntu 20, but keep a couple Ubuntu 16 builds with oldest supported
compilers, to ensure compatibility
* Remove .circleci/cat_ignore_eagain except for Ubuntu 16 builds, because this is to work
around a kernel bug that should not affect anything but Ubuntu 16.
* Remove designated gcc-9 build, because the default linux build now uses GCC 9 from
Ubuntu 20.
* Add some `apt-key add` to fix some apt "couldn't be verified" errors
* Generally drop SKIP_LINK=1; work-around no longer needed
* Generally `add-apt-repository` before `apt-get update` as manual testing indicated the
reverse might not work.

Travis:
* Use gcc-7 by default (remove specific gcc-7 and gcc-4.8 builds)
* TODO in later PR: fix s390x "Assembler messages: Error: invalid switch -march=z14" failure

AppVeyor:
* Completely dropped because we are dropping VS2015 support and CircleCI covers
VS >= 2017

Also local testing with old gflags (out of necessity when using ROCKSDB_NO_FBCODE=1).

Reviewed By: mrambacher

Differential Revision: D33946377

Pulled By: pdillinger

fbshipit-source-id: ae077c823905b45370a26c0103ada119459da6c1
2022-02-04 17:13:10 -08:00
mrambacher aae3093719 Introduce a CountedFileSystem for counting file operations (#9283)
Summary:
Added a CountedFileSystem that tracks a number of file operations (opens, closes, deletes, renames, flushes, syncs, fsyncs, reads, writes).    This class was based on the ReportFileOpEnv from db_bench.

This is a stepping stone PR to be able to change the SpecialEnv into a SpecialFileSystem, where several of the file varieties wish to do operation counting.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9283

Reviewed By: pdillinger

Differential Revision: D33062004

Pulled By: mrambacher

fbshipit-source-id: d0d297a7fb9c48c06cbf685e5fa755c27193b6f5
2022-02-03 15:01:23 -08:00
Yanqin Jin 50135c1bf3 Move HDFS support to separate repo (#9170)
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.

Java/JNI is not included yet, and needs to be done later if necessary.

The goal is to include this commit in RocksDB 7.0 release.

Reference:
https://github.com/ajkr/dedupfs by ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170

Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.

make check

Reviewed By: ajkr

Differential Revision: D33751662

Pulled By: riversand963

fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
2022-01-24 20:23:54 -08:00
sdong 1cecd22de9 Increase wait time within EnvPosixTestWithParam.RunMany (#9413)
Summary:
We see:

[ RUN      ] ChrootEnvWithDirectIO/EnvPosixTestWithParam.RunMany/0
env/env_test.cc:464: Failure
Expected equality of these values:
  4
  cur
    Which is: 0

The suspicious is that the wait time is not long enough. Increase the wait time to 10s and allows earlier check.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9413

Test Plan: Run the test

Reviewed By: riversand963

Differential Revision: D33697715

fbshipit-source-id: 3d71715562a8cceb694b773276dd9e4e451a18bc
2022-01-24 12:50:18 -08:00
Peter Dillinger e7ac7363b4 Add to HISTORY and minor loose ends from #9294, #9254 (#9386)
Summary:
Loose ends relate to mmap on 32-bit systems. (Testing is more
complicated when the feature was completely disabled on 32-bit.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9386

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D33590715

Pulled By: pdillinger

fbshipit-source-id: f2637036a538a552200adee65b6765fce8cae27b
2022-01-21 13:04:19 -08:00
Yanqin Jin 1a8e9f0e07 Use fcntl(F_FULLFSYNC) on OS X (#9356)
Summary:
Closing https://github.com/facebook/rocksdb/issues/5954

fsync/fdatasync on Linux:
```
(fsync/fdatasync) includes writing through or flushing a disk cache if present.
```

However, on OS X and iOS:
```
(fsync) will flush all data from the host to the drive (i.e. the "permanent storage device"),
the drive itself may not physically write the data to the platters for quite some time and it
may be written in an out-of-order sequence.
```

Solution is to use `fcntl(F_FULLFSYNC)` on OS X so that we get the same
persistence guarantee.

According to OSX man page,
```
The F_FULLFSYNC fcntl asks the drive to flush **all** buffered data to permanent storage.
```
This suggests that it will be no faster than `fsync` on Linux, since Linux, according to its man page,
```
writing through or flushing a disk cache if present
```
It means Linux may not flush **all** data from disk cache.

This is similar to bug reports/fixes in:
- golang: https://github.com/golang/go/issues/26650
- leveldb: 296de8d5b8.

Not sure if we should fallback to fsync since we break persistence contract.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9356

Reviewed By: jay-zhuang

Differential Revision: D33417416

Pulled By: riversand963

fbshipit-source-id: 475548ff9c5eaccde325e0f6842694271cbc8cb7
2022-01-18 20:23:11 -08:00
Si Ke 93b1de4f45 Enable db_test running in Centos 32 bit OS and Alpine 32 bit OS (#9294)
Summary:
Closes https://github.com/facebook/rocksdb/issues/9271

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9294

Reviewed By: riversand963, hx235

Differential Revision: D33586002

Pulled By: pdillinger

fbshipit-source-id: 3d1a2fa71023e108613ff03dbd37a5f954fc4920
2022-01-14 11:58:18 -08:00
mrambacher 1973fcba11 Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362)
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions.  The PatternEntry methods were left in place but renamed to AddFactory.  The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.

Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362

Reviewed By: pdillinger

Differential Revision: D33432562

Pulled By: mrambacher

fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
2022-01-11 06:33:48 -08:00
mrambacher fe31dc53ca Make the Env class Customizable (#9293)
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293

Reviewed By: pdillinger, zhichao-cao

Differential Revision: D33181591

Pulled By: mrambacher

fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
2022-01-04 16:45:49 -08:00
mrambacher 1c39b7952b Remove/Reduce use of Regex in ObjectRegistry/Library (#9264)
Summary:
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex.  For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes.  For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes.  There is also a class for Customizable::IndividualId format matches.

Added tests for the new derivative classes and got all unit tests to pass.

Resolves https://github.com/facebook/rocksdb/issues/9225.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264

Reviewed By: pdillinger

Differential Revision: D33062001

Pulled By: mrambacher

fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03
2021-12-29 07:56:23 -08:00
Peter Dillinger 0050a73a4f New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).

The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.

This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)

The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.

Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.

Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126

Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.

### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)

### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime

We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:

```
Total cache or DBs size: 32TiB  Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```

These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key.  With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.

More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day

After enough data, we get a result at the end:

```
(keep 40 bits)  17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```

If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:

```
(keep 41 bits)  16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits)  19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```

The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:

```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```

I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.

Reviewed By: zhichao-cao

Differential Revision: D33171746

Pulled By: pdillinger

fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:15:13 -08:00
Adam Retter c879910102 Fix fstatfs call for compilation on 32 bit systems (#9251)
Summary:
On some 32-bit systems, BTRFS_SUPER_MAGIC is unsigned while __fsword_t is signed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9251

Reviewed By: ajkr

Differential Revision: D32961651

Pulled By: pdillinger

fbshipit-source-id: 78e85fc1336f304a21e4d5961e60957c90daed63
2021-12-08 22:01:23 -08:00
jsteemann 5384f0af6e Fix compile warnings (#9199)
Summary:
* added missing override specifiers for overriden methods
  this fixes compiler warnings emitted by g++ and clang++ when compile option `-Wsuggest-override` is turned on.
* fix compile warning with -Wmaybe-uninitialized
  g++-11 warns about a _potentially_ uninitialized variable when using `-Wmaybe_uninitialized`:
  ```
      env/env.cc: In member function ‘virtual rocksdb::Status rocksdb::Env::GetHostNameString(std::string*)’:
      env/env.cc:738:66: error: ‘hostname_buf’ may be used uninitialized [-Werror=maybe-uninitialized]
        738 |   Status s = GetHostName(hostname_buf.data(), hostname_buf.size());
            |                                                                  ^
      In file included from /usr/include/c++/11/tuple:39,
                       from /usr/include/c++/11/functional:54,
                       from ./include/rocksdb/env.h:22,
                       from env/env.cc:10:
      /usr/include/c++/11/array:176:7: note: by argument 1 of type ‘const std::array<char, 256>*’ to ‘constexpr std::array<_Tp, _Nm>::size_type std::array<_Tp, _Nm>::size() const [with _Tp = char; long unsigned int _Nm = 256]’ declared here
        176 |       size() const noexcept { return _Nm; }
            |       ^~~~
      env/env.cc:737:37: note: ‘hostname_buf’ declared here
        737 |   std::array<char, kMaxHostNameLen> hostname_buf;
  ```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9199

Reviewed By: jay-zhuang

Differential Revision: D32630703

Pulled By: pdillinger

fbshipit-source-id: 9ea3010b1105a582548e3c3c0db4475b201e4a10
2021-11-24 11:19:06 -08:00
Andrew Kryczka 9bb13c56b3 Use system-wide thread ID in info log lines (#9164)
Summary:
This makes it easier to debug with tools like `ps`. The change only
applies to builds with glibc 2.30+ and _GNU_SOURCE extensions enabled.
We could adopt it in more cases by using the syscall but this is enough
for our build.

Replaces https://github.com/facebook/rocksdb/issues/2973.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9164

Test Plan:
- ran some benchmarks and correlated logged thread IDs with those shown by `ps -L`.
- verified no noticeable regression in throughput for log heavy (more than 700k log lines and over 5k / second) scenario.

Benchmark command:

```
$ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom -compression_type=none -max_bytes_for_level_multiplier=2 -write_buffer_size=262144 -num_levels=7 -max_bytes_for_level_base=2097152 -target_file_size_base=524288 -level_compaction_dynamic_level_bytes=true -max_background_jobs=12 -num=20000000
```

Results before: 15.9MB/s, 15.8MB/s, 16.0MB/s
Results after: 16.3MB/s, 16.3MB/s, 15.8MB/s

- Rely on CI to test the fallback behavior

Reviewed By: riversand963

Differential Revision: D32399660

Pulled By: ajkr

fbshipit-source-id: c24d44fdf7782faa616ef0a0964eaca3539d9c24
2021-11-12 19:46:06 -08:00
Jay Zhuang 29102641dd Skip directory fsync for filesystem btrfs (#8903)
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
   new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
   synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
   to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
   `IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
   forced, the same as above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903

Test Plan: run tests on btrfs and non btrfs

Reviewed By: ajkr

Differential Revision: D30885059

Pulled By: jay-zhuang

fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
2021-11-03 12:21:27 -07:00
mrambacher f72c834eab Make FileSystem a Customizable Class (#8649)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649

Reviewed By: zhichao-cao

Differential Revision: D32036059

Pulled By: mrambacher

fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295
2021-11-02 09:07:11 -07:00
Peter Dillinger ad5325a736 Experimental support for SST unique IDs (#8990)
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties
which computes a universally unique identifier based on table properties
of table files from recent RocksDB versions.
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).

Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990

Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D31582865

Pulled By: pdillinger

fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
2021-10-18 23:32:01 -07:00
jsteemann ab6755711b add IsSyncThreadSafe() override to EncryptedWritableFile (#8993)
Summary:
EncryptedWritableFile is derived from FSWritableFile, which implements
the `IsSyncThreadSafe()` function as

    bool IsSyncThreadSafe() const { return false; }

EncryptedWritableFile does not override this method from the base class,
so the `IsSyncThreadSafe()` function on an EncryptedWritableFile will
always return false.
This change adds an override of `IsSyncThreadSafe()` to
EncryptedWritableFile so that the latter will now ask its underlying
`file_` object for the thread-safety of sync operations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8993

Reviewed By: jay-zhuang

Differential Revision: D31613123

Pulled By: ajkr

fbshipit-source-id: b18625e21a9911744eef3215c29913490e4b6001
2021-10-14 16:14:26 -07:00
mrambacher 13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
mrambacher 6924869867 Make SystemClock into a Customizable Class (#8636)
Summary:
Made SystemClock into a Customizable class, complete with CreateFromString.

Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv).

Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636

Reviewed By: zhichao-cao

Differential Revision: D30483360

Pulled By: mrambacher

fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1
2021-09-21 09:23:48 -07:00