Commit graph

91 commits

Author SHA1 Message Date
Yanqin Jin c93ba7db5d Revise LockWAL/UnlockWAL implementation (#11020)
Summary:
RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and
releases the internal `DBImpl::log_write_mutex_`.

According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288
> Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_.

This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application
should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing
the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()`
is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`.

There are two issues:
1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after
calling `LockWAL()` because log_write_mutex is not recursive.
2. Two threads may dead lock due to lock order inversion.

To fix these issues, we can modify the implementation of LockWAL so that it does not keep
`log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can
instead manually inject a write stall so that all future writes will be stopped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D41785203

Pulled By: riversand963

fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce
2022-12-13 21:45:00 -08:00
Hui Xiao 15bb4ea084 Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL (#11016)
Summary:
**Context/Summary:**
Credit to ajkr's https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134,
flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to `Flush()` called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true).  Fix this by making checking `sync_point_called == true` after obsoleted WAL is found and `SyncWAL()` is called. Also rename the "sync_point_called" to be something more specific.

Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11016

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D41717786

Pulled By: hx235

fbshipit-source-id: ad1e4701a987285bbe6c8e7d9b05c4db06b4edf4
2022-12-06 18:31:43 -08:00
Hui Xiao 2f76ab150d Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892)
Summary:
**Context**
`Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown.

`DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.
And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open.
- Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`.

As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown.

**Summary**
This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest.

**Test**
- Make check
- Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after
- [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears
- [Ongoing]Regular CI stress test to confirm such fix didn't harm anything

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892

Reviewed By: ajkr

Differential Revision: D40778965

Pulled By: hx235

fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f
2022-11-29 14:14:43 -08:00
Yanqin Jin 538df26fcc Deflake DBWALTest.RaceInstallFlushResultsWithWalObsoletion (#10456)
Summary:
Existing DBWALTest.RaceInstallFlushResultsWithWalObsoletion test relies
on a specific interleaving of two background flush threads. We call them
bg1 and bg2, and assume bg1 starts to install flush results ahead of
bg2. After bg1 enters `ProcessManifestWrites`, bg1 waits for bg2 to also
enter `MemTableList::TryInstallMemtableFlushResults()` before bg1 can
proceed with MANIFEST write. However, if bg2 called `SyncClosedLogs()`
and needed to commit to the MANIFEST but falls behind bg1, then bg2
needs to wait for bg1 to finish writing to MANIFEST. This is a circular
dependency.

Fix this by allowing bg2 to start only after bg1 grabs the chance to
sync the WAL and commit to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10456

Test Plan:
1. make check

2. export TEST_TMPDIR=/dev/shm && gtest-parallel -r 1000 -w 32 ./db_wal_test --gtest_filter=DBWALTest.RaceInstallFlushResultsWithWalObsoletion

Reviewed By: ltamasi

Differential Revision: D38391856

Pulled By: riversand963

fbshipit-source-id: 55f647d5b94e534c008a4dd2fb082675ddf58c96
2022-08-04 12:14:28 -07:00
Changyu Bi 5f9fe7f21e Added WAL compression checksum (#10319)
Summary:
Enabled zstd checksum flag in StreamingCompress so that WAL (de)compreression is protected by a checksum per compression frame.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10319

Test Plan:
- `make check`
- WAL perf: average ops/sec over 10 runs is 161226 pre PR and 159635 post PR (1% drop).
```
sudo TEST_TMPDIR=/dev/shm/memtable_write ./db_bench_checksum -benchmarks=fillseq -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=10 -wal_compression=zstd
```

Reviewed By: ajkr

Differential Revision: D37673311

Pulled By: cbi42

fbshipit-source-id: 9f34a3bfc2a82e5c80b1ec63bb339a7465108ec9
2022-07-13 15:29:20 -07:00
Yanqin Jin b012d23557 Destroy iniital db dir for a test in DBWALTest (#10221)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10221

Reviewed By: hx235

Differential Revision: D37316280

Pulled By: riversand963

fbshipit-source-id: 062781acec2f36beebc62003bcc8ec280488d572
2022-06-21 11:27:10 -07:00
Gang Liao deff48bcef Add blob source to retrieve blobs in RocksDB (#10198)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB.  BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.

Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198

Reviewed By: ltamasi

Differential Revision: D37294735

Pulled By: gangliao

fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
2022-06-20 20:58:11 -07:00
Akanksha Mahajan 8273435c22 Bypass tests instead of skipping to resolve internal failure (#10148)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10148

Reviewed By: hx235

Differential Revision: D37092202

Pulled By: akankshamahajan15

fbshipit-source-id: 12fae5641a1c4ab584e586db95f4044273aba23a
2022-06-12 12:05:11 -07:00
Akanksha Mahajan f85b31a2e9 Fix bug for WalManager with compressed WAL (#10130)
Summary:
RocksDB uses WalManager to manage WAL files. In WalManager::ReadFirstLine(), the assumption is that reading the first record of a valid WAL file will return OK status and set the output sequence to non-zero value.
This assumption has been broken by WAL compression which writes a `kSetCompressionType` record which is not associated with any sequence number.
Consequently, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to caller, e.g. Checkpoint, Backup, causing the operations to fail.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10130

Test Plan: - Newly Added test

Reviewed By: riversand963

Differential Revision: D36985744

Pulled By: akankshamahajan15

fbshipit-source-id: dfde7b3be68b6a30b75b49479779748eedf29f7f
2022-06-08 14:16:43 -07:00
Peter Dillinger a00cffaf69 Reduce risk of backup or checkpoint missing a WAL file (#10083)
Summary:
We recently saw a case in crash test in which a WAL file in the
middle of the list of live WALs was not included in the backup, so the
DB was not openable due to missing WAL. We are not sure why, but this
change should at least turn that into a backup-time failure by ensuring
all the WAL files expected by the manifest (according to VersionSet) are
included in `GetSortedWalFiles()` (used by `GetLiveFilesStorageInfo()`,
`BackupEngine`, and `Checkpoint`)

Related: to maximize the effectiveness of
track_and_verify_wals_in_manifest with GetSortedWalFiles() during
checkpoint/backup, we will now sync WAL in GetLiveFilesStorageInfo()
when track_and_verify_wals_in_manifest=true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10083

Test Plan: added new unit test for the check in GetSortedWalFiles()

Reviewed By: ajkr

Differential Revision: D36791608

Pulled By: pdillinger

fbshipit-source-id: a27bcf0213fc7ab177760fede50d4375d579afa6
2022-06-01 11:02:27 -07:00
sdong 736a7b5433 Remove own ToString() (#9955)
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955

Test Plan: Watch CI tests

Reviewed By: riversand963

Differential Revision: D36176799

fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
2022-05-06 13:03:58 -07:00
sdong 49628c9a83 Use std::numeric_limits<> (#9954)
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954

Test Plan: See CI Runs.

Reviewed By: riversand963

Differential Revision: D36173954

fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
2022-05-05 13:08:21 -07:00
Akanksha Mahajan ae82d91492 Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_duing_recovery set true (#9634)
Summary:
1) In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
2) For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.

If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.

As a solution,
1. the corrupted WALs whose numbers are larger than the
corrupted wal and smaller than the new WAL will be moved to archive folder.
2. Currently, RocksDB DB::Open() may creates and writes to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9634

Test Plan:
1. Added new unit tests
                2. make crast_test -j

Reviewed By: riversand963

Differential Revision: D34463666

Pulled By: akankshamahajan15

fbshipit-source-id: e233d3af0ed4e2028ca0cf051e5a334a0fdc9d19
2022-04-11 15:39:31 -07:00
Yanqin Jin e0c84aa0dc Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data backed by 6.log.

1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

```
Time  BgFlushThread1                                    BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                     mutex_.Lock()
 |                                                     precompute min_wal_to_keep as 6
 |                                                     join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                     wake up and mutex_.Lock()
 |                                                     cfd0->log_number = 8
 |                                                     FindObsoleteFiles() with job_context->log_number == 7
 |                                                     mutex_.Unlock()
 |                                                     PurgeObsoleteFiles() deletes 6.log
 V
```

As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715

Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```

Reviewed By: ltamasi

Differential Revision: D34984412

Pulled By: riversand963

fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
2022-03-23 19:41:31 -07:00
Siddhartha Roychowdhury 39b0d92153 Add record to set WAL compression type if enabled (#9556)
Summary:
When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs.
Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556

Reviewed By: anand1976

Differential Revision: D34308216

Pulled By: sidroyc

fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
2022-02-17 16:19:31 -08:00
Levi Tamasi 320d9a8e8a Use a sorted vector instead of a map to store blob file metadata (#9526)
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.

In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526

Test Plan:
Ran `make check` and the crash test script for a while.

Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:

```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```

Final statistics before the patch:

```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```

With the patch:

```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```

Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.

Reviewed By: riversand963

Differential Revision: D34082728

Pulled By: ltamasi

fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 12:36:43 -08:00
Yanqin Jin bd513fd075 Add commit marker with timestamp (#9266)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266

This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.

Reviewed By: ltamasi

Differential Revision: D31721350

fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
2021-12-10 11:05:35 -08:00
Drewryz 3b27725245 Fix a minor issue with initializing the test path (#8555)
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555

Reviewed By: ajkr

Differential Revision: D29758399

Pulled By: jay-zhuang

fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
2021-07-23 08:38:45 -07:00
Zhichao Cao f44e69c64a Use DbSessionId as cache key prefix when secondary cache is enabled (#8360)
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360

Test Plan: add the test to lru_cache_test

Reviewed By: pdillinger

Differential Revision: D29006215

Pulled By: zhichao-cao

fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
2021-06-10 11:02:43 -07:00
Peter (Stig) Edwards c75ef03e58 Do not truncate WAL if in read_only mode (#8313)
Summary:
I noticed ```openat``` system call with ```O_WRONLY``` flag and ```sync_file_range``` and ```truncate``` on WAL file when using ```rocksdb::DB::OpenForReadOnly``` by way of ```db_bench --readonly=true --benchmarks=readseq --use_existing_db=1 --num=1 ...```

Noticed in ```strace``` after seeing the last modification time of the WAL file change after each run (with ```--readonly=true```).

  I think introduced by 7d7f14480e from https://github.com/facebook/rocksdb/pull/8122

I added a test to catch the WAL file being truncated and the modification time on it changing.
I am not sure if a mock filesystem with mock clock could be used to avoid having to sleep 1.1s.
The test could also check the set of files is the same and that the sizes are also unchanged.

Before:

```
[ RUN      ] DBBasicTest.ReadOnlyReopenMtimeUnchanged
db/db_basic_test.cc:182: Failure
Expected equality of these values:
  file_mtime_after_readonly_reopen
    Which is: 1621611136
  file_mtime_before_readonly_reopen
    Which is: 1621611135
  file is: 000010.log
[  FAILED  ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms)
```

After:

```
[ RUN      ] DBBasicTest.ReadOnlyReopenMtimeUnchanged
[       OK ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8313

Reviewed By: pdillinger

Differential Revision: D28656925

Pulled By: jay-zhuang

fbshipit-source-id: ea9e215cb53e7c830e76bc5fc75c45e21f12a1d6
2021-05-27 10:27:55 -07:00
Giuseppe Ottaviano 48cd7a3aae Fix flush reason attribution (#8150)
Summary:
Current flush reason attribution is misleading or incorrect (depending on what the original intention was):

- Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager`
- Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull`

This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150

Reviewed By: zhichao-cao

Differential Revision: D27569645

Pulled By: ot

fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc
2021-04-07 23:18:37 -07:00
anand76 7d7f14480e Always truncate the latest WAL file on DB Open (#8122)
Summary:
Currently, we only truncate the latest alive WAL files when the DB is opened. If the latest WAL file is empty or was flushed during Open, its not truncated since the file will be deleted later on in the Open path. However, before deletion, a new WAL file is created, and if the process crash loops between the new WAL file creation and deletion of the old WAL file, the preallocated space will keep accumulating and eventually use up all disk space. To prevent this, always truncate the latest WAL file, even if its empty or the data was flushed.

Tests:
Add unit tests to db_wal_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8122

Reviewed By: riversand963

Differential Revision: D27366132

Pulled By: anand1976

fbshipit-source-id: f923cc03ef033ccb32b140d36c6a63a8152f0e8e
2021-03-28 10:00:08 -07:00
Jay Zhuang 45c65d6dcf Use thread-safe strerror_r() to get error message (#8087)
Summary:
`strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from 0deef031cb/folly/String.cpp (L457)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087

Reviewed By: mrambacher

Differential Revision: D27267151

Pulled By: jay-zhuang

fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb
2021-03-24 23:07:27 -07:00
Levi Tamasi a46f080cce Break down the amount of data written during flushes/compactions per file type (#8013)
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26742156

Pulled By: ltamasi

fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
2021-03-02 09:48:00 -08:00
Zhichao Cao d1c510baec Handoff checksum Implementation (#7523)
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: add new unit test, pass make check/ make asan_check

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
mrambacher 4a09d632c4 Remove Legacy and Custom FileWrapper classes from header files (#7851)
Summary:
Removed the uses of the Legacy FileWrapper classes from the source code.  The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem.

Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851

Reviewed By: anand1976

Differential Revision: D26114816

Pulled By: mrambacher

fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150
2021-01-28 22:10:32 -08:00
Levi Tamasi c696f27432 Accumulate blob file additions in VersionEdit during recovery (#7903)
Summary:
During recovery, RocksDB performs a kind of dummy flush; namely, entries
from the WAL are added to memtables, which then get written to SSTs and
blob files (if enabled) just like during a regular flush. Note that
multiple memtables might be flushed during recovery for the same column
family, for example, if the DB is reopened with a lower write buffer size,
and therefore, we need to make sure to collect all SST and blob file
additions. The patch fixes a bug in the earlier logic which resulted in
later blob file additions overwriting earlier ones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7903

Test Plan: Added a unit test and ran `db_stress`.

Reviewed By: jay-zhuang

Differential Revision: D26110847

Pulled By: ltamasi

fbshipit-source-id: eddb50a608a88f54f3cec3a423de8235aba951fd
2021-01-27 18:46:15 -08:00
Levi Tamasi 431e8afba7 Do not explicitly flush blob files when using the integrated BlobDB (#7892)
Summary:
In the original stacked BlobDB implementation, which writes blobs to blob files
immediately and treats blob files as logs, it makes sense to flush the file after
writing each blob to protect against process crashes; however, in the integrated
implementation, which builds blob files in the background jobs, this unnecessarily
reduces performance. This patch fixes this by simply adding a `do_flush` flag to
`BlobLogWriter`, which is set to `true` by the stacked implementation and to `false`
by the new code. Note: the change itself is trivial but the tests needed some work;
since in the new implementation, blobs are now buffered, adding a blob to
`BlobFileBuilder` is no longer guaranteed to result in an actual I/O. Therefore, we can
no longer rely on `FaultInjectionTestEnv` when testing failure cases; instead, we
manipulate the return values of I/O methods directly using `SyncPoint`s.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7892

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D26022814

Pulled By: ltamasi

fbshipit-source-id: b3dce419f312137fa70d84cdd9b908fd5d60d8cd
2021-01-25 13:32:33 -08:00
Adam Retter 4926b33742 Improvements to Env::GetChildren (#7819)
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html

There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.

Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819

Reviewed By: ajkr

Differential Revision: D25837394

Pulled By: jay-zhuang

fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
2021-01-09 09:44:34 -08:00
Adam Retter 8ff6557e7f Add further tests to ASSERT_STATUS_CHECKED (2) (#7698)
Summary:
Second batch of adding more tests to ASSERT_STATUS_CHECKED.

* external_sst_file_basic_test
* checkpoint_test
* db_wal_test
* db_block_cache_test
* db_logical_block_size_cache_test
* db_blob_index_test
* optimistic_transaction_test
* transaction_test
* point_lock_manager_test
* write_prepared_transaction_test
* write_unprepared_transaction_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7698

Reviewed By: cheng-chang

Differential Revision: D25441664

Pulled By: pdillinger

fbshipit-source-id: 9e78867f32321db5d4833e95eb96c5734526ef00
2020-12-09 21:21:16 -08:00
Yanqin Jin e062a719cc Fix assertion failure in bg flush (#7362)
Summary:
https://github.com/facebook/rocksdb/issues/7340 reports and reproduces an assertion failure caused by a combination of the following:
- atomic flush is disabled.
- a column family can appear multiple times in the flush queue at the same time. This behavior was introduced in release 5.17.

Consequently, it is possible that two flushes race with each other. One bg flush thread flushes all memtables. The other thread calls `FlushMemTableToOutputFile()` afterwards, and hits the assertion error below.

```
  assert(cfd->imm()->NumNotFlushed() != 0);
  assert(cfd->imm()->IsFlushPending());
```

Fix this by reverting the behavior. In non-atomic-flush case, a column family can appear in the flush queue at most once at the same time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7362

Test Plan:
make check
Also run stress test successfully for 10 times.
```
make crash_test
```

Reviewed By: ajkr

Differential Revision: D25172996

Pulled By: riversand963

fbshipit-source-id: f1559b6366cc609e961e3fc83fae548f1fad08ce
2020-12-02 09:31:14 -08:00
Andrew Kryczka eb65d673fe Fix kPointInTimeRecovery handling of truncated WAL (#7701)
Summary:
WAL may be truncated to an incomplete record due to crash while writing
the last record or corruption. In the former case, no hole will be
produced since no ACK'd data was lost. In the latter case, a hole could
be produced without this PR since we proceeded to recover the next WAL
as if nothing happened. This PR changes the record reading code to
always report a corruption for incomplete records in
`kPointInTimeRecovery` mode, and the upper layer will only ignore them
if the next WAL has consecutive seqnum (i.e., we are guaranteed no
hole).

While this solves the hole problem for the case of incomplete
records, the possibility is still there if the WAL is corrupted by
truncation to an exact record boundary. This PR also regresses how much data
can be recovered when writes are mixed with/without
`WriteOptions::disableWAL`, as then we can not distinguish between a
seqnum gap caused by corruption and a seqnum gap caused by a `disableWAL` write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7701

Test Plan:
Interestingly there already was a test for this case
(`DBWALTestWithParams.kPointInTimeRecovery`); it just had a typo bug in
the verification that prevented it from noticing holes in recovery.

Reviewed By: anand1976

Differential Revision: D25111765

Pulled By: ajkr

fbshipit-source-id: 5e330b13b1ee2b5be096cea9d0ff6075843e57b6
2020-11-30 18:11:38 -08:00
Yanqin Jin 0b94468bba Avoid skipping a test in db_wal_test (#7628)
Summary:
Recent test report shows that some tests have been skipped.

For DBWALTest that inherits from DBTestBase, the following will always be
true, since `env_` is an instance of `SpecialEnv`, not `Env::Default()`. Thus the test
will always be skipped.

```
if (options.env != Env::Default()) {
  ROCKSDB_GTEST_SKIP("Test requires default environment");
  return;
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7628

Test Plan:
./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
MEM_ENV=1 ./db_wal_test --gtest_filter=DBWALTest.TruncateLastLogAfterRecoverWithoutFlush
make check

Reviewed By: jay-zhuang

Differential Revision: D24693006

Pulled By: riversand963

fbshipit-source-id: 7f2a772492a0f11bff17bbf5e9f493e9e9a1c125
2020-11-03 09:48:16 -08:00
mrambacher f35f7f2704 Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:

1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.

2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).

3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).

I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.

Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.

Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566

Reviewed By: zhichao-cao

Differential Revision: D24408980

Pulled By: jay-zhuang

fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 10:33:09 -07:00
Levi Tamasi a7a04b6898 Integrate BlobFileBuilder into the compaction process (#7573)
Summary:
Similarly to how https://github.com/facebook/rocksdb/issues/7345
integrated blob file writing into the flush process,
the patch adds support for writing blob files to the compaction logic.
Namely, if `enable_blob_files` is set, large values encountered during
compaction are extracted to blob files and replaced with blob indexes.
The resulting blob files are then logged to the MANIFEST as part of the
compaction job's `VersionEdit` and added to the `Version` alongside any
table files written by the compaction. Any errors during blob file building fail
the compaction job.

There will be a separate follow-up patch to perform blob garbage collection
during compactions.

In addition, the patch continues to chip away at the mess around computing
various compaction related statistics by eliminating some code duplication
and by making the `num_output_files` and `bytes_written` stats more consistent
for flushes, compactions, and recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7573

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24404696

Pulled By: ltamasi

fbshipit-source-id: 21216af3a172ad3ce8f85d11cd30923784ae426c
2020-10-26 13:51:55 -07:00
Cheng Chang fc9b416013 Fix typo in db_wal_test (#7571)
Summary:
as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7571

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D24392577

Pulled By: cheng-chang

fbshipit-source-id: c94f92db48270d0e215aa0f2782b0ff2e31bf708
2020-10-20 11:50:30 -07:00
Levi Tamasi e8cb32ed67 Introduce BlobFileCache and add support for blob files to Get() (#7540)
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.

TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24260219

Pulled By: ltamasi

fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
2020-10-15 13:04:47 -07:00
Levi Tamasi bf1aeebb6c Integrate blob file writing with recovery (#7388)
Summary:
The patch adds support for extracting large values into blob files when
performing a flush during recovery (when `avoid_flush_during_recovery` is
`false`). Blob files are built and added to the `Version` similarly to flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7388

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23709912

Pulled By: ltamasi

fbshipit-source-id: ce48b4227849cf25429ae98574e72b0e1cb9c67d
2020-09-15 17:14:10 -07:00
sdong b194c21bba Whole DBTest to skip fsync (#7274)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.

This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D23177444

fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
2020-08-17 18:42:25 -07:00
Andrew Kryczka 5d5ff82408 Disable recycle_log_file_num with kTolerateCorruptedTailRecords (#7271)
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271

Reviewed By: riversand963

Differential Revision: D23169923

Pulled By: ajkr

fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
2020-08-17 18:21:10 -07:00
Akanksha Mahajan 1f9f630b27 Store FileSystemPtr object that contains FileSystem ptr (#7180)
Summary:
As part of the IOTracing project, this PR
    1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
    2. FileSystemPtr object is created using FileSystem pointer and IOTracer
    pointer.
    3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
    4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
    returns FileSystemTracingWrapper pointer for tracing purpose and when
    it is disabled underlying FileSystem pointer is returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180

Test Plan:
make check -j64
                COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D22987117

Pulled By: akankshamahajan15

fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
2020-08-12 17:31:23 -07:00
mrambacher c7c7b07f06 More Makefile Cleanup (#7097)
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env

These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.

Tested both release and debug builds via Make and CMake for both static and shared libraries.

More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097

Reviewed By: riversand963

Differential Revision: D22463160

Pulled By: pdillinger

fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
2020-07-09 14:35:17 -07:00
Yanqin Jin f70ad03137 Parameterize a few tests in DBWALTest (#7105)
Summary:
As title. The goal is to shorten the execution time of several tests
when they are combined together in a single TEST_F.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7105

Test Plan:
make db_wal_test
./db_wal_test

Reviewed By: ltamasi

Differential Revision: D22442705

Pulled By: riversand963

fbshipit-source-id: 0ad49b8f21fa86dcd5a4d3c9a06af313735ac217
2020-07-09 11:31:06 -07:00
Peter Dillinger 52d59e0c93 Revert "Whole DBTest to skip fsync (#7049)" (#7070)
Summary:
This reverts commit 4f1534bdb0.

This commit caused failures and deadlocks in
MultiThreadedDBTest.MultiThreaded/69 and others.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7070

Reviewed By: riversand963

Differential Revision: D22358778

Pulled By: pdillinger

fbshipit-source-id: faf8f2cb469a7063a113921c8e9c64a9f7610dac
2020-07-02 10:22:43 -07:00
sdong 4f1534bdb0 Whole DBTest to skip fsync (#7049)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D22301700

fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565
2020-07-01 19:37:56 -07:00
mrambacher 394f2bbd13 Add OptionTypeInfo::Enum and related methods (#6423)
Summary:
Add methods and constructors for handling enums to the OptionTypeInfo.  This change allows enums to be converted/compared without adding a special "type" to the OptionType.

This change addresses a couple of issues:
- It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods)
- It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes
- It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values
- It reduces the number of types and cases that need to be handled in the various OptionType methods
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423

Reviewed By: siying

Differential Revision: D21408713

Pulled By: zhichao-cao

fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626
2020-05-05 15:04:04 -07:00
Yuqi Gu e171a219d5 Fix db_wal_test::TruncateLastLogAfterRecoverWithoutFlush failure (#6437)
Summary:
`TruncateLastLogAfterRecoverWithoutFlush` case depends on fallocate support
of underlying file system.

On a file system which lacks of this feature, like zfs, it will fail to allocate predefined file size as this test case intends to do;

So a check block is added to detect fallocate support and skip test if not.
The related work is done by JunHe77. Thanks!

Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6437

Differential Revision: D20145032

Pulled By: pdillinger

fbshipit-source-id: c8b691dc508e95acfa2a004ddbc07e2faa76680d
2020-03-05 17:18:16 -08:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Andrew Kryczka c6abe30ee3 Fix concurrent full purge and WAL recycling (#5900)
Summary:
We were removing the file from `log_recycle_files_` before renaming it
with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside
the DB mutex, it was possible for a concurrent full purge to sneak in
and delete the file before it could be renamed. Consequently, `SwitchMemtable()`
would fail and the DB would enter read-only mode.

The fix is to hold the old file number in `log_recycle_files_` until
after the file has been renamed. Full purge uses that list to decide
which files to keep, so it can no longer delete a file pending recycling.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900

Test Plan: new unit test

Differential Revision: D19771719

Pulled By: ajkr

fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8
2020-02-18 13:54:13 -08:00
anand76 afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00