Summary:
We recently noticed that some memtable flushed and file
ingestions could proceed during LockWAL, in violation of its stated
contract. (Note: we aren't 100% sure its actually needed by MySQL, but
we want it to be in a clean state nonetheless.)
Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I
found a place to wait to wait for LockWAL to be cleared before allowing
these operations to proceed: WaitForPendingWrites()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652
Test Plan:
Added to unit tests. Extended how db_stress validates LockWAL
and re-enabled combination of ingestion and LockWAL in crash test, in
follow-up to https://github.com/facebook/rocksdb/issues/12642
Ran blackbox_crash_test for a long while with relevant features
amplified.
Suggested follow-up: fix FaultInjectionTestFS to report file sizes
consistent with what the user has requested to be flushed.
Reviewed By: jowlyzhang
Differential Revision: D57622142
Pulled By: pdillinger
fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce
Summary:
When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`.
We could fix this by either 1) Writing a special record when reusing a log file, or 2) Implement more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) Not reuse files created by a previous DB session. We choose option 3 as its the simplest, and flipping `recycle_log_file_num` is expected to be a rare event.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591
Test Plan: 1. Add a unit test to verify the bug and fix
Reviewed By: jowlyzhang
Differential Revision: D56655812
Pulled By: anand1976
fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87
Summary:
Example failure (cannot reproduce):
```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBWriteTestInstance/DBWriteTest
[ RUN ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
[ FAILED ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0 (49 ms)
[----------] 1 test from DBWriteTestInstance/DBWriteTest (49 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (49 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0
```
I have no idea why `SyncWAL()` would not be supported from what is presumably a `SpecialEnv` so added more debug info in case it fails again in CI. The last failure was https://github.com/facebook/rocksdb/actions/runs/8731304938/job/23956487511?fbclid=IwAR2jyXgVQtCezri3axV5MwMdI7D6VIudMk1xkiN_FL9-x2dkBv4IqIjjgB4 and it only happened once ever AFAIK.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12580
Reviewed By: hx235
Differential Revision: D56541996
Pulled By: ajkr
fbshipit-source-id: 1eab17567db783c11054fa85dd8b8880eacd3a50
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.
This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403
Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior
Reviewed By: ajkr
Differential Revision: D54544824
Pulled By: anand1976
fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
Summary:
Issue https://github.com/facebook/rocksdb/issues/12421 describes a regression in the migration from CircleCI to GitHub Actions in which failing build steps no longer fail Windows CI jobs. In GHA with pwsh (new preferred powershell command), only the last non-builtin command (or something like that) affects the overall success/failure result, and failures in external commands do not exit the script, even with `$ErrorActionPreference = 'Stop'` and `$PSNativeCommandErrorActionPreference = $true`. Switching to `powershell` causes some obscure failure (not seen in CircleCI) about the `-Lo` option to `curl`.
Here we work around this using the only reasonable-but-ugly way known: explicitly check the result after every non-trivial build step. This leaves us highly susceptible to future regressions with unchecked build steps in the future, but a clean solution is not known.
This change also fixes the build errors that were allowed to creep in because of the CI regression. Also decreased the unnecessarily long running time of DBWriteTest.WriteThreadWaitNanosCounter.
For background, this problem explicitly contradicts GitHub's documentation, and GitHub has known about the problem for more than a year, with no evidence of caring or intending to fix. https://github.com/actions/runner-images/issues/6668 Somehow CircleCI doesn't have this problem. And even though cmd.exe and powershell have been perpetuating DOS-isms for decades, they still seem to be a somewhat active "hot mess" when it comes to sensible, consistent, and documented behavior.
Fixes https://github.com/facebook/rocksdb/issues/12421
A history of some things I tried in development is here: https://github.com/facebook/rocksdb/compare/main...pdillinger:rocksdb:debug_windows_ci_orig
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12426
Test Plan: CI, including https://github.com/facebook/rocksdb/issues/12434 where I have temporarily enabled other Windows builds on PR with this change
Reviewed By: cbi42
Differential Revision: D54903698
Pulled By: pdillinger
fbshipit-source-id: 116bcbebbbf98f347c7cf7dfdeebeaaed7f76827
Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration, point lookup breakdown etc. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e a RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay or stalls etc. It does not include time spent waiting to acquire mutexes, or waiting for IO.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368
Test Plan: Add a unit test for write_thread_wait_nanos
Reviewed By: ajkr
Differential Revision: D54021583
Pulled By: anand1976
fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.
Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686
Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`
Reviewed By: hx235
Differential Revision: D48176132
Pulled By: cbi42
fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
Summary:
This test exhibited the following flaky failure:
```
db/db_write_test.cc:653: Failure
db_->Resume()
Corruption: Not active
```
I was able to repro it by applying the following patch to coerce a specific race condition:
```
diff --git a/db/db_write_test.cc b/db/db_write_test.cc
index d82c57376..775ba3cde 100644
--- a/db/db_write_test.cc
+++ b/db/db_write_test.cc
@@ -636,6 +636,10 @@ TEST_P(DBWriteTest, LockWALInEffect) {
ASSERT_TRUE(dbfull()->WALBufferIsEmpty());
ASSERT_OK(db_->UnlockWAL());
+ // Test thread: sleep interval: [0, 3)
+ // In this interval, the file system is active
+ sleep(3);
+
// Fail the WAL flush if applicable
fault_fs->SetFilesystemActive(false);
Status s = Put("key2", "value");
@@ -649,6 +653,11 @@ TEST_P(DBWriteTest, LockWALInEffect) {
ASSERT_OK(db_->LockWAL());
ASSERT_OK(db_->UnlockWAL());
}
+
+ // Test thread: sleep interval: [3, 6)
+ // In this interval, the file system is inactive
+ sleep(3);
+
fault_fs->SetFilesystemActive(true);
ASSERT_OK(db_->Resume());
// Writes should work again
diff --git a/db/flush_job.cc b/db/flush_job.cc
index 8193f594f..602ee2c9f 100644
--- a/db/flush_job.cc
+++ b/db/flush_job.cc
@@ -979,6 +979,10 @@ Status FlushJob::WriteLevel0Table() {
DirFsyncOptions(DirFsyncOptions::FsyncReason::kNewFileSynced));
}
TEST_SYNC_POINT_CALLBACK("FlushJob::WriteLevel0Table", &mems_);
+ // Flush thread: sleep interval: [0, 4)
+ // Upon awakening, the file system will be inactive. Then the MANIFEST
+ // update will fail.
+ sleep(4);
db_mutex_->Lock();
}
base_->Unref();
```
The fix for this scenario is explained in the code change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11382
Reviewed By: cbi42
Differential Revision: D45027632
Pulled By: ajkr
fbshipit-source-id: 6bfa35a5781c0c080fb74e13f2b2c9f871f7effb
Summary:
In rare cases seeing failures like this
```
[ RUN ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/2
db/db_write_test.cc:653: Failure
Put("key3", "value")
Corruption: Not active
```
in a test with no explicit threading. This is likely because of the unpredictability of background auto-resume. I didn't really know this feature, in part because DB::Resume() was undocumented. So I believe I have fixed the test and documented the API function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11290
Test Plan: 1000s of stress runs of the test with gtest-parallel
Reviewed By: anand1976
Differential Revision: D43984583
Pulled By: pdillinger
fbshipit-source-id: d30dec120b4864e193751b2e33ff16834d313db3
Summary:
Example failure:
```
[ RUN ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/1
db/db_write_test.cc:646: Failure
Put("key3", "value")
Corruption: Not active
```
Presumably from a background compaction prior to Put.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11209
Test Plan: watch CI
Reviewed By: akankshamahajan15
Differential Revision: D43147727
Pulled By: pdillinger
fbshipit-source-id: a1c34ac5ab124bfe2f23205a30777990056e9082
Summary:
The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:
* Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
* Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
* Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
* Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
* Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
* Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
Test Plan: unit tests added / updated, added to stress/crash test
Reviewed By: ajkr
Differential Revision: D42848627
Pulled By: pdillinger
fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910
Reviewed By: riversand963
Differential Revision: D40880683
Pulled By: ajkr
fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
- As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
Summary:
WAL append and switch can both happen between `FlushWAL(true /* sync */)`'s sync operations and its call to `MarkLogsSynced()`. We permit this since locks need to be released for the sync operations. Such an appended/switched WAL is both inactive and incompletely synced at the time `MarkLogsSynced()` processes it.
Prior to this PR, `MarkLogsSynced()` assumed all inactive WALs were fully synced and removed them from consideration for future syncs. That was wrong in the scenario described above and led to the latest append(s) never being synced. This PR changes `MarkLogsSynced()` to only remove inactive WALs from consideration for which all flushed data has been synced.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10560
Test Plan: repro unit test for the scenario described above. Without this PR, it fails on "key2" not found
Reviewed By: riversand963
Differential Revision: D38957391
Pulled By: ajkr
fbshipit-source-id: da77175eba97ff251a4219b227b3bb2d4843ed26
Summary:
Resolves https://github.com/facebook/rocksdb/issues/9692
This PR adds a unit test that reproduces the race described in https://github.com/facebook/rocksdb/issues/9692 and an according fix.
The unit test does not have any assertions, because I could not find a reliable and save way to assert that the writers list does not form a cycle. So with the old (buggy) code, the test would simply hang, while with the fix the test passes successfully.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9944
Reviewed By: pdillinger
Differential Revision: D36134604
Pulled By: riversand963
fbshipit-source-id: ef636c5a79ddbef18658ab2f19ca9210a427324a
Summary:
`FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced.
The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.
An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.
```
Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185
Test Plan:
- repro unit test
- the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list"
Reviewed By: riversand963
Differential Revision: D37200993
Pulled By: ajkr
fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
Reviewed By: riversand963
Differential Revision: D32007222
Pulled By: anand1976
fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555
Reviewed By: ajkr
Differential Revision: D29758399
Pulled By: jay-zhuang
fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
Summary:
`BeginWriteStall()` removes no_slowdown write from the write
list and updates `link_newer`, which makes `CreateMissingNewerLinks()`
thought all write list has valid `link_newer` and failed to create link
for all writers.
It caused flaky test and SegFault for release build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7508
Test Plan: Add unittest to reproduce the issue.
Reviewed By: anand1976
Differential Revision: D24126601
Pulled By: jay-zhuang
fbshipit-source-id: f8ac5dba653f7ee1b0950296427d4f5f8ee34a06
Summary:
Make the test robust to spurious wakeups on condition variable,
and clear sync points to ensure no use-after-free.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7418
Test Plan: repeated runs on updated test, watch CircleCI for recurrence
Reviewed By: jay-zhuang
Differential Revision: D23828823
Pulled By: pdillinger
fbshipit-source-id: af85117d9c02602541a90252840e0e5a6996de5b
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274
Test Plan: Run all existing files.
Reviewed By: pdillinger
Differential Revision: D23177444
fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env
These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies. By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.
Tested both release and debug builds via Make and CMake for both static and shared libraries.
More work remains to clean up how the tools are built and remove some unnecessary dependencies. There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097
Reviewed By: riversand963
Differential Revision: D22463160
Pulled By: pdillinger
fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049
Test Plan: Run all existing files.
Reviewed By: pdillinger
Differential Revision: D22301700
fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565
Summary:
This reverts commit 8d87e9cea1.
Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923
Reviewed By: pdillinger
Differential Revision: D21864799
fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
When there is a write stall, the active write group leader calls ```BeginWriteStall()``` to walk the queue of writers and remove any with the ```no_slowdown``` option set. There was a bug in the code which updated the back pointer but not the forward pointer (```link_newer```), corrupting the list and causing some threads to wait forever. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6322
Test Plan: Add a unit test in db_write_test
Differential Revision: D19538313
Pulled By: anand1976
fbshipit-source-id: 6fbed819e594913f435886606f5d36f74f235c3a
Summary:
When we do concurrently writes, and different write operations will have WAL enable or disable.
But the data from write operation with WAL disabled will still be logged into log files, which will lead to extra disk write/sync since we do not want any guarantee for these part of data.
Detail can be found in https://github.com/facebook/rocksdb/issues/6280. This PR avoid mixing the two types in a write group. The advantage is simpler reasoning about the write group content
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6290
Differential Revision: D19448598
Pulled By: maysamyabandeh
fbshipit-source-id: 3d990a0f79a78ea1bfc90773f6ebafc1884c20de
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
Expose DB methods to lock and unlock the WAL.
These methods are intended to use by MyRocks in order to obtain WAL
coordinates in consistent way.
Usage scenario is following:
MySQL has performance_schema.log_status which provides information that
enables a backup tool to copy the required log files without locking for
the duration of copy. To populate this table MySQL does following:
1. Lock the binary log. Transactions are not allowed to commit now
2. Save the binary log coordinates
3. Walk through the storage engines and lock writes on each engine. For
InnoDB, redo log is locked. For MyRocks, WAL should be locked.
4. Ask storage engines for their coordinates. InnoDB reports its current
LSN and checkpoint LSN. MyRocks should report active WAL files names
and sizes.
5. Release storage engine's locks
6. Unlock binary log
Backup tool will then use this information to copy InnoDB, RocksDB and
MySQL binary logs up to specified positions to end up with consistent DB
state after restore.
Currently, RocksDB allows to obtain the list of WAL files. Only missing
bit is the method to lock the writes to WAL files.
LockWAL method must flush the WAL in order for the reported size to be
accurate (GetSortedWALFiles is using file system stat call to return the
file size), also, since backup tool is going to copy the WAL, it is
better to be flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146
Differential Revision: D14815447
Pulled By: maysamyabandeh
fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3
Summary:
There is a potential failure case in DBImpl::SwitchMemtable() that is not handled properly. The call to cur_log_writer->WriteBuffer() can fail due to an IO error. In that case, we need to call SetBGError() in order set the background error since the WriteBuffer() failure may result in data loss.
Also, the asserts for !new_mem and !new_log are incorrect, as those would have been allocated by the time this failure is detected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5072
Differential Revision: D14461384
Pulled By: anand1976
fbshipit-source-id: fb59bce9d61378f37d2dfcd28c0b704b0f43c3cf
Summary:
TSAN reports a false alarm for lock-order-inversion in DBWriteTest.IOErrorOnWALWritePropagateToWriteThreadFollower but Open and FlushWAL are not run concurrently. Suppressing the error by skipping FlushWAL in the test until TSAN is fixed.
The alternative would be to use
```
TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./db_write_test
```
but it does not seem straightforward to integrate it to our test infra.
Closes https://github.com/facebook/rocksdb/pull/3854
Differential Revision: D8000202
Pulled By: maysamyabandeh
fbshipit-source-id: fde33483d963a7ad84d3145123821f64960a4802
Summary:
Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests.
This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
Closes https://github.com/facebook/rocksdb/pull/3824
Differential Revision: D7909153
Pulled By: maysamyabandeh
fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce
Summary:
It was misnamed. It actually updates `bg_error_` if `PreprocessWrite()` or `WriteToWAL()` fail, not related to the user callback.
Closes https://github.com/facebook/rocksdb/pull/3485
Differential Revision: D6955787
Pulled By: ajkr
fbshipit-source-id: bd7afc3fdb7a52830c021cbfc25fcbc3ab7d5e10
Summary:
Previously we store sequence number range of each blob files, and use the sequence number range to check if the file can be possibly visible by a snapshot. But it adds complexity to the code, since the sequence number is only available after a write. (The current implementation get sequence number by calling GetLatestSequenceNumber(), which is wrong.) With the patch, we are not storing sequence number range, and check if snapshot_sequence < obsolete_sequence to decide if the file is visible by a snapshot (previously we check if first_sequence <= snapshot_sequence < obsolete_sequence).
Closes https://github.com/facebook/rocksdb/pull/3274
Differential Revision: D6571497
Pulled By: yiwu-arbug
fbshipit-source-id: ca06479dc1fcd8782f6525b62b7762cd47d61909
Summary:
This is a simpler version of #3097 by removing all unrelated changes.
Fixing the bug where concurrent writes may get Status::OK while it actually gets IOError on WAL write. This happens when multiple writes form a write batch group, and the leader get an IOError while writing to WAL. The leader failed to pass the error to followers in the group, and the followers end up returning Status::OK() while actually writing nothing. The bug only affect writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError.
Closes https://github.com/facebook/rocksdb/pull/3201
Differential Revision: D6421644
Pulled By: yiwu-arbug
fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e
Summary:
write_options.sync = true and write_options.disableWAL is incompatible. When WAL is disabled, we are not able to persist the write immediately. Return an error in this case to avoid misuse of the options.
Closes https://github.com/facebook/rocksdb/pull/3086
Differential Revision: D6176822
Pulled By: yiwu-arbug
fbshipit-source-id: 1eb10028c14fe7d7c13c8bc12c0ef659f75aa071
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345
Differential Revision: D5210732
Pulled By: maysamyabandeh
fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
Summary:
Make default impl return NoSupported so the db_blob
tests exist in a meaningful manner.
Replace std::thread to port::Thread
Closes https://github.com/facebook/rocksdb/pull/2465
Differential Revision: D5275563
Pulled By: yiwu-arbug
fbshipit-source-id: cedf1a18a2c05e20d768c1308b3f3224dbd70ab6