Basic RocksDB follower implementation (#12540)
Summary:
A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways -
1. It has its own directory with links to the leader's database
2. Periodically refreshes itself
3. (Future) Snapshot support
4. (Future) Garbage collection of obsolete links
5. (Long term) Memtable replication
There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive.
The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files.
TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540
Reviewed By: jowlyzhang
Differential Revision: D56315718
Pulled By: anand1976
fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b
2024-04-20 02:13:31 +00:00
|
|
|
// Copyright (c) 2024-present, Facebook, Inc. All rights reserved.
|
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
|
|
|
|
|
|
#pragma once
|
|
|
|
|
|
|
|
#include <string>
|
|
|
|
#include <vector>
|
|
|
|
|
|
|
|
#include "db/db_impl/db_impl.h"
|
|
|
|
#include "db/db_impl/db_impl_secondary.h"
|
|
|
|
#include "logging/logging.h"
|
|
|
|
#include "port/port.h"
|
|
|
|
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
|
|
|
|
class DBImplFollower : public DBImplSecondary {
|
|
|
|
public:
|
|
|
|
DBImplFollower(const DBOptions& db_options, std::unique_ptr<Env>&& env,
|
|
|
|
const std::string& dbname, std::string src_path);
|
|
|
|
~DBImplFollower();
|
|
|
|
|
|
|
|
Status Close() override;
|
|
|
|
|
|
|
|
protected:
|
|
|
|
bool OwnTablesAndLogs() const override {
|
|
|
|
// TODO: Change this to true once we've properly implemented file
|
|
|
|
// deletion for the read scaling case
|
2024-05-18 02:13:33 +00:00
|
|
|
return true;
|
Basic RocksDB follower implementation (#12540)
Summary:
A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways -
1. It has its own directory with links to the leader's database
2. Periodically refreshes itself
3. (Future) Snapshot support
4. (Future) Garbage collection of obsolete links
5. (Long term) Memtable replication
There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive.
The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files.
TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540
Reviewed By: jowlyzhang
Differential Revision: D56315718
Pulled By: anand1976
fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b
2024-04-20 02:13:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
Status Recover(const std::vector<ColumnFamilyDescriptor>& column_families,
|
|
|
|
bool /*readonly*/, bool /*error_if_wal_file_exists*/,
|
|
|
|
bool /*error_if_data_exists_in_wals*/,
|
|
|
|
bool /*is_retry*/ = false, uint64_t* = nullptr,
|
|
|
|
RecoveryContext* /*recovery_ctx*/ = nullptr,
|
|
|
|
bool* /*can_retry*/ = nullptr) override;
|
|
|
|
|
|
|
|
private:
|
|
|
|
friend class DB;
|
|
|
|
|
|
|
|
Status TryCatchUpWithLeader();
|
|
|
|
void PeriodicRefresh();
|
|
|
|
|
|
|
|
std::unique_ptr<Env> env_guard_;
|
|
|
|
std::unique_ptr<port::Thread> catch_up_thread_;
|
|
|
|
std::atomic<bool> stop_requested_;
|
|
|
|
std::string src_path_;
|
|
|
|
port::Mutex mu_;
|
|
|
|
port::CondVar cv_;
|
2024-05-18 02:13:33 +00:00
|
|
|
std::unique_ptr<std::list<uint64_t>::iterator> pending_outputs_inserted_elem_;
|
Basic RocksDB follower implementation (#12540)
Summary:
A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways -
1. It has its own directory with links to the leader's database
2. Periodically refreshes itself
3. (Future) Snapshot support
4. (Future) Garbage collection of obsolete links
5. (Long term) Memtable replication
There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive.
The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files.
TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540
Reviewed By: jowlyzhang
Differential Revision: D56315718
Pulled By: anand1976
fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b
2024-04-20 02:13:31 +00:00
|
|
|
};
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|