add EdenServer recovery step and recover after failed takeover data send handshake

Summary:
* This adds an `EdenServer::recover()` method to start back up after an unsuccessful takeover data send.
    * On an unsuccessful ping, fulfill the `shutdownPromise` with a `TakeoverSendError` containing the constructed `TakeoverData`. The `recover` function is then called; it resets `takeoverPromise_`, sets `takeoverShutdown` to `false`, and sets the `runningState_` to `RUNNING`.
    * Because recovery takes over from the returned `TakeoverData`, the user will not encounter `Transport not connected` errors on recovery.

* This adds an `EdenServer::closeStorage()` method to defer closing the `backingStore_` and `localStore_` until after our ready handshake is successful.
* This defers the shutdown of the `PrivHelper` until a successful ready handshake.

I also update the takeover documentation here with the new logic (and fix some formatting issues).
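
The retry behavior described in this summary can be sketched in isolation. The following is a minimal, synchronous model and not the EdenFS implementation: `FakeServer`, `FakeTakeoverData`, and `serveUntilCleanShutdown` are hypothetical names, and the real code uses `folly` futures rather than plain return values. It only illustrates the contract that `performCleanup()` returns `false` after a failed takeover, so the caller runs the server again.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

enum class RunState { STARTING, RUNNING, SHUTTING_DOWN };

// Hypothetical stand-in for TakeoverData; the real class holds much more.
struct FakeTakeoverData {
  std::string lockFile;
};

// Hypothetical model of a server whose first N takeover attempts fail.
class FakeServer {
 public:
  explicit FakeServer(int failingTakeovers) : failuresLeft_(failingTakeovers) {}

  // Pretend to serve until a takeover request arrives.
  void runServer() {
    assert(state_ != RunState::SHUTTING_DOWN);
    state_ = RunState::RUNNING;
    takeoverShutdown_ = true;
  }

  // Returns true on a clean shutdown, false if we recovered and must re-run.
  bool performCleanup() {
    state_ = RunState::SHUTTING_DOWN;
    std::optional<FakeTakeoverData> unsent = performTakeoverShutdown();
    if (unsent.has_value()) {
      recover(std::move(*unsent));
      return false;
    }
    return true;
  }

  RunState state() const { return state_; }
  bool takeoverShutdown() const { return takeoverShutdown_; }

 private:
  // std::nullopt means the data was sent; a value means it was not.
  std::optional<FakeTakeoverData> performTakeoverShutdown() {
    FakeTakeoverData data{"eden.lock"};
    if (failuresLeft_ > 0) {
      --failuresLeft_;
      return std::optional<FakeTakeoverData>(std::move(data));
    }
    return std::nullopt;
  }

  // Reclaim the resources and allow future takeover requests.
  void recover(FakeTakeoverData&& data) {
    lockFile_ = std::move(data.lockFile);
    takeoverShutdown_ = false;
    state_ = RunState::RUNNING;
  }

  int failuresLeft_;
  RunState state_{RunState::STARTING};
  bool takeoverShutdown_{false};
  std::string lockFile_;
};

// Mirrors the shape of the loop added to runEdenMain(); returns the run count.
int serveUntilCleanShutdown(FakeServer& server) {
  int runs = 0;
  while (true) {
    server.runServer();
    ++runs;
    if (server.performCleanup()) {
      break;
    }
  }
  return runs;
}
```

In this model, a server constructed with two failing takeovers is run three times before `serveUntilCleanShutdown` returns.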

Reviewed By: simpkins

Differential Revision: D20433433

fbshipit-source-id: f59e660922674d281957e80aee5049735b901a2c
Genevieve Helsel 2020-04-07 09:50:06 -07:00 committed by Facebook GitHub Bot
parent 8bb3b33f8a
commit 9944a5dff5
14 changed files with 285 additions and 99 deletions

@ -12,33 +12,42 @@ library, client, server, data, and handler.
### Thrift serialization library
There are two main message classes:
* `struct TakeoverVersionQuery` - A list of takeover data serialization versions
There are three main message classes that are exchanged over the takeover socket:
* `struct TakeoverVersionQuery` - A list of takeover data serialization versions
that the client supports
* `union SerializedTakeoverData` - A list of `SerializedMountInfo` or a string
* empty "ready" ping - An empty ping sent by the server to ensure the client is
still alive and ready to receive takeover data
* `union SerializedTakeoverData` - A list of `SerializedMountInfo` or a string
error.
** `struct SerializedMountInfo` - Contains the mount path, state directory, a
list of bind mount paths (which is no longer used), connection information, and
a `SerializedInodeMap`
** `struct SerializedInodeMapEntry` - contains inode information like
inodeNumber, parentInode, name, isUnlinked, numFuseReferences, hash, and mode.
** `struct SerializedInodeMap` - A list of `SerializedInodeMapEntry` unloaded
inodes
** `struct SerializedFileHandleMap` - currently empty
* `struct SerializedMountInfo` - Contains the mount path, state directory, a
list of bind mount paths (which is no longer used), connection information, and
a `SerializedInodeMap`
* `struct SerializedInodeMap` - A list of `SerializedInodeMapEntry` unloaded
inodes
* `struct SerializedInodeMapEntry` - contains inode information like
inodeNumber, parentInode, name, isUnlinked, numFuseReferences, hash,
and mode.
* `struct SerializedFileHandleMap` - currently empty
### Client
The client has one function - `takeoverMounts`. This function requests to take
over mount points from an existing edenfs process. On success, it returns a
`TakeoverData` object, and it throws an exception on error. It takes two
parameters: a socketPath, and a set of integers of supported takeover versions.
`TakeoverData` object, and it throws an exception on error. It takes three
parameters: a socketPath, a bool shouldPing, and a set of integers of supported
takeover versions. The last two parameters are for testing purposes and should
not be used in production builds.
This has a takeover timeout of 5 minutes for receiving takeover data from old
process.
We connect to the socket at the given path, then send our protocol
version so that the server knows whether we're capable of handshaking
successfully. We then wait for the takeover data response.
successfully. We then wait for the server to send us a "ready" ping, making sure
we are still listening on the socket. We respond to this ping and then wait for
the takeover data response. It is possible that we will not receive this ping,
and instead just receive the takeover data response.
After we get the takeover data response, we either throw an exception if we do
not get a message, or we deserialize the message and check its contents. We
@ -62,10 +71,15 @@ It has a few functions:
that the client process is from the same user ID, and that the client and
server support a compatible takeover protocol version. If the versions are
compatible, then the server starts to initiate shutdown by calling return
`server_->getTakeoverHandler()->startTakeoverShutdown()` Then, it sends the
takeover data over the takeover socket by serializing the information
(version, lock file, thrift socket, mount file descriptor) or error, and
sending it.
`server_->getTakeoverHandler()->startTakeoverShutdown()`. After the shutdown
is completed, the takeover server pings the takeover client to ensure it is
still waiting for the data. If the ping is unsuccessful (timeout, error, etc),
the takeover server stops the takeover process and returns the untransmitted
`TakeoverData` in an exception in order to let the `EdenServer` recover itself
and start serving again. Finally, it closes its storage (local and backing stores)
and sends the takeover data over the takeover socket by serializing the
information (version, lock file, thrift socket, mount file descriptor) or error,
and sending it.
* private functions:
* `connectionAccepted` - callback function for allocating a connection
handler when the server gets a client.
@ -90,11 +104,17 @@ graceful takeover functionality. This is primarily implemented by the
`EdenServer` class. However, there are also alternative implementations used
for unit testing.
It has one pure virtual function called `startTakeoverShutdown()`.
startTakeoverShutdown() will be called when a graceful shutdown has been
It has two pure virtual functions: `startTakeoverShutdown()` and `closeStorage()`.
`startTakeoverShutdown()` will be called when a graceful shutdown has been
requested, with a remote process attempting to take over the currently running
mount points.
When implemented, this should return a Future that will produce the
`TakeoverData` to send to the remote edenfs process once the edenfs process is
ready to transfer its mounts.
`closeStorage()` will be called before sending the `TakeoverData` to the client,
conditionally on a successful ready handshake (if applicable). This function should
close storage used by the server. In the case of an `EdenServer`, this function
allows for locks to be released in order for the new process to take over this storage.
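
As an illustration of the ordering described above, here is a minimal sketch under simplifying assumptions: it is synchronous, it does not use `folly` or `UnixSocket`, and `FakeTakeoverData`, `FakeTakeoverHandler`, `RecordingHandler`, and `pingThenSend` are hypothetical names rather than the real EdenFS API. It shows only that `closeStorage()` runs after a successful ready handshake and before the data is sent, and that a failed handshake hands the untransmitted data back to the caller with storage left open.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for TakeoverData.
struct FakeTakeoverData {
  std::vector<std::string> mountPaths;
};

// Hypothetical synchronous analogue of the TakeoverHandler interface.
class FakeTakeoverHandler {
 public:
  virtual ~FakeTakeoverHandler() = default;
  virtual FakeTakeoverData startTakeoverShutdown() = 0;
  virtual void closeStorage() = 0;
};

// Test-style handler that records whether storage was closed.
class RecordingHandler : public FakeTakeoverHandler {
 public:
  FakeTakeoverData startTakeoverShutdown() override {
    FakeTakeoverData data;
    data.mountPaths = {"/mnt/a", "/mnt/b"};
    return data;
  }
  void closeStorage() override {
    storageClosed = true;
  }
  bool storageClosed = false;
};

// Returns std::nullopt when the data was "sent" (appended to `wire`).
// Returns the untransmitted data when the client did not answer the ping,
// in which case storage stays open so the old server can keep serving.
std::optional<FakeTakeoverData> pingThenSend(
    FakeTakeoverHandler& handler,
    bool clientAnswersPing,
    std::vector<std::string>& wire) {
  FakeTakeoverData data = handler.startTakeoverShutdown();
  if (!clientAnswersPing) {
    return std::optional<FakeTakeoverData>(std::move(data));
  }
  // Handshake succeeded: release storage locks before handing over.
  handler.closeStorage();
  wire = std::move(data.mountPaths);
  return std::nullopt;
}
```

Returning the data through `std::optional` keeps the failure path explicit: the caller can tell "sent" apart from "not sent, here is your state back" without catching an exception.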

@ -94,6 +94,10 @@ std::string DefaultEdenMain::getLocalHostname() {
return getHostname();
}
void DefaultEdenMain::prepare(const EdenServer& /*server*/) {
fb303::registerFollyLoggingOptionHandlers();
}
void DefaultEdenMain::runServer(const EdenServer& server) {
// ThriftServer::serve() will drive the current thread's EventBase.
// Verify that we are being called from the expected thread, and will end up
@ -102,7 +106,6 @@ void DefaultEdenMain::runServer(const EdenServer& server) {
server.getMainEventBase(),
folly::EventBaseManager::get()->getEventBase());
fb303::registerFollyLoggingOptionHandlers();
fb303::withThriftFunctionStats(kServiceName, server.getHandler().get(), [&] {
server.getServer()->serve();
});
@ -289,8 +292,15 @@ int runEdenMain(EdenMain&& main, int argc, char** argv) {
DaemonStart{startTimeInSeconds, takeover, true /*success*/});
});
main.runServer(server.value());
server->performCleanup();
main.prepare(server.value());
while (true) {
main.runServer(server.value());
if (server->performCleanup()) {
break;
}
// performCleanup() returns false if a takeover shutdown attempt
// failed. Continue and re-run the server in this case.
}
XLOG(INFO) << "edenfs exiting successfully";
return EX_OK;

@ -25,6 +25,7 @@ class EdenMain {
virtual std::string getEdenfsBuildName() = 0;
virtual std::string getEdenfsVersion() = 0;
virtual std::string getLocalHostname() = 0;
virtual void prepare(const EdenServer& server) = 0;
virtual void runServer(const EdenServer& server) = 0;
};
@ -36,6 +37,7 @@ class DefaultEdenMain : public EdenMain {
virtual std::string getEdenfsBuildName() override;
virtual std::string getEdenfsVersion() override;
virtual std::string getLocalHostname() override;
virtual void prepare(const EdenServer& server) override;
virtual void runServer(const EdenServer& server) override;
};

@ -539,6 +539,43 @@ void EdenServer::scheduleInodeUnload(std::chrono::milliseconds timeout) {
},
timeout);
}
Future<Unit> EdenServer::recover(TakeoverData&& data) {
return recoverImpl(std::move(data))
.ensure(
// Mark the server state as RUNNING once we finish setting up the
// mount points. Even if an error occurs we still transition to the
// running state. Additionally, set the takeoverShutdown state to
// false in order to allow for future graceful restart requests.
[this] {
auto state = runningState_.wlock();
state->takeoverShutdown = false;
state->takeoverPromise = folly::Promise<TakeoverData>();
state->state = RunState::RUNNING;
});
}
Future<Unit> EdenServer::recoverImpl(TakeoverData&& takeoverData) {
auto thriftRunningFuture = createThriftServer();
const auto takeoverPath = edenDir_.getTakeoverSocketPath();
// Recover the eden lock file and the thrift server socket.
edenDir_.takeoverLock(std::move(takeoverData.lockFile));
server_->useExistingSocket(takeoverData.thriftSocket.release());
// Remount our mounts from our prepared takeoverData
std::vector<Future<Unit>> mountFutures;
mountFutures = prepareMountsTakeover(
std::make_unique<ForegroundStartupLogger>(),
std::move(takeoverData.mountPoints));
// Return a future that will complete only when all mount points have
// started and the thrift server is also running.
mountFutures.emplace_back(std::move(thriftRunningFuture));
return folly::collectAllUnsafe(mountFutures).unit();
}
#endif // !_WIN32
Future<Unit> EdenServer::prepare(
@ -841,9 +878,35 @@ void EdenServer::incrementStartupMountFailures() {
fb303::fbData->incrementCounter("startup_mount_failures");
}
void EdenServer::performCleanup() {
void EdenServer::closeStorage() {
// Destroy the local store and backing stores.
// We shouldn't access the local store any more after giving up our
// lock, and we need to close it to release its lock before the new
// edenfs process tries to open it.
backingStores_.wlock()->clear();
// Explicitly close the LocalStore
// Since we have a shared_ptr to it, other parts of the code can
// theoretically still maintain a reference to it after the EdenServer is
// destroyed. We want to ensure that it is really closed and no subsequent
// I/O can happen to it after the EdenServer is shut down and the main Eden
// lock is released.
localStore_->close();
}
bool EdenServer::performCleanup() {
bool takeover = false;
#ifndef _WIN32
bool takeover;
folly::stop_watch<> shutdown;
bool shutdownSuccess = true;
SCOPE_EXIT {
auto shutdownTimeInSeconds =
std::chrono::duration<double>{shutdown.elapsed()}.count();
serverState_->getStructuredLogger()->logEvent(
DaemonStop{shutdownTimeInSeconds, takeover, shutdownSuccess});
};
#endif
folly::File thriftSocket;
{
auto state = runningState_.wlock();
@ -853,82 +916,70 @@ void EdenServer::performCleanup() {
}
state->state = RunState::SHUTTING_DOWN;
}
folly::stop_watch<> shutdown;
auto shutdownFuture = takeover
? performTakeoverShutdown(std::move(thriftSocket))
: performNormalShutdown();
#else
auto shutdownFuture = performNormalShutdown();
#endif
: performNormalShutdown().thenValue([](auto&&) { return std::nullopt; });
// Drive the main event base until shutdownFuture completes
CHECK_EQ(mainEventBase_, folly::EventBaseManager::get()->getEventBase());
while (!shutdownFuture.isReady()) {
mainEventBase_->loopOnce();
}
auto&& shutdownResult = shutdownFuture.getTry();
#ifndef _WIN32
std::move(shutdownFuture)
.thenTry([shutdown,
takeover,
structuredLogger = serverState_->getStructuredLogger()](
folly::Try<Unit>&& result) {
auto shutdownTimeInSeconds =
std::chrono::duration<double>{shutdown.elapsed()}.count();
structuredLogger->logEvent(DaemonStop{
shutdownTimeInSeconds, takeover, !result.hasException()});
})
.get();
#else
std::move(shutdownFuture).get();
shutdownSuccess = !shutdownResult.hasException();
// We must check if the shutdownResult contains TakeoverData, and if so
// we must recover
if (shutdownResult.hasValue()) {
auto&& shutdownValue = shutdownResult.value();
if (shutdownValue.has_value()) {
// shutdownValue only contains a value if a takeover was not successful.
shutdownSuccess = false;
XLOG(INFO)
<< "edenfs encountered a takeover error, attempting to recover";
// We do not wait here for the remounts to succeed, and instead will
// let runServer() drive the mainEventBase loop to finish this call
(void)recover(std::move(shutdownValue).value());
return false;
}
}
#endif
// Explicitly close the LocalStore
// Since we have a shared_ptr to it, other parts of the code can theoretically
// still maintain a reference to it after the EdenServer is destroyed.
// We want to ensure that it is really closed and no subsequent I/O can happen
// to it after the EdenServer is shut down and the main Eden lock is released.
localStore_->close();
closeStorage();
// Stop the privhelper process.
shutdownPrivhelper();
shutdownResult.throwIfFailed();
return true;
}
Future<optional<TakeoverData>> EdenServer::performTakeoverShutdown(
folly::File thriftSocket) {
#ifndef _WIN32
Future<Unit> EdenServer::performTakeoverShutdown(folly::File thriftSocket) {
// stop processing new FUSE requests for the mounts,
return stopMountsForTakeover().thenValue(
[this,
socket = std::move(thriftSocket)](TakeoverData&& takeover) mutable {
// Destroy the local store and backing stores.
// We shouldn't access the local store any more after giving up our
// lock, and we need to close it to release its lock before the new
// edenfs process tries to open it.
backingStores_.wlock()->clear();
// Explicit close the LocalStore to ensure we release the RocksDB lock.
// Note that simply resetting the localStore_ pointer is insufficient,
// since there may still be other outstanding reference counts to the
// object.
localStore_->close();
// Stop the privhelper process.
shutdownPrivhelper();
takeover.lockFile = edenDir_.extractLock();
auto future = takeover.takeoverComplete.getFuture();
takeover.thriftSocket = std::move(socket);
takeoverPromise_.setValue(std::move(takeover));
runningState_.wlock()->takeoverPromise.setValue(std::move(takeover));
return future;
});
}
#else
NOT_IMPLEMENTED();
#endif // !_WIN32
}
Future<Unit> EdenServer::performNormalShutdown() {
#ifndef _WIN32
takeoverServer_.reset();
// Clean up all the server mount points before shutting down the privhelper.
return unmountAll().thenTry([this](folly::Try<Unit>&& result) {
shutdownPrivhelper();
result.throwIfFailed();
});
// Return an uninitialized optional here to avoid an attempted recovery
return unmountAll();
#else
NOT_IMPLEMENTED();
#endif // !_WIN32
@ -948,8 +999,6 @@ void EdenServer::shutdownPrivhelper() {
<< privhelperExitCode;
}
}
#else
NOT_IMPLEMENTED();
#endif
}
@ -1448,6 +1497,7 @@ folly::Future<TakeoverData> EdenServer::startTakeoverShutdown() {
// Make sure we aren't already shutting down, then update our state
// to indicate that we should perform mount point takeover shutdown
// once runServer() returns.
auto result = Future<TakeoverData>::makeEmpty();
{
auto state = runningState_.wlock();
if (state->state != RunState::RUNNING) {
@ -1478,14 +1528,15 @@ folly::Future<TakeoverData> EdenServer::startTakeoverShutdown() {
"error duplicating thrift server socket during graceful takeover");
state->takeoverThriftSocket =
folly::File{takeoverThriftSocket, /* ownsFd */ true};
result = state->takeoverPromise.getFuture();
}
shutdownSubscribers();
// Stop the thrift server. We will fulfill takeoverPromise_ once it
// Stop the thrift server. We will fulfill takeoverPromise once it
// stops.
server_->stop();
return takeoverPromise_.getFuture();
return result;
#else
NOT_IMPLEMENTED();
#endif // !_WIN32

@ -150,6 +150,20 @@ class EdenServer : private TakeoverHandler {
std::shared_ptr<StartupLogger> logger,
bool waitForMountCompletion = true);
#ifndef _WIN32
/**
* Recover the EdenServer after a failed takeover request.
*
* This function is very similar to prepare() implementation-wise,
* but uses a TakeoverData object from a failed takeover request
* to recover itself.
*
* This function resets the TakeoverServer, resets the takeoverPromise,
* sets takeoverShutdown to false, and sets the state to RUNNING
*/
FOLLY_NODISCARD folly::Future<folly::Unit> recover(TakeoverData&& data);
#endif // _WIN32
/**
* Shut down the EdenServer after it has stopped running.
*
@ -161,7 +175,12 @@ class EdenServer : private TakeoverHandler {
* Otherwise performCleanup() will unmount and shutdown all currently running
* mounts.
*/
void performCleanup();
bool performCleanup();
/**
* Close the backingStore and the localStore.
*/
void closeStorage() override;
/**
* Stops this server, which includes the underlying Thrift server.
@ -410,6 +429,13 @@ class EdenServer : private TakeoverHandler {
std::shared_ptr<StartupLogger> logger);
static void incrementStartupMountFailures();
#ifndef _WIN32
/**
* recoverImpl() contains the bulk of the implementation of recover()
*/
FOLLY_NODISCARD folly::Future<folly::Unit> recoverImpl(TakeoverData&& data);
#endif // !_WIN32
/**
* Create config file if this is the first time running the server, otherwise
* parse existing config file.
@ -435,8 +461,11 @@ class EdenServer : private TakeoverHandler {
std::optional<TakeoverData::MountInfo> takeover);
FOLLY_NODISCARD folly::Future<folly::Unit> performNormalShutdown();
FOLLY_NODISCARD folly::Future<folly::Unit> performTakeoverShutdown(
folly::File thriftSocket);
// If the takeover was successful, this returns std::nullopt. If a
// TakeoverData object is returned, that means the takeover attempt failed and
// the server should be resumed using the given TakeoverData.
FOLLY_NODISCARD folly::Future<std::optional<TakeoverData>>
performTakeoverShutdown(folly::File thriftSocket);
void shutdownPrivhelper();
// Starts up a new fuse mount for edenMount, starting up the thread
@ -536,8 +565,6 @@ class EdenServer : private TakeoverHandler {
* a graceful restart, taking over our running mount points.
*/
std::unique_ptr<TakeoverServer> takeoverServer_;
folly::Promise<TakeoverData> takeoverPromise_;
#endif // !_WIN32
/**
@ -548,6 +575,7 @@ class EdenServer : private TakeoverHandler {
RunState state{RunState::STARTING};
bool takeoverShutdown{false};
folly::File takeoverThriftSocket;
folly::Promise<TakeoverData> takeoverPromise;
};
folly::Synchronized<RunStateData> runningState_;

@ -32,6 +32,7 @@ namespace eden {
TakeoverData takeoverMounts(
AbsolutePathPiece socketPath,
bool shouldPing,
const std::set<int32_t>& supportedVersions) {
folly::EventBase evb;
folly::Expected<UnixSocket::Message, folly::exception_wrapper>
@ -56,16 +57,23 @@ TakeoverData takeoverMounts(
auto timeout = std::chrono::seconds(FLAGS_takeoverReceiveTimeout);
return socket.receive(timeout);
})
.thenValue([&socket](UnixSocket::Message&& msg) {
.thenValue([&socket, shouldPing](UnixSocket::Message&& msg) {
if (TakeoverData::isPing(&msg.data)) {
// Just send an empty message back here, the server knows it sent a
// ping so it does not need to parse the message.
UnixSocket::Message ping;
return socket.send(std::move(ping)).thenValue([&socket](auto&&) {
// Wait for the takeover data response
auto timeout = std::chrono::seconds(FLAGS_takeoverReceiveTimeout);
return socket.receive(timeout);
});
if (shouldPing) {
// Just send an empty message back here, the server knows it sent a
// ping so it does not need to parse the message.
UnixSocket::Message ping;
return socket.send(std::move(ping)).thenValue([&socket](auto&&) {
// Wait for the takeover data response
auto timeout = std::chrono::seconds(FLAGS_takeoverReceiveTimeout);
return socket.receive(timeout);
});
} else {
// This should only be hit during integration tests.
return folly::makeFuture<UnixSocket::Message>(
folly::exception_wrapper(std::runtime_error(
"ping received but should not respond")));
}
} else {
// Older versions of EdenFS will not send a "ready" ping and
// could simply send the takeover data.

@ -20,8 +20,9 @@ namespace eden {
*/
TakeoverData takeoverMounts(
AbsolutePathPiece socketPath,
// this parameter is present for testing purposes and should not normally
// be used in the production build.
// the following parameters are present for testing purposes and should not
// normally be used in the production build.
bool shouldPing = true,
const std::set<int32_t>& supportedTakeoverVersions =
kSupportedTakeoverVersions);

@ -153,7 +153,7 @@ class TakeoverData {
* The takeoverComplete promise will be fulfilled by the TakeoverServer code
* once the TakeoverData has been sent to the remote process.
*/
folly::Promise<folly::Unit> takeoverComplete;
folly::Promise<std::optional<TakeoverData>> takeoverComplete;
private:
/**

@ -38,6 +38,8 @@ class TakeoverHandler {
* its mounts.
*/
virtual folly::Future<TakeoverData> startTakeoverShutdown() = 0;
virtual void closeStorage() = 0;
};
} // namespace eden

@ -208,10 +208,15 @@ Future<Unit> TakeoverServer::ConnHandler::pingThenSendTakeoverData(
folly::Try<UnixSocket::Message>&& msg) mutable {
if (msg.hasException()) {
// If we got an exception on sending or receiving here, we should
// bubble up an exception and recover. It is important to mark the
// takeover as completed here so the promise fulfills and we
// continue the cleanup process inside of EdenServer
data.takeoverComplete.setException(msg.exception());
// bubble up an exception and recover.
// We must save the original takeoverComplete promise
// since we will move the TakeoverData into the takeoverComplete
// promise and the EdenServer waits on this to be fulfilled to
// determine whether to recover or not
auto takeoverPromise = std::move(data.takeoverComplete);
takeoverPromise.setValue(std::move(data));
return makeFuture<Unit>(msg.exception());
}
return sendTakeoverData(std::move(data));
@ -220,6 +225,11 @@ Future<Unit> TakeoverServer::ConnHandler::pingThenSendTakeoverData(
Future<Unit> TakeoverServer::ConnHandler::sendTakeoverData(
TakeoverData&& data) {
// Before sending the takeover data, we must close the server's
// local and backing store. This is important for ensuring the RocksDB
// lock is released so the client can take over.
server_->getTakeoverHandler()->closeStorage();
UnixSocket::Message msg;
try {
msg.data = data.serialize(protocolVersion_);
@ -240,7 +250,12 @@ Future<Unit> TakeoverServer::ConnHandler::sendTakeoverData(
return socket_.send(std::move(msg))
.thenTry([promise = std::move(data.takeoverComplete)](
folly::Try<Unit>&& sendResult) mutable {
promise.setTry(std::move(sendResult));
if (sendResult.hasException()) {
promise.setException(sendResult.exception());
} else {
// Set an uninitialized optional here to avoid an attempted recovery
promise.setValue(std::nullopt);
}
});
}

@ -42,6 +42,8 @@ class TestHandler : public TakeoverHandler {
return makeFuture(std::move(data_));
}
void closeStorage() override {}
private:
TakeoverData data_;
};
@ -55,6 +57,7 @@ class ErrorHandler : public TakeoverHandler {
return makeFuture<TakeoverData>(
std::logic_error("purposely failing for testing"));
}
void closeStorage() override {}
};
/**
@ -70,7 +73,9 @@ Future<TakeoverData> takeoverViaEventBase(
std::thread thread([path = AbsolutePath{socketPath},
supportedVersions,
promise = std::move(promise)]() mutable {
promise.setWith([&] { return takeoverMounts(path, supportedVersions); });
promise.setWith([&] {
return takeoverMounts(path, /*shouldPing=*/true, supportedVersions);
});
});
return future.via(evb).ensure(

@ -22,6 +22,11 @@ DEFINE_string(edenDir, "", "The path to the .eden directory");
*/
DEFINE_int32(takeoverVersion, 0, "The takeover version number to send");
DEFINE_bool(
shouldPing,
true,
"This is used by integration tests to avoid sending a ping");
FOLLY_INIT_LOGGING_CONFIG("eden=DBG2");
using namespace facebook::eden::path_literals;
@ -47,10 +52,11 @@ int main(int argc, char* argv[]) {
facebook::eden::TakeoverData data;
if (FLAGS_takeoverVersion == 0) {
data = facebook::eden::takeoverMounts(takeoverSocketPath);
data = facebook::eden::takeoverMounts(takeoverSocketPath, FLAGS_shouldPing);
} else {
auto takeoverVersion = std::set<int32_t>{FLAGS_takeoverVersion};
data = facebook::eden::takeoverMounts(takeoverSocketPath, takeoverVersion);
data = facebook::eden::takeoverMounts(
takeoverSocketPath, FLAGS_shouldPing, takeoverVersion);
}
for (const auto& mount : data.mountPoints) {
XLOG(INFO) << "mount " << mount.mountPath << ": fd=" << mount.fuseFD.fd();

@ -191,6 +191,17 @@ class EdenFS(object):
cmd.extend(args)
return cmd
def wait_for_is_healthy(self, timeout: float = 30) -> bool:
process = self._process
assert process is not None
health = util.wait_for_daemon_healthy(
proc=process,
config_dir=self._eden_dir,
get_client=lambda: self.get_thrift_client(),
timeout=timeout,
)
return health.is_healthy()
def start(
self,
timeout: float = 60,
@ -407,6 +418,27 @@ class EdenFS(object):
]
self.run_takeover_tool(cmd)
def takeover_without_ping_response(self) -> None:
"""
Execute a fake takeover to explicitly test a failed takeover. With the
--noshouldPing flag the takeover client does not respond to the ready ping,
so the subprocess call will throw, and we expect the old process
to recover.
"""
# pyre-ignore[9]: T38947910
cmd: List[str] = [
FindExe.TAKEOVER_TOOL,
"--edenDir",
str(self._eden_dir),
"--noshouldPing",
]
try:
subprocess.check_call(cmd)
except Exception:
# We expect the new process to fail starting.
pass
def add_repository(self, name: str, repo_path: str) -> None:
"""
Run "eden repository" to define a repository configuration

@ -304,6 +304,12 @@ class TakeoverTest(testcase.EdenRepoTest):
"""
self.eden.fake_takeover_with_version(3)
def test_takeover_failure(self) -> None:
print("=== beginning restart ===", file=sys.stderr)
self.eden.takeover_without_ping_response()
print("=== restart complete ===", file=sys.stderr)
self.assertTrue(self.eden.wait_for_is_healthy())
@testcase.eden_repo_test
class TakeoverRocksDBStressTest(testcase.EdenRepoTest):